| Main Authors: | Souibgui, Mohamed Ali, Fostier, Jan, Abadía-Heredia, Rodrigo, Denysenko, Bohdan, Marschke, Christian, Peric, Igor |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.22050 |
Similar Items
TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination
by: Naim, Omar, et al.
Published: (2025)
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
by: Zarch, Hossein Entezari, et al.
Published: (2025)
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)
High-Layer Attention Pruning with Rescaling
by: Liu, Songtao, et al.
Published: (2025)
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
by: Fu, Zizhuo, et al.
Published: (2026)
Paying Attention to Facts: Quantifying the Knowledge Capacity of Attention Layers
by: Wong, Liang Ze
Published: (2025)
Multi-Layer Attention is the Amplifier of Demonstration Effectiveness
by: Wang, Dingzirui, et al.
Published: (2025)
DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
by: Tan, Zhen, et al.
Published: (2024)
Depth-Wise Attention (DWAtt): A Layer Fusion Method for Data-Efficient Classification
by: ElNokrashy, Muhammad, et al.
Published: (2022)
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
by: Zarch, Hossein Entezari, et al.
Published: (2025)
An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction
by: Hamdi, Ali, et al.
Published: (2025)
When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
by: Sanyal, Sunny, et al.
Published: (2024)
LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
by: Kapadia, Shashank, et al.
Published: (2026)
CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
by: McDanel, Bradley, et al.
Published: (2026)
Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
by: Musat, Tiberiu
Published: (2024)
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
by: Shi, Zhenmei, et al.
Published: (2024)
A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models
by: Gu, Jian, et al.
Published: (2024)
DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies
by: Yang, Ning, et al.
Published: (2025)
AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers
by: Achtibat, Reduan, et al.
Published: (2024)
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
by: Bozic, Vukasin, et al.
Published: (2023)
Dr.LLM: Dynamic Layer Routing in LLMs
by: Heakl, Ahmed, et al.
Published: (2025)
Not All Layers of LLMs Are Necessary During Inference
by: Fan, Siqi, et al.
Published: (2024)
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
by: Brandon, William, et al.
Published: (2024)
Wave-PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention
by: Vejendla, Harshil
Published: (2025)
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)
Efficient LLM Moderation with Multi-Layer Latent Prototypes
by: Chrabąszcz, Maciej, et al.
Published: (2025)
Calibration Across Layers: Understanding Calibration Evolution in LLMs
by: Joshi, Abhinav, et al.
Published: (2025)
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
by: Jiang, Jingzhou, et al.
Published: (2026)
Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs
by: Yang, Zhipeng, et al.
Published: (2025)
A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs
by: Goel, Raghavv, et al.
Published: (2026)
Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features
by: Bombari, Simone, et al.
Published: (2024)
Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
by: Filipek, Adam
Published: (2025)
Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation
by: Song, Zhuo-Yang, et al.
Published: (2025)
Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge
by: Chen, Yan-Lun, et al.
Published: (2025)
TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
by: Hu, Lanxiang, et al.
Published: (2024)
Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
by: Elgabry, Menna, et al.
Published: (2025)
Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness
by: Jelenić, Fran, et al.
Published: (2023)
Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation
by: Ehab, Mohamed, et al.
Published: (2026)
Iterative Layer-wise Distillation for Efficient Compression of Large Language Models
by: Kovalev, Grigory, et al.
Published: (2025)
Towards Building Efficient Sentence BERT Models using Layer Pruning
by: Shelke, Anushka, et al.
Published: (2024)