| Main Authors: | Xu, Yixing, Nag, Shivank, Li, Dong, Tian, Lu, Barsoum, Emad |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2501.01039 |
Similar Items
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
by: Li, Jinze, et al.
Published: (2025)
Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
by: Huang, Haiduo, et al.
Published: (2025)
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
by: Liao, Huanxuan, et al.
Published: (2025)
Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation
by: Lee, Nakyung, et al.
Published: (2025)
Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference
by: Li, Zeping, et al.
Published: (2024)
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
by: An, Zihao, et al.
Published: (2026)
Partial Convolution Meets Visual Attention
by: Huang, Haiduo, et al.
Published: (2025)
AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
by: Ray, Pretam, et al.
Published: (2026)
Sliding Window Attention Training for Efficient Large Language Models
by: Fu, Zichuan, et al.
Published: (2025)
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
by: Mishra, Prakamya, et al.
Published: (2025)
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
by: Wang, Shuai, et al.
Published: (2025)
Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
by: Bae, Jeongin, et al.
Published: (2026)
CLAWS:Creativity detection for LLM-generated solutions using Attention Window of Sections
by: Kim, Keuntae, et al.
Published: (2025)
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
by: Lee, Hanna, et al.
Published: (2026)
DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models
by: Ke, Wenjin, et al.
Published: (2025)
SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing
by: Yu, Yijiong, et al.
Published: (2025)
Efficient Context Scaling with LongCat ZigZag Attention
by: Zhang, Chen, et al.
Published: (2025)
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
by: Goldstein, Daniel, et al.
Published: (2025)
ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting
by: Tian, Yuxing, et al.
Published: (2026)
Scaling Reasoning without Attention
by: Zhao, Xueliang, et al.
Published: (2025)
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
by: Wang, Jianghui, et al.
Published: (2025)
Attention-Aligned Reasoning for Large Language Models
by: Zhang, Hongxiang, et al.
Published: (2025)
Self-Selected Attention Span for Accelerating Large Language Model Inference
by: Jin, Tian, et al.
Published: (2024)
Reflection-Window Decoding: Text Generation with Selective Refinement
by: Tang, Zeyu, et al.
Published: (2025)
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
by: An, Wenbin, et al.
Published: (2024)
When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
by: Shamsi, Zafir, et al.
Published: (2026)
CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending
by: Zhu, Shiyi, et al.
Published: (2023)
ReAttention: Training-Free Infinite Context with Finite Attention Scope
by: Liu, Xiaoran, et al.
Published: (2024)
Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)
Block-Attention for Efficient Prefilling
by: Ma, Dongyang, et al.
Published: (2024)
Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment
by: Han, Hong, et al.
Published: (2026)
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
by: Cheng, Zhen, et al.
Published: (2026)
LoRA-Mini : Adaptation Matrices Decomposition and Selective Training
by: Singh, Ayush, et al.
Published: (2024)
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
by: Xiao, Yang, et al.
Published: (2025)
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
by: Chen, Hao, et al.
Published: (2025)
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
by: Huang, Yuxiang, et al.
Published: (2026)
Stacked from One: Multi-Scale Self-Injection for Context Window Extension
by: Han, Wei, et al.
Published: (2026)
LongHeads: Multi-Head Attention is Secretly a Long Context Processor
by: Lu, Yi, et al.
Published: (2024)
Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates
by: Xu, Yixing, et al.
Published: (2025)
LLM as Attention-Informed NTM and Topic Modeling as long-input Generation: Interpretability and long-Context Capability
by: Xu, Xuan, et al.
Published: (2025)