| Main Authors: | Long, Lingkun, Huang, Yushi, Bai, Shihao, Gong, Ruihao, Zhang, Jun, Zhou, Ao, Yang, Jianlei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.02159 |
Similar Items
SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
by: Long, Lingkun, et al.
Published: (2025)
Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
by: Song, Yuerong, et al.
Published: (2025)
dLLM: Simple Diffusion Language Modeling
by: Zhou, Zhanhui, et al.
Published: (2026)
LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
by: Xu, Chenkai, et al.
Published: (2025)
Fast-dLLM v2: Efficient Block-Diffusion LLM
by: Wu, Chengyue, et al.
Published: (2025)
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
by: Liu, Zhiyuan, et al.
Published: (2025)
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
by: Wu, Chengyue, et al.
Published: (2025)
Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding
by: Xiao, Zhongyu, et al.
Published: (2026)
FocusLLM: Precise Understanding of Long Context by Dynamic Condensing
by: Li, Zhenyu, et al.
Published: (2024)
Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference
by: Huang, Jianuo, et al.
Published: (2025)
Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
by: Du, Zhenbang, et al.
Published: (2026)
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
by: Dzikanyanga, Gradwell, et al.
Published: (2026)
Squeezed Attention: Accelerating Long Context Length LLM Inference
by: Hooper, Coleman, et al.
Published: (2024)
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
by: Liu, Di, et al.
Published: (2024)
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
by: Huang, Yuxiang, et al.
Published: (2025)
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
by: Wu, Junyi, et al.
Published: (2026)
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
by: He, Zhuomin, et al.
Published: (2025)
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
by: Lin, Gang, et al.
Published: (2026)
An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding
by: Wu, Tong, et al.
Published: (2024)
Reducing Distraction in Long-Context Language Models by Focused Learning
by: Wu, Zijun, et al.
Published: (2024)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)
LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
by: Gu, Zhuohan, et al.
Published: (2024)
Unstructured Evidence Attribution for Long Context Query Focused Summarization
by: Wright, Dustin, et al.
Published: (2025)
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
by: Huang, Yushi, et al.
Published: (2025)
MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads
by: Liu, Weihao, et al.
Published: (2025)
Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation
by: Zhang, Hengran, et al.
Published: (2025)
AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
by: Lu, Guanxi, et al.
Published: (2025)
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
by: Fu, Qichen, et al.
Published: (2024)
Accelerating Diffusion LLM Inference via Local Determinism Propagation
by: Kong, Fanheng, et al.
Published: (2025)
Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking
by: Zhang, Wuwei, et al.
Published: (2025)
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
by: Zhu, Qianchao, et al.
Published: (2024)
Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions
by: Hu, Taojun, et al.
Published: (2024)
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
by: Mai, Tho, et al.
Published: (2026)
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
by: Tang, Jiaming, et al.
Published: (2024)
Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
by: Tao, Wei, et al.
Published: (2025)
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
by: Xiao, Guangxuan, et al.
Published: (2024)
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
by: Pan, Xiurui, et al.
Published: (2024)
Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
by: Chen, Junyi, et al.
Published: (2025)