:: Library Catalog

Bibliographic Details
Main Authors:	Long, Lingkun; Huang, Yushi; Bai, Shihao; Gong, Ruihao; Zhang, Jun; Zhou, Ao; Yang, Jianlei
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.02159

Similar Items

SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
by: Long, Lingkun, et al.
Published: (2025)

Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
by: Song, Yuerong, et al.
Published: (2025)

dLLM: Simple Diffusion Language Modeling
by: Zhou, Zhanhui, et al.
Published: (2026)

LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
by: Xu, Chenkai, et al.
Published: (2025)

Fast-dLLM v2: Efficient Block-Diffusion LLM
by: Wu, Chengyue, et al.
Published: (2025)

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
by: Liu, Zhiyuan, et al.
Published: (2025)

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
by: Wu, Chengyue, et al.
Published: (2025)

Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding
by: Xiao, Zhongyu, et al.
Published: (2026)

FocusLLM: Precise Understanding of Long Context by Dynamic Condensing
by: Li, Zhenyu, et al.
Published: (2024)

Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference
by: Huang, Jianuo, et al.
Published: (2025)

Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
by: Du, Zhenbang, et al.
Published: (2026)

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
by: Dzikanyanga, Gradwell, et al.
Published: (2026)

Squeezed Attention: Accelerating Long Context Length LLM Inference
by: Hooper, Coleman, et al.
Published: (2024)

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
by: Liu, Di, et al.
Published: (2024)

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
by: Huang, Yuxiang, et al.
Published: (2025)

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
by: Wu, Junyi, et al.
Published: (2026)

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)

AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
by: He, Zhuomin, et al.
Published: (2025)

LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
by: Lin, Gang, et al.
Published: (2026)

An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding
by: Wu, Tong, et al.
Published: (2024)

Reducing Distraction in Long-Context Language Models by Focused Learning
by: Wu, Zijun, et al.
Published: (2024)

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)

LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
by: Gu, Zhuohan, et al.
Published: (2024)

Unstructured Evidence Attribution for Long Context Query Focused Summarization
by: Wright, Dustin, et al.
Published: (2025)

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
by: Huang, Yushi, et al.
Published: (2025)

MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads
by: Liu, Weihao, et al.
Published: (2025)

Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation
by: Zhang, Hengran, et al.
Published: (2025)

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
by: Lu, Guanxi, et al.
Published: (2025)

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
by: Fu, Qichen, et al.
Published: (2024)

Accelerating Diffusion LLM Inference via Local Determinism Propagation
by: Kong, Fanheng, et al.
Published: (2025)

Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking
by: Zhang, Wuwei, et al.
Published: (2025)

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
by: Zhu, Qianchao, et al.
Published: (2024)

Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions
by: Hu, Taojun, et al.
Published: (2024)

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
by: Mai, Tho, et al.
Published: (2026)

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
by: Tang, Jiaming, et al.
Published: (2024)

Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
by: Tao, Wei, et al.
Published: (2025)

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
by: Xiao, Guangxuan, et al.
Published: (2024)

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
by: Pan, Xiurui, et al.
Published: (2024)

Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
by: Chen, Junyi, et al.
Published: (2025)