| Main Authors: | Mao, Weian, Lin, Xi, Huang, Wei, Xie, Yuxin, Fu, Tianfu, Zhuang, Bohan, Han, Song, Chen, Yukang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.04921 |
Similar Items
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
by: Chen, Yukang, et al.
Published: (2026)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
by: Chen, Zhuokun, et al.
Published: (2026)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)
LongFlow: Efficient KV Cache Compression for Reasoning Models
by: Su, Yi, et al.
Published: (2026)
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
by: Ji, Yicheng, et al.
Published: (2026)
LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
by: Zhang, Haoyue, et al.
Published: (2025)
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
by: Li, Xiaolong, et al.
Published: (2025)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
by: Liu, Akide, et al.
Published: (2024)
KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head
by: Rehg, Isaac
Published: (2024)
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
by: Wu, Wei, et al.
Published: (2024)
HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models
by: Zhuang, Shuhan, et al.
Published: (2025)
NestedKV: Nested Memory Routing for Long-Context KV Cache Compression
by: Chen, Hong, et al.
Published: (2026)
LongVLM: Efficient Long Video Understanding via Large Language Models
by: Weng, Yuetian, et al.
Published: (2024)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)
WeGeFT: Weight-Generative Fine-Tuning for Multi-Faceted Efficient Adaptation of Large Models
by: Savadikar, Chinmay, et al.
Published: (2023)
SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
by: Li, Mengjie, et al.
Published: (2025)
GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs
by: Peng, Junjie, et al.
Published: (2026)
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)
DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity
by: Hao, Jitai, et al.
Published: (2026)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)
EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
by: Zhou, Yuhao, et al.
Published: (2025)
Lossless KV Cache Compression to 2%
by: Yang, Zhen, et al.
Published: (2024)
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
by: Huang, Kai, et al.
Published: (2025)
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
by: Cai, Zefan, et al.
Published: (2025)
G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
by: Zhou, Xiabin, et al.
Published: (2024)
X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
by: Li, Guihong, et al.
Published: (2025)
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
by: Ji, Shiyu, et al.
Published: (2026)
ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution
by: Dong, Zican, et al.
Published: (2026)
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
by: Chen, Yukang, et al.
Published: (2023)
Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
by: Fei, Weizhi, et al.
Published: (2025)
Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling
by: Qin, Ziran, et al.
Published: (2025)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
by: Liu, Minghui, et al.
Published: (2025)
ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
by: Liu, Xin, et al.
Published: (2025)
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)
GS-VTON: Controllable 3D Virtual Try-on with Gaussian Splatting
by: Cao, Yukang, et al.
Published: (2024)
Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
by: Song, Jiwon, et al.
Published: (2025)
Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
by: Devoto, Alessio, et al.
Published: (2025)