| Main Authors: | Li, Wenhao, Zhang, Yuxin, Luo, Gen, Wan, Haiyuan, Gong, Ziyang, Chao, Fei, Ji, Rongrong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.19740 |
Similar Items
Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
by: Li, Wenhao, et al.
Published: (2026)
Training Long-Context LLMs Efficiently via Chunk-wise Optimization
by: Li, Wenhao, et al.
Published: (2025)
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)
HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing
by: Liu, Minghui, et al.
Published: (2024)
Towards Efficient Automatic Self-Pruning of Large Language Models
by: Huang, Weizhong, et al.
Published: (2025)
FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference
by: Wang, Dongwei, et al.
Published: (2025)
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)
AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)
Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study
by: Zhao, Lirui, et al.
Published: (2023)
RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
by: Zhao, Zhan, et al.
Published: (2026)
Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)
HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image Retrieval
by: He, Chao, et al.
Published: (2024)
G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)
Motion-Aware Caching for Efficient Autoregressive Video Generation
by: Xu, Jing, et al.
Published: (2026)
ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)
Beyond KV Caching: Shared Attention for Efficient LLMs
by: Liao, Bingli, et al.
Published: (2024)
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
by: Wu, Wenbo, et al.
Published: (2025)
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
by: Huang, Zhaohong, et al.
Published: (2026)
Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval
by: Xu, Binxiao, et al.
Published: (2025)
Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle
by: Wang, Zihan, et al.
Published: (2026)
ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
by: Wang, Shihao, et al.
Published: (2026)
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
by: Ji, Yicheng, et al.
Published: (2026)
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
by: Takbir, Nazmul, et al.
Published: (2025)
SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
by: Gong, Ziyang, et al.
Published: (2025)
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
by: Ji, Shiyu, et al.
Published: (2026)
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
by: Dehghankar, Mohsen, et al.
Published: (2026)
Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference
by: Feng, Yiding, et al.
Published: (2026)
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)
ToolCaching: Towards Efficient Caching for LLM Tool-calling
by: Zhai, Yi, et al.
Published: (2026)
KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
by: Xu, Yichun, et al.
Published: (2026)
VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
by: Wang, Zhican, et al.
Published: (2025)
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)
StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
by: Chen, Yilong, et al.
Published: (2025)
Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
by: Li, Hanchen, et al.
Published: (2025)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2024)
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)