:: Library Catalog

Bibliographic Details
Main Authors:	Li, Wenhao; Zhang, Yuxin; Luo, Gen; Wan, Haiyuan; Gong, Ziyang; Chao, Fei; Ji, Rongrong
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2508.19740

Similar Items

Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
by: Li, Wenhao, et al.
Published: (2026)

Training Long-Context LLMs Efficiently via Chunk-wise Optimization
by: Li, Wenhao, et al.
Published: (2025)

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)

HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing
by: Liu, Minghui, et al.
Published: (2024)

Towards Efficient Automatic Self-Pruning of Large Language Models
by: Huang, Weizhong, et al.
Published: (2025)

FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference
by: Wang, Dongwei, et al.
Published: (2025)

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)

AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)

Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study
by: Zhao, Lirui, et al.
Published: (2023)

RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
by: Zhao, Zhan, et al.
Published: (2026)

Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)

HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image Retrieval
by: He, Chao, et al.
Published: (2024)

G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)

Motion-Aware Caching for Efficient Autoregressive Video Generation
by: Xu, Jing, et al.
Published: (2026)

ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)

Beyond KV Caching: Shared Attention for Efficient LLMs
by: Liao, Bingli, et al.
Published: (2024)

LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
by: Wu, Wenbo, et al.
Published: (2025)

ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
by: Huang, Zhaohong, et al.
Published: (2026)

Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval
by: Xu, Binxiao, et al.
Published: (2025)

Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle
by: Wang, Zihan, et al.
Published: (2026)

ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
by: Wang, Shihao, et al.
Published: (2026)

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
by: Ji, Yicheng, et al.
Published: (2026)

FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
by: Takbir, Nazmul, et al.
Published: (2025)

SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
by: Gong, Ziyang, et al.
Published: (2025)

EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
by: Ji, Shiyu, et al.
Published: (2026)

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
by: Dehghankar, Mohsen, et al.
Published: (2026)

Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference
by: Feng, Yiding, et al.
Published: (2026)

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)

ToolCaching: Towards Efficient Caching for LLM Tool-calling
by: Zhai, Yi, et al.
Published: (2026)

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
by: Xu, Yichun, et al.
Published: (2026)

VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
by: Wang, Zhican, et al.
Published: (2025)

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
by: Chen, Yilong, et al.
Published: (2025)

Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
by: Li, Hanchen, et al.
Published: (2025)

EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2024)

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)