:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Dong; Yu, Yanxuan
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.11920

Similar Items

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
by: Shukla, Shikhar
Published: (2026)

PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)

TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
by: Tiwari, Rishabh, et al.
Published: (2025)

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)

Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
by: Kampeas, Joseph, et al.
Published: (2026)

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
by: Zhong, Zhiqing, et al.
Published: (2026)

TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale
by: Yoon, Dongha, et al.
Published: (2025)

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
by: Guo, Yipin, et al.
Published: (2026)

EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)

FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling
by: Li, Weiqing, et al.
Published: (2025)

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
by: Chen, Kaiwen, et al.
Published: (2025)

ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
by: Liu, Guangda, et al.
Published: (2024)

DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
by: Liu, Yuhan, et al.
Published: (2024)

Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
by: Kim, Dowon, et al.
Published: (2025)

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
by: Cai, Zefan, et al.
Published: (2024)

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)

Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
by: Xia, Tianhua, et al.
Published: (2025)

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
by: Dehghanighobadi, Zahra, et al.
Published: (2026)

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
by: Cai, Zefan, et al.
Published: (2025)

CoKV: Optimizing KV Cache Allocation via Cooperative Game
by: Sun, Qiheng, et al.
Published: (2025)

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
by: Geng, Yingsheng, et al.
Published: (2026)

PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
by: Jiang, Bo, et al.
Published: (2025)

LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
by: Xiong, Yi, et al.
Published: (2024)

Comparative Characterization of KV Cache Management Strategies for LLM Inference
by: Mamo, Oteo, et al.
Published: (2026)

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)

The Pitfalls of KV Cache Compression
by: Chen, Alex, et al.
Published: (2025)

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
by: Liu, Sihao, et al.
Published: (2026)

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)

G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
by: Chen, Chuangtao, et al.
Published: (2026)

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity
by: Hao, Jitai, et al.
Published: (2026)

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)