:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Huawei; Xia, Chunwei; Wang, Zheng
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.11907

Similar Items

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)

FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
by: Zhao, Bingzhe, et al.
Published: (2025)

Leyline: KV Cache Directives for Agentic Inference
by: Ma, Bole, et al.
Published: (2026)

FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling
by: Li, Weiqing, et al.
Published: (2025)

PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
by: Jiang, Bo, et al.
Published: (2025)

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
by: Lei, Jianlong, et al.
Published: (2026)

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
by: Jiang, Bo, et al.
Published: (2025)

A Survey on Large Language Model Acceleration based on KV Cache Management
by: Li, Haoyang, et al.
Published: (2024)

PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)

DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones
by: Wang, Tuowei, et al.
Published: (2025)

ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
by: Rhee, Myunghyun, et al.
Published: (2025)

SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference
by: Zhang, Ziyang, et al.
Published: (2025)

KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
by: Wang, Jiahao, et al.
Published: (2025)

KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)

PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)

Inference Offloading for Cost-Sensitive Binary Classification at the Edge
by: Moothedath, Vishnu Narayanan, et al.
Published: (2025)

A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing
by: Wang, Xin, et al.
Published: (2025)

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

Joint Resource Optimization, Computation Offloading and Resource Slicing for Multi-Edge Traffic-Cognitive Networks
by: Xiaoyang, Ting, et al.
Published: (2024)

TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
by: Papaioannou, Konstantinos, et al.
Published: (2026)

MatKV: Trading Compute for Flash Storage in LLM Inference
by: Shin, Kun-Woo, et al.
Published: (2025)

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
by: Yildiz, Mert, et al.
Published: (2026)

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
by: Yuan, Yichao, et al.
Published: (2026)

Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA
by: Li, Allison, et al.
Published: (2025)

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)

Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)

InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
by: Wang, Shiju, et al.
Published: (2025)

InstGenIE: Generative Image Editing Made Efficient with Mask-aware Caching and Scheduling
by: Jiang, Xiaoxiao, et al.
Published: (2025)

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
by: Zhou, Zhongzhu, et al.
Published: (2026)

Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
by: Deshmukh, Dhruv, et al.
Published: (2025)

Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
by: Zhu, Yue, et al.
Published: (2025)

TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
by: Xie, Jincheng, et al.
Published: (2026)

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)