:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Guangda; Li, Chengwei; Zhao, Jieru; Zhang, Chenqi; Guo, Minyi
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Performance
Online Access:	https://arxiv.org/abs/2412.03213

Similar Items

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)

Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
by: Zhang, Hang, et al.
Published: (2025)

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
by: Zhao, Youpeng, et al.
Published: (2024)

Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
by: Barad, Haim, et al.
Published: (2023)

Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
by: Liu, Minghui, et al.
Published: (2025)

OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
by: Ma, Xinyue, et al.
Published: (2026)

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
by: Zhou, Zhongzhu, et al.
Published: (2026)

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)

Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models
by: Sai, Vishnu, et al.
Published: (2026)

HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing
by: Liu, Minghui, et al.
Published: (2024)

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
by: Ning, Zhenyu, et al.
Published: (2024)

GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
by: Taneja, Maanas, et al.
Published: (2026)

The Pitfalls of KV Cache Compression
by: Chen, Alex, et al.
Published: (2025)

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
by: Liu, Andy Zeyi, et al.
Published: (2026)

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)

EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)

ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
by: Yan, Xianglong, et al.
Published: (2025)

EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training
by: Yi, Qingao, et al.
Published: (2025)

KVSculpt: KV Cache Compression as Distillation
by: Jiang, Bo, et al.
Published: (2026)

CoKV: Optimizing KV Cache Allocation via Cooperative Game
by: Sun, Qiheng, et al.
Published: (2025)

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
by: Bergach, Mohamed Amine
Published: (2026)

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
by: Chen, Chuangtao, et al.
Published: (2026)

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
by: Liu, Sihao, et al.
Published: (2026)

OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)

Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
by: Zhang, Te, et al.
Published: (2025)

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
by: Swain, Kabir, et al.
Published: (2026)

Palu: Compressing KV-Cache with Low-Rank Projection
by: Chang, Chi-Chih, et al.
Published: (2024)

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
by: Dong, Harry, et al.
Published: (2024)

ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection
by: Datta, Debajyoti, et al.
Published: (2026)

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
by: Li, Yubo, et al.
Published: (2026)

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)