| Main authors: | Liu, Guangda; Li, Chengwei; Zhao, Jieru; Zhang, Chenqi; Guo, Minyi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online access: | https://arxiv.org/abs/2412.03213 |
Similar items
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
by: Zhang, Hang, et al.
Published: (2025)
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
by: Zhao, Youpeng, et al.
Published: (2024)
Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
by: Barad, Haim, et al.
Published: (2023)
Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
by: Liu, Minghui, et al.
Published: (2025)
OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
by: Ma, Xinyue, et al.
Published: (2026)
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
by: Zhou, Zhongzhu, et al.
Published: (2026)
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)
Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models
by: Sai, Vishnu, et al.
Published: (2026)
HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing
by: Liu, Minghui, et al.
Published: (2024)
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)
Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
by: Ning, Zhenyu, et al.
Published: (2024)
GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
by: Taneja, Maanas, et al.
Published: (2026)
The Pitfalls of KV Cache Compression
by: Chen, Alex, et al.
Published: (2025)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
by: Liu, Andy Zeyi, et al.
Published: (2026)
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
by: Yan, Xianglong, et al.
Published: (2025)
EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training
by: Yi, Qingao, et al.
Published: (2025)
KVSculpt: KV Cache Compression as Distillation
by: Jiang, Bo, et al.
Published: (2026)
CoKV: Optimizing KV Cache Allocation via Cooperative Game
by: Sun, Qiheng, et al.
Published: (2025)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
by: Bergach, Mohamed Amine
Published: (2026)
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
by: Chen, Chuangtao, et al.
Published: (2026)
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
by: Liu, Sihao, et al.
Published: (2026)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)
Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
by: Zhang, Te, et al.
Published: (2025)
Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
by: Swain, Kabir, et al.
Published: (2026)
Palu: Compressing KV-Cache with Low-Rank Projection
by: Chang, Chi-Chih, et al.
Published: (2024)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
by: Dong, Harry, et al.
Published: (2024)
ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection
by: Datta, Debajyoti, et al.
Published: (2026)
CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
by: Li, Yubo, et al.
Published: (2026)
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)