| Main Authors: | Shen, Yingtao, Zou, An |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.18396 |
Similar Items
NEAT: Neuron-Based Early Exit for Large Reasoning Models
by: Liu, Kang, et al.
Published: (2026)
Effectively Compress KV Heads for LLM
by: Yu, Hao, et al.
Published: (2024)
Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs
by: Zhu, Jiace, et al.
Published: (2024)
A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
by: Wu, You, et al.
Published: (2024)
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)
SpecExit: Accelerating Large Reasoning Model via Speculative Exit
by: Yang, Rubing, et al.
Published: (2025)
Accelerating Large Language Model Inference with Self-Supervised Early Exits
by: Valade, Florian
Published: (2024)
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)
Mitigating KV Cache Competition to Enhance User Experience in LLM Inference
by: Shen, Haiying, et al.
Published: (2025)
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
by: Wu, Haoyi, et al.
Published: (2024)
DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
by: Liu, Yuhan, et al.
Published: (2024)
RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
by: Jiang, Tanqiu, et al.
Published: (2024)
AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)
Defending against Jailbreak through Early Exit Generation of Large Language Models
by: Zhao, Chongwen, et al.
Published: (2024)
Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
by: Lin, Qika, et al.
Published: (2025)
KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
by: Yang, Jingbo, et al.
Published: (2025)
A Method for Building Large Language Models with Predefined KV Cache Capacity
by: Yi, Zhonghua, et al.
Published: (2024)
KV Shifting Attention Enhances Language Modeling
by: Xu, Mingyu, et al.
Published: (2024)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
BriLLM: Brain-inspired Large Language Model
by: Zhao, Hai, et al.
Published: (2025)
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
by: Chen, Qian, et al.
Published: (2025)
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
by: Kumar, Avinash, et al.
Published: (2025)
Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments
by: Lu, Qingyu, et al.
Published: (2025)
SimLens for Early Exit in Large Language Models: Eliciting Accurate Latent Predictions with One More Token
by: Ma, Ming, et al.
Published: (2025)
Seamless Deception: Larger Language Models Are Better Knowledge Concealers
by: Ashok, Dhananjay, et al.
Published: (2026)
WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models
by: Yuan, Jian, et al.
Published: (2025)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
by: Wu, Qiong, et al.
Published: (2024)
dKV-Cache: The Cache for Diffusion Language Models
by: Ma, Xinyin, et al.
Published: (2025)
EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models
by: Pan, Xuchen, et al.
Published: (2024)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)
EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration
by: Shah, Soham, et al.
Published: (2025)
Beyond KV Caching: Shared Attention for Efficient LLMs
by: Liao, Bingli, et al.
Published: (2024)
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
by: Fang, Qingkai, et al.
Published: (2024)
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
by: Luo, Yulin, et al.
Published: (2024)
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
by: Liu, Ruikang, et al.
Published: (2024)
Dynamic Early Exit in Reasoning Models
by: Yang, Chenxu, et al.
Published: (2025)
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)
AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models
by: Su, Zunhai, et al.
Published: (2025)
Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization
by: Shan, Weiqiao, et al.
Published: (2024)
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
by: Ji, Shiyu, et al.
Published: (2026)