| Main Authors: | Wang, Yixuan, Liu, Yijun, Ji, Shiyu, Xu, Yuzhuang, Xu, Yang, Zhu, Qingfu, Che, Wanxiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.18629 |
Similar Items
CRVQ: Channel-Relaxed Vector Quantization for Extreme Compression of LLMs
by: Xu, Yuzhuang, et al.
Published: (2024)
CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
by: Xu, Yuzhuang, et al.
Published: (2025)
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
by: Wang, Yixuan, et al.
Published: (2025)
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
by: Liu, Yijun, et al.
Published: (2025)
Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
by: Wang, Yixuan, et al.
Published: (2024)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)
Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling
by: Ji, Shiyu, et al.
Published: (2025)
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
by: Ji, Shiyu, et al.
Published: (2026)
Multi-Layer Attention is the Amplifier of Demonstration Effectiveness
by: Wang, Dingzirui, et al.
Published: (2025)
ProxyAttn: Guided Sparse Attention via Representative Heads
by: Wang, Yixuan, et al.
Published: (2025)
Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
by: Thomas, Rahul, et al.
Published: (2026)
Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification
by: Li, Bohan, et al.
Published: (2023)
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
by: Wang, Ziyan, et al.
Published: (2025)
Faster Cascades via Speculative Decoding
by: Narasimhan, Harikrishna, et al.
Published: (2024)
OneBit: Towards Extremely Low-bit Large Language Models
by: Xu, Yuzhuang, et al.
Published: (2024)
HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization
by: Shan, Baocai, et al.
Published: (2026)
Self-Speculative Biased Decoding for Faster Re-Translation
by: Zeng, Linxiao, et al.
Published: (2025)
Traversal Verification for Speculative Tree Decoding
by: Weng, Yepeng, et al.
Published: (2025)
Improving Grammatical Error Correction via Contextual Data Augmentation
by: Wang, Yixuan, et al.
Published: (2024)
MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification
by: Song, Jingwei, et al.
Published: (2026)
DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
by: Xiong, Yunfan, et al.
Published: (2024)
Think Twice Before You Act: Improving Inverse Problem Solving With MCMC
by: Zhu, Yaxuan, et al.
Published: (2024)
Block Verification Accelerates Speculative Decoding
by: Sun, Ziteng, et al.
Published: (2024)
ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
by: Xu, Yuzhuang, et al.
Published: (2026)
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
by: Jiang, Haoyun, et al.
Published: (2026)
Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning
by: Ma, Zhiyuan, et al.
Published: (2025)
Speculative Speculative Decoding
by: Kumar, Tanishq, et al.
Published: (2026)
Speculative Safety-Aware Decoding
by: Wang, Xuekang, et al.
Published: (2025)
Think Before You Act: Decision Transformers with Working Memory
by: Kang, Jikun, et al.
Published: (2023)
Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
by: Li, Xingyao, et al.
Published: (2026)
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
by: Shoham, Ofir Ben
Published: (2026)
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
by: Bachmann, Gregor, et al.
Published: (2025)
Speculative Decoding for Verilog: Speed and Quality, All in One
by: Xu, Changran, et al.
Published: (2025)
Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
by: Li, Cheng, et al.
Published: (2025)
Speeding up Speculative Decoding via Sequential Approximate Verification
by: Zhong, Meiyu, et al.
Published: (2025)
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
by: Zhang, Yudi, et al.
Published: (2025)
Think Before You Lie: How Reasoning Leads to Honesty
by: Yuan, Ann, et al.
Published: (2026)
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
by: Luo, Xianzhen, et al.
Published: (2024)
Decoding Speculative Decoding
by: Yan, Minghao, et al.
Published: (2024)