:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yixuan; Liu, Yijun; Ji, Shiyu; Xu, Yuzhuang; Xu, Yang; Zhu, Qingfu; Che, Wanxiang
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2505.18629

Similar Items

CRVQ: Channel-Relaxed Vector Quantization for Extreme Compression of LLMs
by: Xu, Yuzhuang, et al.
Published: (2024)

CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
by: Xu, Yuzhuang, et al.
Published: (2025)

Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
by: Wang, Yixuan, et al.
Published: (2025)

Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
by: Liu, Yijun, et al.
Published: (2025)

Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
by: Wang, Yixuan, et al.
Published: (2024)

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)

Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling
by: Ji, Shiyu, et al.
Published: (2025)

EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
by: Ji, Shiyu, et al.
Published: (2026)

Multi-Layer Attention is the Amplifier of Demonstration Effectiveness
by: Wang, Dingzirui, et al.
Published: (2025)

ProxyAttn: Guided Sparse Attention via Representative Heads
by: Wang, Yixuan, et al.
Published: (2025)

Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
by: Thomas, Rahul, et al.
Published: (2026)

Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification
by: Li, Bohan, et al.
Published: (2023)

Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
by: Wang, Ziyan, et al.
Published: (2025)

Faster Cascades via Speculative Decoding
by: Narasimhan, Harikrishna, et al.
Published: (2024)

OneBit: Towards Extremely Low-bit Large Language Models
by: Xu, Yuzhuang, et al.
Published: (2024)

HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization
by: Shan, Baocai, et al.
Published: (2026)

Self-Speculative Biased Decoding for Faster Re-Translation
by: Zeng, Linxiao, et al.
Published: (2025)

Traversal Verification for Speculative Tree Decoding
by: Weng, Yepeng, et al.
Published: (2025)

Improving Grammatical Error Correction via Contextual Data Augmentation
by: Wang, Yixuan, et al.
Published: (2024)

MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification
by: Song, Jingwei, et al.
Published: (2026)

DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
by: Xiong, Yunfan, et al.
Published: (2024)

Think Twice Before You Act: Improving Inverse Problem Solving With MCMC
by: Zhu, Yaxuan, et al.
Published: (2024)

Block Verification Accelerates Speculative Decoding
by: Sun, Ziteng, et al.
Published: (2024)

ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
by: Xu, Yuzhuang, et al.
Published: (2026)

TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
by: Jiang, Haoyun, et al.
Published: (2026)

Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning
by: Ma, Zhiyuan, et al.
Published: (2025)

Speculative Speculative Decoding
by: Kumar, Tanishq, et al.
Published: (2026)

Speculative Safety-Aware Decoding
by: Wang, Xuekang, et al.
Published: (2025)

Think Before You Act: Decision Transformers with Working Memory
by: Kang, Jikun, et al.
Published: (2023)

Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
by: Li, Xingyao, et al.
Published: (2026)

Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
by: Shoham, Ofir Ben
Published: (2026)

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
by: Bachmann, Gregor, et al.
Published: (2025)

Speculative Decoding for Verilog: Speed and Quality, All in One
by: Xu, Changran, et al.
Published: (2025)

Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
by: Li, Cheng, et al.
Published: (2025)

Speeding up Speculative Decoding via Sequential Approximate Verification
by: Zhong, Meiyu, et al.
Published: (2025)

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
by: Zhang, Yudi, et al.
Published: (2025)

Think Before You Lie: How Reasoning Leads to Honesty
by: Yuan, Ann, et al.
Published: (2026)

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
by: Luo, Xianzhen, et al.
Published: (2024)

Decoding Speculative Decoding
by: Yan, Minghao, et al.
Published: (2024)