| Main Authors: | Liu, Jingyu, Chen, Beidi, Zhang, Ce |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.02789 |
Similar Items
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)
by: Bachkaniwala, Rajveer, et al.
Published: (2026)
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
by: Jin, Tao, et al.
Published: (2026)
FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
by: Fan, Qihang, et al.
Published: (2026)
PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference
by: Zhang, Hao, et al.
Published: (2025)
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
by: Zhang, Chen, et al.
Published: (2024)
Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
by: Liu, Ziyang
Published: (2026)
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
by: Xiao, Bin, et al.
Published: (2024)
GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
by: Hu, Shijing, et al.
Published: (2025)
TokenButler: Token Importance is Predictable
by: Akhauri, Yash, et al.
Published: (2025)
Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
by: Xiao, Bin, et al.
Published: (2024)
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
by: Wu, Zhaoxuan, et al.
Published: (2025)
CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
by: Lv, Junlin, et al.
Published: (2024)
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
by: Zhou, Yang, et al.
Published: (2025)
Block-Attention for Efficient Prefilling
by: Ma, Dongyang, et al.
Published: (2024)
Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
by: Dong, Harry, et al.
Published: (2024)
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
by: Elhoushi, Mostafa, et al.
Published: (2024)
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning
by: Sun, Wei, et al.
Published: (2025)
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
by: Ji, Yicheng, et al.
Published: (2025)
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
by: Xiao, Sibo, et al.
Published: (2025)
HAMburger: Accelerating LLM Inference via Token Smashing
by: Liu, Jingyu, et al.
Published: (2025)
Scaling Laws for Speculative Decoding
by: Yan, Siyuan, et al.
Published: (2025)
Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation
by: He, Linda, et al.
Published: (2025)
Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)
Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
by: Liu, Yuxin, et al.
Published: (2026)
Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
by: Goddard, Charles, et al.
Published: (2025)
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
by: Li, Jinze, et al.
Published: (2025)
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
by: Gautam, Aayush, et al.
Published: (2025)
Scalable LLM Reasoning Acceleration with Low-rank Distillation
by: Dong, Harry, et al.
Published: (2025)
Do LLMs Encode Functional Importance of Reasoning Tokens?
by: Singh, Janvijay, et al.
Published: (2026)
Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
by: Fu, Yuchen, et al.
Published: (2024)
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
by: Tian, Yuandong, et al.
Published: (2023)
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
by: Zhang, Jianguo, et al.
Published: (2025)
Assessing Large Language Models for Online Extremism Research: Identification, Explanation, and New Knowledge
by: Dong, Beidi, et al.
Published: (2024)
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
by: Weng, Yepeng, et al.
Published: (2025)
Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models
by: Ozyurt, Yilmazcan, et al.
Published: (2023)
Speculative Decoding: Performance or Illusion?
by: Liu, Xiaoxuan, et al.
Published: (2025)
Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
by: Upasani, Shubhangi, et al.
Published: (2026)
TiDAR: Think in Diffusion, Talk in Autoregression
by: Liu, Jingyu, et al.
Published: (2025)
AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training
by: Qing, Liu, et al.
Published: (2026)
Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
by: Lv, Zheqi, et al.
Published: (2025)