Library Catalog

Bibliographic Details
Main Authors:	Liu, Jingyu; Chen, Beidi; Zhang, Ce
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.02789

Similar Items

Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)
by: Bachkaniwala, Rajveer, et al.
Published: (2026)

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
by: Jin, Tao, et al.
Published: (2026)

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
by: Fan, Qihang, et al.
Published: (2026)

PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference
by: Zhang, Hao, et al.
Published: (2025)

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
by: Zhang, Chen, et al.
Published: (2024)

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
by: Liu, Ziyang
Published: (2026)

Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
by: Xiao, Bin, et al.
Published: (2024)

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
by: Hu, Shijing, et al.
Published: (2025)

TokenButler: Token Importance is Predictable
by: Akhauri, Yash, et al.
Published: (2025)

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
by: Xiao, Bin, et al.
Published: (2024)

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
by: Wu, Zhaoxuan, et al.
Published: (2025)

CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
by: Lv, Junlin, et al.
Published: (2024)

GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
by: Zhou, Yang, et al.
Published: (2025)

Block-Attention for Efficient Prefilling
by: Ma, Dongyang, et al.
Published: (2024)

Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
by: Dong, Harry, et al.
Published: (2024)

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
by: Elhoushi, Mostafa, et al.
Published: (2024)

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning
by: Sun, Wei, et al.
Published: (2025)

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
by: Ji, Yicheng, et al.
Published: (2025)

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
by: Xiao, Sibo, et al.
Published: (2025)

HAMburger: Accelerating LLM Inference via Token Smashing
by: Liu, Jingyu, et al.
Published: (2025)

Scaling Laws for Speculative Decoding
by: Yan, Siyuan, et al.
Published: (2025)

Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation
by: He, Linda, et al.
Published: (2025)

Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)

Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
by: Liu, Yuxin, et al.
Published: (2026)

Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
by: Goddard, Charles, et al.
Published: (2025)

Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
by: Li, Jinze, et al.
Published: (2025)

Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
by: Gautam, Aayush, et al.
Published: (2025)

Scalable LLM Reasoning Acceleration with Low-rank Distillation
by: Dong, Harry, et al.
Published: (2025)

Do LLMs Encode Functional Importance of Reasoning Tokens?
by: Singh, Janvijay, et al.
Published: (2026)

Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
by: Fu, Yuchen, et al.
Published: (2024)

JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
by: Tian, Yuandong, et al.
Published: (2023)

ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
by: Zhang, Jianguo, et al.
Published: (2025)

Assessing Large Language Models for Online Extremism Research: Identification, Explanation, and New Knowledge
by: Dong, Beidi, et al.
Published: (2024)

CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
by: Weng, Yepeng, et al.
Published: (2025)

Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models
by: Ozyurt, Yilmazcan, et al.
Published: (2023)

Speculative Decoding: Performance or Illusion?
by: Liu, Xiaoxuan, et al.
Published: (2025)

Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
by: Upasani, Shubhangi, et al.
Published: (2026)

TiDAR: Think in Diffusion, Talk in Autoregression
by: Liu, Jingyu, et al.
Published: (2025)

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training
by: Qing, Liu, et al.
Published: (2026)

Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
by: Lv, Zheqi, et al.
Published: (2025)