Bibliographic Details
Main Authors:	Park, Jiyoung; Jang, Hankyu; Song, Changseok; Jung, Wookeun
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.05145

_version_	1866914306742812672
author	Park, Jiyoung; Jang, Hankyu; Song, Changseok; Jung, Wookeun
author_facet	Park, Jiyoung; Jang, Hankyu; Song, Changseok; Jung, Wookeun
contents	Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_05145
institution	arXiv
publishDate	2026
record_format	arxiv
title	TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2602.05145

Similar Items