Bibliographic Details
Main Authors: Park, Jiyoung; Jang, Hankyu; Song, Changseok; Jung, Wookeun
Format: Preprint
Published: 2026
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.05145
author Park, Jiyoung
Jang, Hankyu
Song, Changseok
Jung, Wookeun
contents Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
format Preprint
id arxiv_https___arxiv_org_abs_2602_05145
institution arXiv
publishDate 2026
record_format arxiv
title TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2602.05145