Bibliographic Details
Main Authors: Divilkovskiy, Maxim, Malygin, Vitaly, Zlobin, Sergey, Ilyushin, Stanislav, Isali, Sultan, Kalugin, Vasily, Aitassova, Nuriza, Yi, Fei, Zeng, Weidi
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2508.09072
_version_ 1866915518727847936
author Divilkovskiy, Maxim
Malygin, Vitaly
Zlobin, Sergey
Ilyushin, Stanislav
Isali, Sultan
Kalugin, Vasily
Aitassova, Nuriza
Yi, Fei
Zeng, Weidi
contents Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. Existing acceleration techniques partially mitigate token-level latency by relying on auxiliary draft models or introducing an additional training phase, but fail to address the dominant memory and communication costs. We present READER, a provably lossless speculative decoding framework that bypasses the training of the auxiliary draft model. READER formalizes speculative decoding as a stochastic tree construction problem and exploits the empirical redundancy structure of natural language to generate high-probability candidate continuations. Our method revisits the problem of constructing draft trees, establishing substantial statistical improvements over stochastic draft-tree methods and providing a complexity-theoretic analysis that characterizes the optimality frontier of speculative decoding under bounded computation and memory resources. Beyond the single-sequence regime traditionally considered in prior work, we introduce a memory-optimal key-value cache-serving strategy that guarantees amortized sublinear overhead in the batch dimension, allowing READER to scale to realistic inference workloads. Comprehensive experiments demonstrate up to 6.13x wall-clock speedup on single-prompt inference and up to 5.92x on batched inference, consistently surpassing prior speculative decoding baselines, while preserving exact output equivalence, with even more pronounced gains in retrieval-augmented generation pipelines. Our results close a key gap between theoretical parallelism limits and practical LLM inference, suggesting a new standard for efficient deployment.
format Preprint
id arxiv_https___arxiv_org_abs_2508_09072
institution arXiv
publishDate 2025
record_format arxiv
title READER: Retrieval-Assisted Drafter for Efficient LLM Inference
topic Computation and Language
url https://arxiv.org/abs/2508.09072