Bibliographic Details
Main Authors: Song, Woomin, Dingliwal, Saket, Jayanthi, Sai Muralidhar, Ganesh, Bhavana, Shin, Jinwoo, Galstyan, Aram, Bodapati, Sravan Babu
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2506.04708
author Song, Woomin
Dingliwal, Saket
Jayanthi, Sai Muralidhar
Ganesh, Bhavana
Shin, Jinwoo
Galstyan, Aram
Bodapati, Sravan Babu
contents Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
format Preprint
id arxiv_https___arxiv_org_abs_2506_04708
institution arXiv
publishDate 2025
record_format arxiv
title Accelerated Test-Time Scaling with Model-Free Speculative Sampling
topic Computation and Language
url https://arxiv.org/abs/2506.04708
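The abstract names Gumbel-Top-K sampling as one ingredient of STAND's stochastic drafting. The record does not say how the authors implement or optimize it, so the following is only a minimal sketch of the generic Gumbel-Top-K trick: perturb each logit with independent Gumbel(0, 1) noise and keep the k largest, which draws k distinct tokens without replacement in proportion to their softmax probabilities. The function name gumbel_top_k, its parameters, and the example logits are illustrative assumptions, not taken from the paper.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Return k distinct indices sampled without replacement from softmax(logits).

    Generic Gumbel-Top-K trick (hypothetical helper, not the paper's code):
    add independent Gumbel(0, 1) noise to every logit, then keep the k
    indices with the largest perturbed values.
    """
    perturbed = []
    for index, logit in enumerate(logits):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    # Sort by perturbed score, highest first, and keep the top k indices.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Example: pick 2 candidate draft tokens from 4 made-up logits.
rng = random.Random(0)
candidates = gumbel_top_k([2.0, 1.0, 0.5, -1.0], k=2, rng=rng)
```

With k = 1 this reduces to the Gumbel-Max trick, that is, one ordinary sample from the softmax distribution; a larger k yields several distinct candidates from a single round of noise.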