Staff View: Accelerated Test-Time Scaling with Model-Free Speculative Sampling :: Library Catalog

Bibliographic Details
Main Authors:	Song, Woomin; Dingliwal, Saket; Jayanthi, Sai Muralidhar; Ganesh, Bhavana; Shin, Jinwoo; Galstyan, Aram; Bodapati, Sravan Babu
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2506.04708

_version_	1866911703747264512
author	Song, Woomin; Dingliwal, Saket; Jayanthi, Sai Muralidhar; Ganesh, Bhavana; Shin, Jinwoo; Galstyan, Aram; Bodapati, Sravan Babu
author_facet	Song, Woomin; Dingliwal, Saket; Jayanthi, Sai Muralidhar; Ganesh, Bhavana; Shin, Jinwoo; Galstyan, Aram; Bodapati, Sravan Babu
contents	Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_04708
institution	arXiv
publishDate	2025
record_format	arxiv
title	Accelerated Test-Time Scaling with Model-Free Speculative Sampling
topic	Computation and Language
url	https://arxiv.org/abs/2506.04708

Similar Items