Bibliographic Details
Main Authors: Zheng, Tong; Liu, Haolin; Huang, Chengsong; Bao, Huiwen; Zhang, Sheng; Liu, Rui; Dai, Runpeng; Chen, Ruibo; Liu, Chenxi; Xiong, Tianyi; Wu, Xidong; Zhang, Hongming; Huang, Heng
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2605.08083
author Zheng, Tong
Liu, Haolin
Huang, Chengsong
Bao, Huiwen
Zhang, Sheng
Liu, Rui
Dai, Runpeng
Chen, Ruibo
Liu, Chenxi
Xiong, Tianyi
Wu, Xidong
Zhang, Hongming
Huang, Heng
contents Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable, and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
format Preprint
id arxiv_https___arxiv_org_abs_2605_08083
institution arXiv
publishDate 2026
record_format arxiv
title LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
topic Computation and Language
url https://arxiv.org/abs/2605.08083