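Staff View: The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits :: Library Catalog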

Bibliographic Details
Main Authors:	Nie, Wenhua; Liu, Junlin; Wu, Jianan; Meng, Zijie; Fan, Yilong; Zijian, Zhang; Zheng, Haoran; Jang, Jyh-Shing Roger
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.07686

_version_	1866911662196391936
author	Nie, Wenhua; Liu, Junlin; Wu, Jianan; Meng, Zijie; Fan, Yilong; Zijian, Zhang; Zheng, Haoran; Jang, Jyh-Shing Roger
author_facet	Nie, Wenhua; Liu, Junlin; Wu, Jianan; Meng, Zijie; Fan, Yilong; Zijian, Zhang; Zheng, Haoran; Jang, Jyh-Shing Roger
contents	Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, $\mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t\,(1-F_L(b))$, that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_07686
institution	arXiv
publishDate	2026
record_format	arxiv
title	The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
topic	Machine Learning
url	https://arxiv.org/abs/2605.07686
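
Illustrative note on the decomposition quoted in the contents field: the formula $\mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t\,(1-F_L(b))$ can be evaluated with a few lines of Python. The sketch below is not the authors' code. It assumes, as a reading of the abstract, that F_L(b) is the fraction of reasoning chains that finish within b tokens, that alpha_c is the accuracy on chains that complete, and that alpha_t is the accuracy on truncated chains. All function names, chain lengths, and accuracy values are hypothetical, and the non-thinking accuracy is treated as a constant across budgets for simplicity.

```python
from bisect import bisect_right


def empirical_cdf(sorted_lengths, b):
    """F_L(b): fraction of chains whose length is at most b tokens."""
    return bisect_right(sorted_lengths, b) / len(sorted_lengths)


def acc_think(b, sorted_lengths, alpha_c, alpha_t):
    """Acc_think(b) = alpha_c * F_L(b) + alpha_t * (1 - F_L(b))."""
    f = empirical_cdf(sorted_lengths, b)
    return alpha_c * f + alpha_t * (1.0 - f)


def crossover_budget(budgets, sorted_lengths, alpha_c, alpha_t, acc_nothink):
    """Smallest budget at which predicted thinking accuracy exceeds
    the (assumed constant) non-thinking accuracy, or None if it never does."""
    for b in sorted(budgets):
        if acc_think(b, sorted_lengths, alpha_c, alpha_t) > acc_nothink:
            return b
    return None


# Hypothetical chain lengths in tokens (made up for illustration only).
lengths = sorted([300, 600, 900, 1500, 2500, 4000, 5000, 6000])
budgets = [512, 1024, 2048, 4096, 8192]

# Hypothetical accuracies: completed chains, truncated chains, non-thinking.
ALPHA_C, ALPHA_T, ACC_NOTHINK = 0.9, 0.1, 0.6

for b in budgets:
    print(b, round(acc_think(b, lengths, ALPHA_C, ALPHA_T), 3))

print("crossover:", crossover_budget(budgets, lengths, ALPHA_C, ALPHA_T, ACC_NOTHINK))
```

With these made-up inputs, the predicted thinking accuracy stays below the non-thinking baseline until most chains fit inside the budget, which is the crowding-out effect the abstract describes. Longer chains push the crossover to larger budgets.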

Similar Items