Bibliographic Details
Main Authors: Nie, Wenhua, Liu, Junlin, Wu, Jianan, Meng, Zijie, Fan, Yilong, Zijian, Zhang, Zheng, Haoran, Jang, Jyh-Shing Roger
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2605.07686
author Nie, Wenhua
Liu, Junlin
Wu, Jianan
Meng, Zijie
Fan, Yilong
Zijian, Zhang
Zheng, Haoran
Jang, Jyh-Shing Roger
contents Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, $\mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t(1-F_L(b))$, that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.
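The truncation-waste decomposition quoted in the abstract can be illustrated with a minimal sketch. The readings used here are assumptions, not details confirmed by this record: F_L(b) is taken as the fraction of reasoning chains that finish within budget b, α_c as accuracy on completed chains, α_t as accuracy on truncated chains, and non-thinking accuracy as a constant that does not depend on the budget. All function names and numbers are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def empirical_cdf(lengths):
    """Return F(b): the fraction of sampled chain lengths that are <= b."""
    ordered = sorted(lengths)
    n = len(ordered)

    def cdf(b):
        return bisect_right(ordered, b) / n

    return cdf


def acc_think(b, cdf, alpha_c, alpha_t):
    """Acc_think(b) = alpha_c * F_L(b) + alpha_t * (1 - F_L(b))."""
    f = cdf(b)
    return alpha_c * f + alpha_t * (1.0 - f)


def crossover_budget(budgets, cdf, alpha_c, alpha_t, acc_nothink):
    """Smallest budget at which predicted thinking accuracy reaches
    the (assumed constant) non-thinking accuracy, or None if it never does."""
    for b in sorted(budgets):
        if acc_think(b, cdf, alpha_c, alpha_t) >= acc_nothink:
            return b
    return None


# Hypothetical chain lengths in tokens; not data from the paper.
chain_lengths = [300, 600, 900, 1500, 2500, 4000, 5000, 6000]
F_L = empirical_cdf(chain_lengths)
budgets = [256, 512, 1024, 2048, 4096, 8192]

if __name__ == "__main__":
    for b in budgets:
        print(b, round(acc_think(b, F_L, alpha_c=0.9, alpha_t=0.1), 4))
    print("crossover:", crossover_budget(budgets, F_L, 0.9, 0.1, acc_nothink=0.65))
```

Under these assumptions, the predicted crossover moves to larger budgets whenever chains get longer (F_L shifts right) or truncated chains score lower (smaller α_t), which is the qualitative behavior the abstract describes for harder tasks.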
format Preprint
id arxiv_https___arxiv_org_abs_2605_07686
institution arXiv
publishDate 2026
record_format arxiv
title The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
topic Machine Learning
url https://arxiv.org/abs/2605.07686