Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Fan, Siqi; Qin, Bowen; Han, Peng; Shang, Shuo; Wang, Yequan; Sun, Aixin
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2505.22017

_version_	1866915552197345280
author	Fan, Siqi; Qin, Bowen; Han, Peng; Shang, Shuo; Wang, Yequan; Sun, Aixin
author_facet	Fan, Siqi; Qin, Bowen; Han, Peng; Shang, Shuo; Wang, Yequan; Sun, Aixin
contents	Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_22017
institution	arXiv
publishDate	2025
record_format	arxiv
title	The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2505.22017

Similar Items