MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Haonan; Liang, Weida; Fu, Zihang; Zheng, Nie; Zhang, Yifan; Tong, Yao; Zhu, Tongyao; Jiang, Hao; Li, Chuang; Wu, Jiaying; Kawaguchi, Kenji
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2509.23196

_version_	1866911181408567296
author	Wang, Haonan; Liang, Weida; Fu, Zihang; Zheng, Nie; Zhang, Yifan; Tong, Yao; Zhu, Tongyao; Jiang, Hao; Li, Chuang; Wu, Jiaying; Kawaguchi, Kenji
author_facet	Wang, Haonan; Liang, Weida; Fu, Zihang; Zheng, Nie; Zhang, Yifan; Tong, Yao; Zhu, Tongyao; Jiang, Hao; Li, Chuang; Wu, Jiaying; Kawaguchi, Kenji
contents	Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_23196
institution	arXiv
publishDate	2025
record_format	arxiv
title	From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs
topic	Computation and Language
url	https://arxiv.org/abs/2509.23196
