Bibliographic Details
Main Authors: Wang, Haonan; Liang, Weida; Fu, Zihang; Zheng, Nie; Zhang, Yifan; Tong, Yao; Zhu, Tongyao; Jiang, Hao; Li, Chuang; Wu, Jiaying; Kawaguchi, Kenji
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2509.23196
_version_ 1866911181408567296
author Wang, Haonan
Liang, Weida
Fu, Zihang
Zheng, Nie
Zhang, Yifan
Tong, Yao
Zhu, Tongyao
Jiang, Hao
Li, Chuang
Wu, Jiaying
Kawaguchi, Kenji
contents Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot chain-of-thought (CoT) prompting than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when the demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
format Preprint
id arxiv_https___arxiv_org_abs_2509_23196
institution arXiv
publishDate 2025
record_format arxiv
title From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs
topic Computation and Language
url https://arxiv.org/abs/2509.23196