Staff View: Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks :: Library Catalog

Bibliographic Details
Main Authors:	Liu, ShaoZhen; Huang, Xinting; Peng, Houwen; Chen, Xin; Song, Xinyang; Li, Qi; Sun, Zhenan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.05616

_version_	1866912811163058176
author	Liu, ShaoZhen; Huang, Xinting; Peng, Houwen; Chen, Xin; Song, Xinyang; Li, Qi; Sun, Zhenan
author_facet	Liu, ShaoZhen; Huang, Xinting; Peng, Houwen; Chen, Xin; Song, Xinyang; Li, Qi; Sun, Zhenan
contents	In recent years, large language models (LLMs) have demonstrated significant potential in complex reasoning tasks like mathematical problem-solving. However, existing research predominantly relies on reinforcement learning (RL) frameworks while overlooking supervised fine-tuning (SFT) methods. This paper proposes a new two-stage training framework that enhances models' self-correction capabilities through self-generated long chain-of-thought (CoT) data. During the first stage, a multi-turn dialogue strategy guides the model to generate CoT data incorporating verification, backtracking, subgoal decomposition, and backward reasoning, with predefined rules filtering high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize data distribution, strengthening the model's ability to handle complex problems. The approach generates reasoning chains more than four times longer while maintaining strong scalability, proving that SFT effectively activates models' intrinsic reasoning capabilities and provides a resource-efficient pathway for complex task optimization. Experimental results demonstrate performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems like AIME24. Code will be open-sourced.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_05616
institution	arXiv
publishDate	2026
record_format	arxiv
title	Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks
topic	Machine Learning
url	https://arxiv.org/abs/2601.05616

Similar Items