Bibliographic Details
Main Authors: Liu, ShaoZhen; Huang, Xinting; Peng, Houwen; Chen, Xin; Song, Xinyang; Li, Qi; Sun, Zhenan
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.05616
_version_ 1866912811163058176
author Liu, ShaoZhen
Huang, Xinting
Peng, Houwen
Chen, Xin
Song, Xinyang
Li, Qi
Sun, Zhenan
contents In recent years, large language models (LLMs) have demonstrated significant potential in complex reasoning tasks such as mathematical problem-solving. However, existing research relies predominantly on reinforcement learning (RL) frameworks and overlooks supervised fine-tuning (SFT) methods. This paper proposes a two-stage training framework that enhances a model's self-correction capabilities through self-generated long chain-of-thought (CoT) data. In the first stage, a multi-turn dialogue strategy guides the model to generate CoT data incorporating verification, backtracking, subgoal decomposition, and backward reasoning, and predefined rules filter high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize the data distribution, strengthening the model's ability to handle complex problems. The approach produces reasoning chains more than 4 times longer while maintaining strong scalability, demonstrating that SFT effectively activates a model's intrinsic reasoning capabilities and provides a resource-efficient pathway for complex task optimization. Experimental results demonstrate performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems such as AIME24. Code will be open-sourced.
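The abstract does not give the details of the second-stage mechanism, so the following is only a minimal sketch of what a difficulty-aware rejection sampling step could look like. Everything in it is an assumption made for illustration, not the authors' implementation: the `Sample` record, the `difficulty_aware_select` function, the `max_keep` parameter, and the choice to estimate difficulty as the fraction of incorrect sampled responses per problem.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One sampled response (hypothetical record, not from the paper)."""
    problem_id: str
    response: str
    is_correct: bool  # assumed to come from an external answer checker


def difficulty_aware_select(samples: List[Sample], max_keep: int = 4) -> List[Sample]:
    """Keep correct responses, with a larger quota for harder problems.

    Assumption: difficulty is estimated as the share of incorrect
    responses among all responses sampled for the same problem.
    """
    # Group the sampled responses by problem.
    by_problem: Dict[str, List[Sample]] = {}
    for sample in samples:
        by_problem.setdefault(sample.problem_id, []).append(sample)

    selected: List[Sample] = []
    for group in by_problem.values():
        correct = [s for s in group if s.is_correct]
        if not correct:
            # Rejection step: no verified response, so nothing to train on.
            continue
        wrong = len(group) - len(correct)
        # Harder problems (more wrong answers) get a larger quota;
        # every solvable problem keeps at least one response.
        quota = max(1, math.ceil(max_keep * wrong / len(group)))
        selected.extend(correct[:quota])
    return selected
```

Under these assumptions, a problem the model always solves contributes one response, a problem it rarely solves contributes up to `max_keep` correct responses, and a problem it never solves contributes none, which shifts the fine-tuning data toward harder but still solvable problems.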
format Preprint
id arxiv_https___arxiv_org_abs_2601_05616
institution arXiv
publishDate 2026
record_format arxiv
title Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks
topic Machine Learning
url https://arxiv.org/abs/2601.05616