Bibliographic Details
Main Authors: Wang, Yunhao, Zhang, Yuhao, Yu, Tinghao, Xu, Can, Zhang, Feng, Lian, Fengzong
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2505.20101
author Wang, Yunhao
Zhang, Yuhao
Yu, Tinghao
Xu, Can
Zhang, Feng
Lian, Fengzong
contents Large language models (LLMs) have shown impressive capabilities in handling complex tasks through long-chain reasoning. However, the extensive reasoning steps involved can significantly increase computational costs, posing challenges for real-world deployment. Recent efforts have focused on optimizing reasoning efficiency by shortening the Chain-of-Thought (CoT) reasoning processes through various approaches, such as length-aware prompt engineering, supervised fine-tuning on CoT data with variable lengths, and reinforcement learning with length penalties. Although these methods effectively reduce reasoning length, they still necessitate an initial reasoning phase. More recent approaches have attempted to integrate long-chain and short-chain reasoning abilities into a single model, yet they still rely on manual control to toggle between short and long CoT. In this work, we propose a novel approach that autonomously switches between short and long reasoning chains based on problem complexity. Our method begins with supervised fine-tuning of the base model to equip it with both long-chain and short-chain reasoning abilities. We then employ reinforcement learning to further balance short and long CoT generation while maintaining accuracy through two key strategies: first, integrating reinforcement learning with a long-short adaptive group-wise reward strategy to assess prompt complexity and provide corresponding rewards; second, implementing a logit-based reasoning mode switching loss to optimize the model's initial token choice, thereby guiding the selection of the reasoning type. Evaluations on mathematical datasets demonstrate that our model can dynamically switch between long-chain and short-chain reasoning modes without substantially sacrificing performance. This advancement enhances the practicality of reasoning in large language models for real-world applications.
format Preprint
id arxiv_https___arxiv_org_abs_2505_20101
institution arXiv
publishDate 2025
record_format arxiv
title Adaptive Deep Reasoning: Triggering Deep Thinking When Needed
topic Computation and Language
url https://arxiv.org/abs/2505.20101
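note The abstract mentions a group-wise assessment of prompt complexity and a logit-based loss on the model's initial token. The record gives no formulas, so the following is only a minimal sketch of how such a pairing might look, not the authors' method. Everything in it is an assumption made for illustration: the function names preferred_mode and mode_switch_loss, the 0.75 accuracy threshold, and the choice of a two-way cross-entropy over a hypothetical "short" and "long" first-token logit.

```python
import math


def preferred_mode(short_correct, threshold=0.75):
    """Pick a target reasoning mode for one prompt.

    ASSUMPTION: a prompt counts as "easy" when a group of sampled
    short-chain answers is correct at least `threshold` of the time.
    The threshold value is hypothetical, not taken from the paper.
    """
    if not short_correct:
        raise ValueError("need at least one sampled answer")
    accuracy = sum(short_correct) / len(short_correct)
    return "short" if accuracy >= threshold else "long"


def mode_switch_loss(logit_short, logit_long, target):
    """Cross-entropy over the two candidate first tokens.

    ASSUMPTION: the reasoning mode is selected by the first generated
    token, and the loss only compares the logits of the two
    mode-selecting tokens.
    """
    # Numerically stable log-sum-exp over the two logits.
    m = max(logit_short, logit_long)
    log_z = m + math.log(math.exp(logit_short - m) + math.exp(logit_long - m))
    chosen = logit_short if target == "short" else logit_long
    # Negative log-probability of the target mode token.
    return log_z - chosen


if __name__ == "__main__":
    # Hypothetical group of four sampled short-chain answers, one wrong.
    target = preferred_mode([True, True, True, False])
    print(target, mode_switch_loss(1.2, -0.3, target))
```

Under these assumptions, the loss is small when the first-token logits already favor the mode that the group accuracy suggests, and grows as the model leans toward the other mode.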