Bibliographic Details
Main Authors: Wu, Junda; Xiong, Yuxin; Li, Xintong; Yu, Sheldon; Hu, Zhengmian; Yu, Tong; Wang, Rui; Chen, Xiang; Shang, Jingbo; McAuley, Julian
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2507.08182
_version_ 1866918313180790784
author Wu, Junda
Xiong, Yuxin
Li, Xintong
Yu, Sheldon
Hu, Zhengmian
Yu, Tong
Wang, Rui
Chen, Xiang
Shang, Jingbo
McAuley, Julian
contents Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, which limits their ability to systematically explore diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modeling reasoning actions as explicit probability distributions in latent space, our approach captures epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses provide evidence lower bounds (ELBO) that ground our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2507_08182
institution arXiv
publishDate 2025
record_format arxiv
title CTRLS: Chain-of-Thought Reasoning via Latent State-Transition
topic Machine Learning
url https://arxiv.org/abs/2507.08182
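The abstract in this record mentions epsilon-greedy exploration and entropy-based regularization over latent state transitions. The following is a minimal illustrative sketch of those two generic techniques, not the CTRLS implementation: it assumes a small discrete set of latent transitions scored by logits, and every name in it (`softmax`, `entropy`, `epsilon_greedy`, `regularized_objective`, `beta`) is hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def epsilon_greedy(probs, epsilon, rng):
    """With probability epsilon pick a uniformly random transition,
    otherwise pick the most probable one."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=probs.__getitem__)


def regularized_objective(reward, log_prob, probs, beta):
    """Policy-gradient-style surrogate plus an entropy bonus weighted by beta.
    This particular form is an assumption made for illustration only."""
    return reward * log_prob + beta * entropy(probs)


if __name__ == "__main__":
    rng = random.Random(0)
    probs = softmax([2.0, 0.5, -1.0])  # hypothetical transition scores
    choice = epsilon_greedy(probs, epsilon=0.1, rng=rng)
    score = regularized_objective(1.0, math.log(probs[choice]), probs, beta=0.01)
    print(choice, score)
```

A larger `epsilon` or `beta` pushes the sketch toward broader exploration; how the paper sets or schedules such quantities is not stated in this record.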