Staff View: Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs :: Library Catalog

Bibliographic Details
Main Authors:	Li, Chang; Zhang, Yaren; Lv, Haoran; Cao, Qiong; Xue, Chao; He, Xiaodong
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence I.2.7
Online Access:	https://arxiv.org/abs/2507.16473

_version_	1866916860589506560
author	Li, Chang; Zhang, Yaren; Lv, Haoran; Cao, Qiong; Xue, Chao; He, Xiaodong
author_facet	Li, Chang; Zhang, Yaren; Lv, Haoran; Cao, Qiong; Xue, Chao; He, Xiaodong
contents	Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_16473
institution	arXiv
publishDate	2025
record_format	arxiv
title	Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs
topic	Artificial Intelligence I.2.7
url	https://arxiv.org/abs/2507.16473

Similar Items