Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhang, Shimin, Chen, Xianwei, Shen, Yufan, Ye, Ziyuan, Wu, Jibin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.07558
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often drives the policy toward over-determinism, resulting in ineffective exploration and premature policy convergence. While promoting token-level diversity has shown promise in mitigating entropy collapse, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a framework that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX consistently incentivizes reasoning capability and outperforms existing token-level methods. Our project is available at https://github.com/ZhangShimin1/ReLaX.

Similar Items