MARC21 :: Library Catalog

Bibliographic Details
Main Authors:	Gong, Yuehu; Wang, Zeyuan; Chen, Yulin; Ding, Shutong; Zhou, Qingyuan; Fu, Yanwei
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2603.21621

_version_	1866913169306288128
author	Gong, Yuehu; Wang, Zeyuan; Chen, Yulin; Ding, Shutong; Zhou, Qingyuan; Fu, Yanwei
author_facet	Gong, Yuehu; Wang, Zeyuan; Chen, Yulin; Ding, Shutong; Zhou, Qingyuan; Fu, Yanwei
contents	Classical on-policy algorithms such as PPO and mirror descent policy optimization provide stable proximal policy updates through tractable action likelihoods, but are typically instantiated with simple Gaussian policies whose expressiveness can be limited in complex continuous-control tasks. Generative policies based on diffusion and flow models provide more expressive action distributions, but they naturally define distributions over multi-step denoising paths whose terminal action density is often intractable, creating a mismatch with likelihood-based on-policy proximal updates. To address this mismatch, we introduce GSB-MDPO (Generalized Schrödinger Bridge Mirror Descent Policy Optimization), which formulates on-policy generative policy optimization as a Generalized Schrödinger Bridge problem over state-conditioned generation paths and instantiates the resulting path-measure update through mirror descent policy optimization. The key insight is that the GSB path-space KL plays the role of the proximal term in MDPO while upper-bounding the terminal action KL, enabling direct control of the executed action distribution without explicit terminal action likelihood evaluation. Experiments on 14 continuous-control tasks across Playground and Gym-MuJoCo demonstrate the empirical effectiveness of GSB-MDPO and support path-space regularization as a principled proximal update for multi-step generative policies.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_21621
institution	arXiv
publishDate	2026
record_format	arxiv
title	Path-Space Mirror Descent for On-Policy Reinforcement Learning under the Generalized Schrödinger Bridge
topic	Machine Learning
url	https://arxiv.org/abs/2603.21621
