Bibliographic Details
Main Authors: Gong, Yuehu, Wang, Zeyuan, Chen, Yulin, Ding, Shutong, Zhou, Qingyuan, Fu, Yanwei
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.21621
_version_ 1866913169306288128
author Gong, Yuehu
Wang, Zeyuan
Chen, Yulin
Ding, Shutong
Zhou, Qingyuan
Fu, Yanwei
author_facet Gong, Yuehu
Wang, Zeyuan
Chen, Yulin
Ding, Shutong
Zhou, Qingyuan
Fu, Yanwei
contents Classical on-policy algorithms such as PPO and mirror descent policy optimization provide stable proximal policy updates through tractable action likelihoods, but are typically instantiated with simple Gaussian policies whose expressiveness can be limited in complex continuous-control tasks. Generative policies based on diffusion and flow models provide more expressive action distributions, but they naturally define distributions over multi-step denoising paths whose terminal action density is often intractable, creating a mismatch with likelihood-based on-policy proximal updates. To address this mismatch, we introduce GSB-MDPO (Generalized Schrödinger Bridge Mirror Descent Policy Optimization), which formulates on-policy generative policy optimization as a Generalized Schrödinger Bridge problem over state-conditioned generation paths and instantiates the resulting path-measure update through mirror descent policy optimization. The key insight is that the GSB path-space KL plays the role of the proximal term in MDPO while upper-bounding the terminal action KL, enabling direct control of the executed action distribution without explicit terminal action likelihood evaluation. Experiments on 14 continuous-control tasks across Playground and Gym-MuJoCo demonstrate the empirical effectiveness of GSB-MDPO and support path-space regularization as a principled proximal update for multi-step generative policies.
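The abstract's claim that a path-space KL upper-bounds the terminal action KL is an instance of the general fact that marginalizing two joint distributions cannot increase their KL divergence. The following is a minimal toy sketch of that inequality only, not the authors' GSB-MDPO implementation: it assumes small, randomly generated finite Markov chains standing in for the "new" and "old" multi-step generation processes, and all function names (`path_distribution`, `terminal_marginal`, `compare_kls`) are hypothetical.

```python
import itertools
import numpy as np

def random_chain(rng, n, steps):
    """Random initial distribution and one row-stochastic matrix per step."""
    init = rng.dirichlet(np.ones(n))
    transitions = [rng.dirichlet(np.ones(n), size=n) for _ in range(steps)]
    return init, transitions

def path_distribution(init, transitions):
    """Enumerate every path of the chain and return {path: probability}."""
    n = len(init)
    dist = {}
    for path in itertools.product(range(n), repeat=len(transitions) + 1):
        p = init[path[0]]
        for k, T in enumerate(transitions):
            p *= T[path[k], path[k + 1]]
        dist[path] = p
    return dist

def terminal_marginal(dist, n):
    """Marginal distribution of the last element of the path."""
    marginal = np.zeros(n)
    for path, p in dist.items():
        marginal[path[-1]] += p
    return marginal

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def compare_kls(seed=0, n=3, steps=4):
    """Return (path-space KL, terminal KL) for two random toy chains."""
    rng = np.random.default_rng(seed)
    new = path_distribution(*random_chain(rng, n, steps))
    old = path_distribution(*random_chain(rng, n, steps))
    paths = list(new)
    path_kl = kl([new[x] for x in paths], [old[x] for x in paths])
    term_kl = kl(terminal_marginal(new, n), terminal_marginal(old, n))
    return path_kl, term_kl

if __name__ == "__main__":
    path_kl, term_kl = compare_kls()
    print(f"path-space KL = {path_kl:.4f}, terminal KL = {term_kl:.4f}")
```

In this toy setting the first returned value is never smaller than the second, which is why bounding the path-space term also bounds the divergence of the final sampled element. How the paper estimates or constrains the path-space KL for diffusion or flow policies is not stated in this record.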
format Preprint
id arxiv_https___arxiv_org_abs_2603_21621
institution arXiv
publishDate 2026
record_format arxiv
title Path-Space Mirror Descent for On-Policy Reinforcement Learning under the Generalized Schrödinger Bridge
topic Machine Learning
url https://arxiv.org/abs/2603.21621