Bibliographic Details
Main Authors: Huang, Haidong, Zhu, Haiyue, Song, Jiayu, Zhao, Xixin, Zhou, Yaohua, Zhang, Jiayi, Zhai, Yuze, Li, Xiaocong
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2511.10087
_version_ 1866908650368401408
author Huang, Haidong
Zhu, Haiyue
Song, Jiayu
Zhao, Xixin
Zhou, Yaohua
Zhang, Jiayi
Zhai, Yuze
Li, Xiaocong
author_facet Huang, Haidong
Zhu, Haiyue
Song, Jiayu
Zhao, Xixin
Zhou, Yaohua
Zhang, Jiayi
Zhai, Yuze
Li, Xiaocong
contents Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves a +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
format Preprint
id arxiv_https___arxiv_org_abs_2511_10087
institution arXiv
publishDate 2025
record_format arxiv
title Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning
topic Robotics
Artificial Intelligence
Machine Learning
68T05
I.2.8; I.2.9
url https://arxiv.org/abs/2511.10087