Staff View: Unleashing Flow Policies with Distributional Critics :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Deshu; Liu, Yuchen; Zhou, Zhijian; Qu, Chao; Qi, Yuan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2509.23087

_version_	1866909812562853888
author	Chen, Deshu; Liu, Yuchen; Zhou, Zhijian; Qu, Chao; Qi, Yuan
author_facet	Chen, Deshu; Liu, Yuchen; Zhou, Zhijian; Qu, Chao; Qi, Yuan
contents	Flow-based policies have recently emerged as a powerful tool in offline and offline-to-online reinforcement learning, capable of modeling the complex, multimodal behaviors found in pre-collected datasets. However, the full potential of these expressive actors is often bottlenecked by their critics, which typically learn a single, scalar estimate of the expected return. To address this limitation, we introduce the Distributional Flow Critic (DFC), a novel critic architecture that learns the complete state-action return distribution. Instead of regressing to a single value, DFC employs flow matching to model the distribution of return as a continuous, flexible transformation from a simple base distribution to the complex target distribution of returns. By doing so, DFC provides the expressive flow-based policy with a rich, distributional Bellman target, which offers a more stable and informative learning signal. Extensive experiments across D4RL and OGBench benchmarks demonstrate that our approach achieves strong performance, especially on tasks requiring multimodal action distributions, and excels in both offline and offline-to-online fine-tuning compared to existing methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_23087
institution	arXiv
publishDate	2025
record_format	arxiv
title	Unleashing Flow Policies with Distributional Critics
topic	Machine Learning
url	https://arxiv.org/abs/2509.23087
