Bibliographic Details
Main Authors: Chen, Deshu, Liu, Yuchen, Zhou, Zhijian, Qu, Chao, Qi, Yuan
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2509.23087
author Chen, Deshu
Liu, Yuchen
Zhou, Zhijian
Qu, Chao
Qi, Yuan
contents Flow-based policies have recently emerged as a powerful tool in offline and offline-to-online reinforcement learning, capable of modeling the complex, multimodal behaviors found in pre-collected datasets. However, the full potential of these expressive actors is often bottlenecked by their critics, which typically learn a single, scalar estimate of the expected return. To address this limitation, we introduce the Distributional Flow Critic (DFC), a novel critic architecture that learns the complete state-action return distribution. Instead of regressing to a single value, DFC employs flow matching to model the distribution of return as a continuous, flexible transformation from a simple base distribution to the complex target distribution of returns. By doing so, DFC provides the expressive flow-based policy with a rich, distributional Bellman target, which offers a more stable and informative learning signal. Extensive experiments across D4RL and OGBench benchmarks demonstrate that our approach achieves strong performance, especially on tasks requiring multimodal action distributions, and excels in both offline and offline-to-online fine-tuning compared to existing methods.
format Preprint
id arxiv_https___arxiv_org_abs_2509_23087
institution arXiv
publishDate 2025
record_format arxiv
title Unleashing Flow Policies with Distributional Critics
topic Machine Learning
url https://arxiv.org/abs/2509.23087