Bibliographic Details
Main Authors: Chen, Xuyang; Yan, Keyu; Cao, Wenhan; Zhao, Lin
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.05126
Table of Contents:
  • Offline reinforcement learning (RL) learns policies from fixed datasets without online interactions, but suffers from distribution shift, causing inaccurate evaluation and overestimation of out-of-distribution (OOD) actions. Existing methods counter this by conservatively discouraging all OOD actions, which limits generalization. We propose Advantage-based Diffusion Actor-Critic (ADAC), which evaluates OOD actions via an advantage-like function and uses it to modulate the Q-function update discriminatively. Our key insight is that the (state) value function is generally learned more reliably than the action-value function; we thus use the next-state value to indirectly assess each action. We develop a PointMaze environment to clearly visualize that advantage modulation effectively selects superior OOD actions while discouraging inferior ones. Moreover, extensive experiments on the D4RL benchmark show that ADAC achieves state-of-the-art performance, with especially strong gains on challenging tasks.
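The abstract's central idea, scoring an action indirectly through the value of the state it leads to, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the one-dimensional state, the hand-written `toy_value` function, the deterministic `toy_step` dynamics, the difference form `V(s') - V(s)` for the advantage-like score, and the additive rule in `modulated_q_target`. The actual ADAC formulation may differ in all of these.

```python
# Toy illustration only: names, dynamics, and the modulation rule are
# hypothetical and are not the formulation used in the paper.

GOAL = 10.0


def toy_value(state: float) -> float:
    """Hand-written state value: negative distance to the goal."""
    return -abs(GOAL - state)


def toy_step(state: float, action: float) -> float:
    """Assumed deterministic dynamics: the action is a displacement."""
    return state + action


def advantage_like(state: float, action: float) -> float:
    """Score an action by how much the next-state value improves on the current one."""
    next_state = toy_step(state, action)
    return toy_value(next_state) - toy_value(state)


def modulated_q_target(base_target: float, advantage: float, scale: float = 1.0) -> float:
    """Assumed additive modulation: raise the target for positive scores, lower it for negative ones."""
    return base_target + scale * advantage


# Two candidate actions from state 3.0, neither assumed to be in the dataset.
state = 3.0
toward_goal = advantage_like(state, +1.0)   # moves closer to the goal
away_from_goal = advantage_like(state, -1.0)  # moves further from the goal

target_toward = modulated_q_target(0.0, toward_goal)
target_away = modulated_q_target(0.0, away_from_goal)
```

In this sketch the action that moves toward the goal receives a positive score and a raised target, while the action that moves away is penalized, which mirrors the selective treatment of OOD actions that the abstract describes, as opposed to discouraging both uniformly.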