Staff View: Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty :: Library Catalog

Bibliographic Details
Main Author:	Özalp, Uğurcan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Systems and Control
Online Access:	https://arxiv.org/abs/2601.00737

_version_	1866917180512141312
author	Özalp, Uğurcan
author_facet	Özalp, Uğurcan
contents	Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). This design typically achieves higher sample efficiency than purely on-policy methods. However, critic networks tend to overestimate values systematically. This is often addressed by introducing a pessimistic bias based on uncertainty estimates. Current methods employ ensembling to quantify the critic's epistemic uncertainty (uncertainty due to limited data and model ambiguity) and use it to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that instead incorporates temporal (one-step) aleatoric uncertainty (uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets) to scale the pessimistic bias in temporal-difference updates. STAC uses a single distributional critic network to model the temporal return uncertainty, and applies dropout to both the critic and actor networks for regularization. Our results show that pessimism based on a distributional critic alone suffices to mitigate overestimation and naturally leads to risk-averse behavior in stochastic environments. Introducing dropout further improves training stability and performance. Because it needs only a single distributional critic network rather than an ensemble, STAC also achieves improved computational efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_00737
institution	arXiv
publishDate	2026
record_format	arxiv
title	Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty
topic	Machine Learning; Artificial Intelligence; Systems and Control
url	https://arxiv.org/abs/2601.00737
