

Bibliographic Details
Main authors:	Tangri, Rohan; Calliess, Jan-Peter
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online access:	https://arxiv.org/abs/2601.22993

_version_	1866913074670206976
author	Tangri, Rohan; Calliess, Jan-Peter
author_facet	Tangri, Rohan; Calliess, Jan-Peter
contents	We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample-efficient and conservative method for solving Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ Cantelli's inequality to obtain a tractable approximation based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide worst-case bounds for both policy improvement and constraint violation during the training process.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22993
institution	arXiv
publishDate	2026
record_format	arxiv
title	Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk
topic	Machine Learning
url	https://arxiv.org/abs/2601.22993
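
Note on the abstract: the Cantelli-based approximation it mentions can be illustrated with a minimal sketch. Cantelli's inequality states that P(C - mu >= lam) <= sigma^2 / (sigma^2 + lam^2) for lam > 0, so a chance constraint P(C >= d) <= epsilon is implied by mu + sigma * sqrt((1 - epsilon) / epsilon) <= d. The function names, the use of empirical moments, and the sample data below are assumptions made for illustration; this is a generic textbook form of the bound, not necessarily the exact surrogate used in the paper.

```python
import math
import statistics


def cantelli_var_upper_bound(costs, epsilon):
    """Cantelli upper bound on the (1 - epsilon)-level VaR of the cost return.

    Uses only the first two moments (mean and standard deviation).
    Hypothetical helper for illustration; moments are estimated from samples,
    so the result is an estimate of the bound, not a guarantee.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    mean = statistics.fmean(costs)
    std = statistics.pstdev(costs)  # population standard deviation
    return mean + std * math.sqrt((1.0 - epsilon) / epsilon)


def satisfies_surrogate(costs, threshold, epsilon):
    """True if the moment-based surrogate constraint holds for the samples."""
    return cantelli_var_upper_bound(costs, epsilon) <= threshold


# Illustrative cost returns from five hypothetical episodes.
episode_costs = [1.0, 2.0, 3.0, 4.0, 5.0]
bound = cantelli_var_upper_bound(episode_costs, epsilon=0.2)
feasible = satisfies_surrogate(episode_costs, threshold=6.0, epsilon=0.2)
```

Because the bound depends only on the mean and variance, it is differentiable in those moments, which is what makes it usable where the VaR itself is not. It is also conservative: it holds for any distribution with those two moments, so it may reject policies that actually satisfy the original constraint.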
