Bibliographic Details
Main Authors: Wang, Qixin; Cao, Hao; Hu, Jian-Qiang; Hu, Mingjie; Xia, Li
Format: Preprint
Published: 2026
Subjects: Optimization and Control
Online Access: https://arxiv.org/abs/2603.09734
_version_ 1866911503252193280
author Wang, Qixin
Cao, Hao
Hu, Jian-Qiang
Hu, Mingjie
Xia, Li
contents Conditional value-at-risk (CVaR) is a prominent risk measure in financial engineering, energy systems, and supply chain management. In these domains, Markov decision processes (MDPs) with a long-run CVaR criterion effectively mitigate cost variability over a specified horizon. However, implementing MDPs relies on known transition models, which are typically unavailable in practice. This necessitates a model-free approach to risk-sensitive dynamic optimization. To tackle this challenge, we propose a reinforcement learning algorithm that simultaneously conducts policy evaluation and improvement based on a CVaR-specific Bellman local optimality equation. This algorithm employs a nonparametric incremental learning approach for policy improvement, relying on a single sample trajectory to identify the optimal policy. Under appropriate technical conditions, we prove almost sure convergence of the algorithm and derive its convergence rate. Our analysis reveals that the optimal convergence rate, measured by the mean absolute error of policy estimators, is of order O(1/n). Our main algorithm and results are further extended to solving the mean-CVaR optimization problem. Numerical experiments corroborate these results.
format Preprint
id arxiv_https___arxiv_org_abs_2603_09734
institution arXiv
publishDate 2026
record_format arxiv
title Long-Run Conditional Value-at-Risk Reinforcement Learning
topic Optimization and Control
url https://arxiv.org/abs/2603.09734