Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteurs principaux:	Ridel, Oran, Cohen, Alon
Format:	Preprint
Publié:	2026
Sujets:	Machine Learning
Accès en ligne:	https://arxiv.org/abs/2602.16363
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866910223088746496
author	Ridel, Oran Cohen, Alon
author_facet	Ridel, Oran Cohen, Alon
contents	We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_16363
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Improved Bounds for Reward-Agnostic and Reward-Free Exploration Ridel, Oran Cohen, Alon Machine Learning We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.
title	Improved Bounds for Reward-Agnostic and Reward-Free Exploration
topic	Machine Learning
url	https://arxiv.org/abs/2602.16363

Documents similaires