MARC21: :: Library Catalog

Salvato in:

Dettagli Bibliografici
Autori principali:	Chen, Zijun, Chen, Zaiwei, Si, Nian, Wang, Shengbo
Natura:	Preprint
Pubblicazione:	2026
Soggetti:	Machine Learning
Accesso online:	https://arxiv.org/abs/2601.21301
Tags:	Aggiungi Tag Nessun Tag, puoi essere il primo ad aggiungerne!!

_version_	1866915760634331136
author	Chen, Zijun Chen, Zaiwei Si, Nian Wang, Shengbo
author_facet	Chen, Zijun Chen, Zaiwei Si, Nian Wang, Shengbo
contents	We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_21301
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle Chen, Zijun Chen, Zaiwei Si, Nian Wang, Shengbo Machine Learning We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
title	Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle
topic	Machine Learning
url	https://arxiv.org/abs/2601.21301

Documenti analoghi