Bibliographic details
Main authors: Donâncio, Henrique; Barrier, Antoine; South, Leah F.; Forbes, Florence
Format: Preprint
Published: 2024
Subjects:
Online access: https://arxiv.org/abs/2410.12598
_version_ 1866908581470666752
author Donâncio, Henrique
Barrier, Antoine
South, Leah F.
Forbes, Florence
contents In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
format Preprint
id arxiv_https___arxiv_org_abs_2410_12598
institution arXiv
publishDate 2024
record_format arxiv
title Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
topic Machine Learning
url https://arxiv.org/abs/2410.12598
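
The abstract describes selecting the learning rate from a candidate set according to policy performance, using a bandit. The record does not say which bandit algorithm or reward signal the authors use, so the following is only a minimal sketch under stated assumptions: an Exp3-style update with uniform exploration, a hypothetical class name `LearningRateBandit`, hypothetical parameters `eta` and `gamma`, and a toy feedback table standing in for "improvement in return". None of these names or numbers come from the paper.

```python
import math
import random


class LearningRateBandit:
    """Exp3-style selector over candidate learning rates (illustrative only)."""

    def __init__(self, rates, eta=0.2, gamma=0.1, seed=0):
        self.rates = list(rates)
        self.eta = eta      # step size of the score update (assumed value)
        self.gamma = gamma  # share of uniform exploration (assumed value)
        self.scores = [0.0] * len(self.rates)
        self.rng = random.Random(seed)
        self.arm = None

    def probabilities(self):
        # Softmax over scores, mixed with a uniform distribution so that
        # every rate keeps a minimum selection probability of gamma / K.
        top = max(self.scores)
        exps = [math.exp(s - top) for s in self.scores]
        total = sum(exps)
        k = len(self.rates)
        return [(1 - self.gamma) * e / total + self.gamma / k for e in exps]

    def select(self):
        # Sample one candidate rate according to the current probabilities.
        probs = self.probabilities()
        self.arm = self.rng.choices(range(len(self.rates)), weights=probs)[0]
        return self.rates[self.arm]

    def update(self, feedback):
        # Importance-weighted update: only the rate that was actually used
        # is adjusted, scaled by the inverse of its selection probability.
        p = self.probabilities()[self.arm]
        self.scores[self.arm] += self.eta * feedback / p


# Toy stand-in for "change in average return after training with this rate".
# The largest rate is made to hurt performance, mimicking divergence.
RATES = [1e-1, 1e-3, 1e-5]
TOY_FEEDBACK = dict(zip(RATES, [-1.0, 0.5, 0.05]))

bandit = LearningRateBandit(RATES)
for _ in range(200):
    lr = bandit.select()                # rate to use for the next training phase
    bandit.update(TOY_FEEDBACK[lr])     # in practice: measured change in return

probs = bandit.probabilities()
best_rate = RATES[max(range(len(RATES)), key=lambda i: probs[i])]
```

In an actual training loop, `select()` would be called at the start of each training phase and `update()` at its end with a measured, suitably scaled performance signal. The uniform exploration term bounds the importance weights, which keeps a single unlucky draw from dominating the scores.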