Staff View: Trust-Region Twisted Policy Improvement :: Library Catalog


Bibliographic Details
Main Authors:	de Vries, Joery A.; He, Jinke; Oren, Yaniv; Spaan, Matthijs T. J.
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2504.06048

_version_	1866918085744656384
author	de Vries, Joery A.; He, Jinke; Oren, Yaniv; Spaan, Matthijs T. J.
author_facet	de Vries, Joery A.; He, Jinke; Oren, Yaniv; Spaan, Matthijs T. J.
contents	Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically for RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_06048
institution	arXiv
publishDate	2025
record_format	arxiv
title	Trust-Region Twisted Policy Improvement
topic	Machine Learning
url	https://arxiv.org/abs/2504.06048
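The abstract describes SMC planners that propagate particle trajectories, reweight them by reward, and resample. As a rough illustration of that generic idea only (this is not the authors' TRT-SMC, whose trust-region constrained sampling and improved target estimation are not reproduced here), the following Python sketch runs a basic particle-filter planner on a hypothetical toy chain environment. The environment, `smc_plan`, the exp(return / temperature) weighting, the even split of particles over root actions, and the ESS < N/2 resampling rule are all illustrative assumptions.

```python
import math
import random
from collections import Counter

# Hypothetical toy environment (illustration only): a chain of states 0..GOAL,
# actions -1 / +1, reward 1.0 on reaching GOAL, and GOAL is terminal.
GOAL = 5
ACTIONS = (-1, 1)


def step(state, action):
    """Return (next_state, reward, done) for the toy chain; state 0 is a wall."""
    nxt = min(max(state + action, 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def effective_sample_size(log_w):
    """ESS = (sum w)^2 / sum w^2, computed stably from log-weights."""
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    return sum(w) ** 2 / sum(x * x for x in w)


def smc_plan(root, n_particles=64, depth=8, temperature=0.5, seed=0):
    """Generic SMC planner sketch (not TRT-SMC); returns a root action distribution."""
    rng = random.Random(seed)
    # Each particle is [root_action, state, done]; root actions are split evenly.
    particles = [[ACTIONS[i % len(ACTIONS)], root, False] for i in range(n_particles)]
    log_w = [0.0] * n_particles
    for t in range(depth):
        for i, p in enumerate(particles):
            if p[2]:
                continue  # terminal particles are frozen: no new transitions or reward
            # First step uses the particle's root action; later steps use a uniform prior.
            a = p[0] if t == 0 else rng.choice(ACTIONS)
            p[1], r, p[2] = step(p[1], a)
            log_w[i] += r / temperature  # weight proportional to exp(return / temperature)
        if effective_sample_size(log_w) < n_particles / 2:
            # Multinomial resampling, then reset to uniform weights.
            m = max(log_w)
            w = [math.exp(lw - m) for lw in log_w]
            idx = rng.choices(range(n_particles), weights=w, k=n_particles)
            particles = [list(particles[j]) for j in idx]
            log_w = [0.0] * n_particles
    # Improved root policy: normalized weight mass per originating root action.
    m = max(log_w)
    totals = Counter()
    for p, lw in zip(particles, log_w):
        totals[p[0]] += math.exp(lw - m)
    z = sum(totals.values())
    return {a: totals[a] / z for a in ACTIONS}


if __name__ == "__main__":
    print(smc_plan(4))
```

For example, `smc_plan(4)` puts more mass on action `+1`, which reaches the terminal goal immediately. Freezing terminal particles means each trajectory collects the goal reward only once instead of accumulating it after the episode has ended.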
