Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
1. Verfasser:	Tolpin, David
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Machine Learning
Online-Zugang:	https://arxiv.org/abs/2602.17375
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866911586373861376
author	Tolpin, David
author_facet	Tolpin, David
contents	We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_17375
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	MDP Planning as Policy Inference Tolpin, David Machine Learning We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.
title	MDP Planning as Policy Inference
topic	Machine Learning
url	https://arxiv.org/abs/2602.17375

Ähnliche Einträge