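Staff View: When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks :: Library Catalog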

Bibliographic Details
Main Authors:	Ferraro, Stefano; Nakano, Akihiro; Suzuki, Masahiro; Matsuo, Yutaka
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.06136

_version_	1866918195665829888
author	Ferraro, Stefano; Nakano, Akihiro; Suzuki, Masahiro; Matsuo, Yutaka
contents	Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
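The abstract attributes unstable policy learning to representation shift ("latent drift") in object-level latents during multi-object interactions. As a rough illustration of what a per-object latent-trajectory check could look like, here is a minimal sketch. It is not the authors' analysis: the data layout (trajectory[t][k] is the latent vector of object slot k at step t), the assumption that slot k tracks the same object over time, the L2 step-to-step distance, and the helper names slot_drift and flag_unstable_steps are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def slot_drift(trajectory: Sequence[Sequence[Vector]]) -> List[List[float]]:
    """For each consecutive pair of timesteps, return the L2 distance between
    each slot's latent at t and at t+1.

    Hypothetical layout: trajectory[t][k] is the latent of object slot k at
    step t, and slot k is assumed to refer to the same object at every step.
    """
    drifts: List[List[float]] = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        if len(prev) != len(curr):
            raise ValueError("slot count must stay constant over the trajectory")
        # math.dist computes the Euclidean distance between two points
        drifts.append([math.dist(a, b) for a, b in zip(prev, curr)])
    return drifts


def flag_unstable_steps(drifts: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return indices t where any slot moved more than `threshold` from t to t+1."""
    return [t for t, step in enumerate(drifts) if max(step) > threshold]


# Toy usage: slot 1 jumps between steps 1 and 2 (e.g. during an interaction).
traj = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [4.0, 5.0]],
]
drifts = slot_drift(traj)
print(flag_unstable_steps(drifts, threshold=1.0))
```

A large jump in one slot's latent while the scene changes only slightly would be one symptom of the kind of drift the abstract describes; a real analysis would also need to handle slot permutations, which this sketch ignores.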
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_06136
institution	arXiv
publishDate	2025
record_format	arxiv
title	When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks
topic	Artificial Intelligence
url	https://arxiv.org/abs/2511.06136
