Staff View: Can Vision Language Models Learn Intuitive Physics from Interaction? :: Library Catalog

Bibliographic Details
Main Authors:	Buschoff, Luca M. Schulze; Voudouris, Konstantinos; Demircan, Can; Schulz, Eric
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.06033

_version_	1866911740515581952
author	Buschoff, Luca M. Schulze; Voudouris, Konstantinos; Demircan, Can; Schulz, Eric
author_facet	Buschoff, Luca M. Schulze; Voudouris, Konstantinos; Demircan, Can; Schulz, Eric
contents	Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with a simulated environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_06033
institution	arXiv
publishDate	2026
record_format	arxiv
title	Can Vision Language Models Learn Intuitive Physics from Interaction?
topic	Machine Learning
url	https://arxiv.org/abs/2602.06033

Similar Items