Bibliographic Details
Main Authors: Buschoff, Luca M. Schulze; Voudouris, Konstantinos; Demircan, Can; Schulz, Eric
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2602.06033
author Buschoff, Luca M. Schulze
Voudouris, Konstantinos
Demircan, Can
Schulz, Eric
contents Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with a simulated environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
format Preprint
id arxiv_https___arxiv_org_abs_2602_06033
institution arXiv
publishDate 2026
record_format arxiv
title Can Vision Language Models Learn Intuitive Physics from Interaction?
topic Machine Learning
url https://arxiv.org/abs/2602.06033