Bibliographic Details
Main Authors: Mak, Chak-Wing, Zhu, Guanyu, Zhang, Boyi, Li, Hongji, Chi, Xiaowei, Zhang, Kevin, Wu, Yichen, He, Yangfan, Fan, Chun-Kai, Lu, Wentao, Ge, Kuangzhi, Fang, Xinyu, He, Hongyang, Lu, Kuan, Xu, Tianxiang, Zhang, Li, Ni, Yongxin, Li, Youhua, Zhang, Shanghang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.16007
contents Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure it rely on synthetic Visual Question Answering (VQA) templates or focus on perceptual video quality, which is tangential to how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason about and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent MLLMs and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
format Preprint
id arxiv_https___arxiv_org_abs_2601_16007
institution arXiv
publishDate 2026
record_format arxiv
title PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2601.16007