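Staff View: PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models :: Library Catalog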

Bibliographic Details
Main Authors:	Mak, Chak-Wing; Zhu, Guanyu; Zhang, Boyi; Li, Hongji; Chi, Xiaowei; Zhang, Kevin; Wu, Yichen; He, Yangfan; Fan, Chun-Kai; Lu, Wentao; Ge, Kuangzhi; Fang, Xinyu; He, Hongyang; Lu, Kuan; Xu, Tianxiang; Zhang, Li; Ni, Yongxin; Li, Youhua; Zhang, Shanghang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.16007

_version_	1866915747584802816
author	Mak, Chak-Wing; Zhu, Guanyu; Zhang, Boyi; Li, Hongji; Chi, Xiaowei; Zhang, Kevin; Wu, Yichen; He, Yangfan; Fan, Chun-Kai; Lu, Wentao; Ge, Kuangzhi; Fang, Xinyu; He, Hongyang; Lu, Kuan; Xu, Tianxiang; Zhang, Li; Ni, Yongxin; Li, Youhua; Zhang, Shanghang
author_facet	Mak, Chak-Wing; Zhu, Guanyu; Zhang, Boyi; Li, Hongji; Chi, Xiaowei; Zhang, Kevin; Wu, Yichen; He, Yangfan; Fan, Chun-Kai; Lu, Wentao; Ge, Kuangzhi; Fang, Xinyu; He, Hongyang; Lu, Kuan; Xu, Tianxiang; Zhang, Li; Ni, Yongxin; Li, Youhua; Zhang, Shanghang
contents	Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure it rely on synthetic Visual Question Answering (VQA) templates or focus on perceptual video quality, which is tangential to how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason about and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_16007
institution	arXiv
publishDate	2026
record_format	arxiv
title	PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2601.16007
