author Wang, Maijunxian
Wang, Ruisi
Lin, Juyi
Ji, Ran
Wiedemer, Thaddäus
Gao, Qingying
Luo, Dezhi
Qian, Yaoyao
Huang, Lianyu
Hong, Zelong
Ge, Jiahui
Ma, Qianli
He, Hang
Zhou, Yifan
Guo, Lingzi
Mei, Lantao
Li, Jiachen
Xing, Hanwen
Zhao, Tianqi
Yu, Fengyuan
Xiao, Weihang
Jiao, Yizheng
Hou, Jianheng
Zhang, Danyang
Xu, Pengcheng
Zhong, Boyang
Zhao, Zehong
Fang, Gaoyun
Kitaoka, John
Xu, Yile
Xu, Hua
Blacutt, Kenton
Nguyen, Tin
Song, Siyuan
Sun, Haoran
Wen, Shaoyue
He, Linyang
Wang, Runming
Wang, Yanzhi
Yang, Mengyue
Ma, Ziqiao
Millière, Raphaël
Shi, Freda
Vasconcelos, Nuno
Khashabi, Daniel
Yuille, Alan
Du, Yilun
Liu, Ziming
Li, Bo
Lin, Dahua
Liu, Ziwei
Kumar, Vikash
Li, Yijiang
Yang, Lei
Cai, Zhongang
Deng, Hokin
contents Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .
format Preprint
id arxiv_https___arxiv_org_abs_2602_20159
institution arXiv
publishDate 2026
record_format arxiv
title A Very Big Video Reasoning Suite
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Machine Learning
Multimedia
Robotics
url https://arxiv.org/abs/2602.20159