Bibliographic Details
Main Authors: Wang, Jiahao, Yuan, Yufeng, Zheng, Rujie, Lin, Youtian, Gao, Jian, Chen, Lin-Zhuo, Bao, Yajie, Zhang, Yi, Zeng, Chang, Zhou, Yanxi, Long, Xiao-Xiao, Zhu, Hao, Zhang, Zhaoxiang, Cao, Xun, Yao, Yao
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2509.09676
_version_ 1866911324760440832
author Wang, Jiahao
Yuan, Yufeng
Zheng, Rujie
Lin, Youtian
Gao, Jian
Chen, Lin-Zhuo
Bao, Yajie
Zhang, Yi
Zeng, Chang
Zhou, Yanxi
Long, Xiao-Xiao
Zhu, Hao
Zhang, Zhaoxiang
Cao, Xun
Yao, Yao
contents Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly fosters improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
format Preprint
id arxiv_https___arxiv_org_abs_2509_09676
institution arXiv
publishDate 2025
record_format arxiv
title SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.09676