Staff View: SpatialVID: A Large-Scale Video Dataset with Spatial Annotations :: Library Catalog


Bibliographic Details
Main Authors:	Wang, Jiahao; Yuan, Yufeng; Zheng, Rujie; Lin, Youtian; Gao, Jian; Chen, Lin-Zhuo; Bao, Yajie; Zhang, Yi; Zeng, Chang; Zhou, Yanxi; Long, Xiao-Xiao; Zhu, Hao; Zhang, Zhaoxiang; Cao, Xun; Yao, Yao
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.09676

_version_	1866911324760440832
author	Wang, Jiahao; Yuan, Yufeng; Zheng, Rujie; Lin, Youtian; Gao, Jian; Chen, Lin-Zhuo; Bao, Yajie; Zhang, Yi; Zeng, Chang; Zhou, Yanxi; Long, Xiao-Xiao; Zhu, Hao; Zhang, Zhaoxiang; Cao, Xun; Yao, Yao
contents	Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly fosters improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
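The headline figures in the abstract imply two derived quantities that the record does not state directly: the average clip length and the share of raw footage that survives filtering. The sketch below simply does that arithmetic on the quoted numbers. It is illustrative only: the clip count is given as a rounded "2.7 million" and the raw footage as "more than 21,000 hours", so the mean length is approximate and the retained fraction is an upper bound. The function names are hypothetical and not part of any SpatialVID tooling.

```python
# Figures quoted in the abstract. The clip count is rounded and the raw
# duration is a lower bound, so both derived values are approximate.
RAW_HOURS = 21_000      # raw video collected ("more than 21,000 hours")
CLIP_COUNT = 2_700_000  # clips after hierarchical filtering ("2.7 million")
CLIP_HOURS = 7_089      # total duration of the retained clips


def mean_clip_seconds(total_hours: float, n_clips: int) -> float:
    """Average clip duration in seconds (hypothetical helper)."""
    return total_hours * 3600 / n_clips


def retained_fraction(kept_hours: float, raw_hours: float) -> float:
    """Share of raw footage kept after filtering (hypothetical helper).

    raw_hours is a lower bound on the true raw duration, so the
    returned value is an upper bound on the true retention ratio.
    """
    return kept_hours / raw_hours


if __name__ == "__main__":
    print(f"mean clip length ~ {mean_clip_seconds(CLIP_HOURS, CLIP_COUNT):.1f} s")
    print(f"retained fraction <= {retained_fraction(CLIP_HOURS, RAW_HOURS):.1%}")
```

With these inputs the clips average roughly 9.5 seconds each, and at most about a third (33.8%) of the collected footage is retained.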
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_09676
institution	arXiv
publishDate	2025
record_format	arxiv
title	SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.09676

Similar Items