Bibliographic Details
Main Authors: Zhang, Jiahui, Chen, Yurui, Zhou, Yanpeng, Xu, Yueming, Huang, Ze, Mei, Jilin, Chen, Junhui, Yuan, Yu-Jie, Cai, Xinyue, Huang, Guowei, Quan, Xingyue, Xu, Hang, Zhang, Li
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.22976
author Zhang, Jiahui
Chen, Yurui
Zhou, Yanpeng
Xu, Yueming
Huang, Ze
Mei, Jilin
Chen, Junhui
Yuan, Yu-Jie
Cai, Xinyue
Huang, Guowei
Quan, Xingyue
Xu, Hang
Zhang, Li
contents Recent advances in large vision-language models (LVLMs) have improved vision-language understanding, but these models still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities than existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
format Preprint
id arxiv_https___arxiv_org_abs_2503_22976
institution arXiv
publishDate 2025
record_format arxiv
title From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.22976