Bibliographic Details
Main Authors: Wang, Yipu, Ji, Yuheng, Liu, Yuyang, Zhou, Enshen, Yang, Ziqiang, Tian, Yuxuan, Qin, Ziheng, Liu, Yue, Tan, Huajie, Chi, Cheng, Ma, Zhiyuan, Zeng, Daniel Dajun, Zheng, Xiaolong
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2512.04686
author Wang, Yipu
Ji, Yuheng
Liu, Yuyang
Zhou, Enshen
Yang, Ziqiang
Tian, Yuxuan
Qin, Ziheng
Liu, Yue
Tan, Huajie
Chi, Cheng
Ma, Zhiyuan
Zeng, Daniel Dajun
Zheng, Xiaolong
contents Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, Vision-Language Models (VLMs) are still far from achieving it, especially at the level of precise point correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the difficulty of moving from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset of 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy, and offers a foundation for future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
format Preprint
id arxiv_https___arxiv_org_abs_2512_04686
institution arXiv
publishDate 2025
record_format arxiv
title Towards Cross-View Point Correspondence in Vision-Language Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.04686