Staff View: CapGeo: A Caption-Assisted Approach to Geometric Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Li, Yuying; Qian, Siyi; Liang, Hao; Zheng, Leqi; An, Ruichuan; Guo, Yongzhen; Zhang, Wentao
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2510.09302

_version_	1866914086859571200
author	Li, Yuying; Qian, Siyi; Liang, Hao; Zheng, Leqi; An, Ruichuan; Guo, Yongzhen; Zhang, Wentao
author_facet	Li, Yuying; Qian, Siyi; Liang, Hao; Zheng, Leqi; An, Ruichuan; Guo, Yongzhen; Zhang, Wentao
contents	Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_09302
institution	arXiv
publishDate	2025
record_format	arxiv
title	CapGeo: A Caption-Assisted Approach to Geometric Reasoning
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2510.09302

Similar Items