Bibliographic Details
Main Authors: Li, Yuying, Qian, Siyi, Liang, Hao, Zheng, Leqi, An, Ruichuan, Guo, Yongzhen, Zhang, Wentao
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2510.09302
author Li, Yuying
Qian, Siyi
Liang, Hao
Zheng, Leqi
An, Ruichuan
Guo, Yongzhen
Zhang, Wentao
contents Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2510_09302
institution arXiv
publishDate 2025
record_format arxiv
title CapGeo: A Caption-Assisted Approach to Geometric Reasoning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2510.09302