Staff View: Forgotten Polygons: Multimodal Large Language Models are Shape-Blind :: Library Catalog

Bibliographic Details
Main Authors:	Rudman, William; Golovanevsky, Michal; Bar, Amir; Palit, Vedant; LeCun, Yann; Eickhoff, Carsten; Singh, Ritambhara
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2502.15969

_version_	1866916915477217280
author	Rudman, William; Golovanevsky, Michal; Bar, Amir; Palit, Vedant; LeCun, Yann; Eickhoff, Carsten; Singh, Ritambhara
author_facet	Rudman, William; Golovanevsky, Michal; Bar, Amir; Palit, Vedant; LeCun, Yann; Eickhoff, Carsten; Singh, Ritambhara
contents	Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting that they have not learned the concept of sides and do not effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_15969
institution	arXiv
publishDate	2025
record_format	arxiv
title	Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2502.15969

Similar Items