Bibliographic Details
Main Authors: Weng, Tengjin; Wang, Jingyi; Jiang, Wenhao; Ming, Zhong
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.14939
author Weng, Tengjin
Wang, Jingyi
Jiang, Wenhao
Ming, Zhong
contents Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. Code and dataset are available at https://wwwtttjjj.github.io/VisNumBench/.
format Preprint
id arxiv_https___arxiv_org_abs_2503_14939
institution arXiv
publishDate 2025
record_format arxiv
title VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.14939