Staff View: VisNumBench: Evaluating Number Sense of Multimodal Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Weng, Tengjin; Wang, Jingyi; Jiang, Wenhao; Ming, Zhong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.14939

_version_	1866908473691734016
author	Weng, Tengjin; Wang, Jingyi; Jiang, Wenhao; Ming, Zhong
author_facet	Weng, Tengjin; Wang, Jingyi; Jiang, Wenhao; Ming, Zhong
contents	Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. Code and dataset are available at https://wwwtttjjj.github.io/VisNumBench/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_14939
institution	arXiv
publishDate	2025
record_format	arxiv
title	VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.14939
