Staff View: MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding :: Library Catalog

Bibliographic Details
Main Authors:	Ye, Zhoutong; Sun, Mingze; Gao, Huan-ang; Wang, Xutong; Wang, Xiangyang; Mei, Yu; Liu, Chang; Li, Qinwei; Zhang, Chengwen; Lan, Qinghuan; Yu, Chun; Shi, Yuanchun
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.09348

_version_	1866908706392768512
author	Ye, Zhoutong; Sun, Mingze; Gao, Huan-ang; Wang, Xutong; Wang, Xiangyang; Mei, Yu; Liu, Chang; Li, Qinwei; Zhang, Chengwen; Lan, Qinghuan; Yu, Chun; Shi, Yuanchun
author_facet	Ye, Zhoutong; Sun, Mingze; Gao, Huan-ang; Wang, Xutong; Wang, Xiangyang; Mei, Yu; Liu, Chang; Li, Qinwei; Zhang, Chengwen; Lan, Qinghuan; Yu, Chun; Shi, Yuanchun
contents	Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, adoption of LMMs in real-world tasks is hindered by their poor performance in tasks that require a combination of VL capabilities, as well as in tasks that involve the grounding of complex text or visual instructions. To thoroughly investigate this gap and its underlying causes, we propose MOAT, a diverse benchmark with 1005 complex real-world vision questions that are straightforward for humans but challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 9 VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential for many real-world applications. We evaluated 17 proprietary and open source LMMs, finding that the best performing LMM (Gemini 2.5 Pro) achieved only 44% accuracy, far below what would be acceptable in real-world applications. To guide future model development, we analyze common trends in our results and discuss the underlying causes of poor performance, focusing on the impact of text-centric reasoning, which VL capabilities form bottlenecks in complex tasks, and the potential harmful effects of tiling. Code and data are available at https://cambrian-yzt.github.io/MOAT/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_09348
institution	arXiv
publishDate	2025
record_format	arxiv
title	MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
topic	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.09348
