Bibliographic Details
Main Authors: Ye, Zhoutong, Sun, Mingze, Gao, Huan-ang, Wang, Xutong, Wang, Xiangyang, Mei, Yu, Liu, Chang, Li, Qinwei, Zhang, Chengwen, Lan, Qinghuan, Yu, Chun, Shi, Yuanchun
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2503.09348
author Ye, Zhoutong
Sun, Mingze
Gao, Huan-ang
Wang, Xutong
Wang, Xiangyang
Mei, Yu
Liu, Chang
Li, Qinwei
Zhang, Chengwen
Lan, Qinghuan
Yu, Chun
Shi, Yuanchun
contents Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, adoption of LMMs in real-world tasks is hindered by their poor performance on tasks that require a combination of VL capabilities, as well as on tasks that involve grounding complex text or visual instructions. To thoroughly investigate this gap and its underlying causes, we propose MOAT, a diverse benchmark of 1005 complex real-world vision questions that are straightforward for humans but challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating VL capabilities such as reading text, counting, understanding spatial relations, and grounding textual and visual instructions. All these abilities fit into a taxonomy of 9 VL capabilities that we propose, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. In addition, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential for many real-world applications. We evaluated 17 proprietary and open-source LMMs and found that the best-performing LMM (Gemini 2.5 Pro) achieved only 44% accuracy, far below what would be acceptable in real-world applications. To guide future model development, we analyze common trends in our results and discuss the underlying causes of poor performance, focusing on the impact of text-centric reasoning, the VL capabilities that form bottlenecks in complex tasks, and the potentially harmful effects of tiling. Code and data are available at https://cambrian-yzt.github.io/MOAT/.
format Preprint
id arxiv_https___arxiv_org_abs_2503_09348
institution arXiv
publishDate 2025
record_format arxiv
title MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
topic Computation and Language
Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.09348