Bibliographic Details
Main Authors: Fei, Hao; Zhou, Yuan; Li, Juncheng; Li, Xiangtai; Xu, Qingshan; Li, Bobo; Wu, Shengqiong; Wang, Yaoting; Zhou, Junbao; Meng, Jiahao; Shi, Qingyu; Zhou, Zhiyuan; Shi, Liangtao; Gao, Minghe; Zhang, Daoan; Ge, Zhiqi; Wu, Weiming; Tang, Siliang; Pan, Kaihang; Ye, Yaobo; Yuan, Haobo; Zhang, Tao; Ju, Tianjie; Meng, Zixiang; Xu, Shilin; Jia, Liyu; Hu, Wentao; Luo, Meng; Luo, Jiebo; Chua, Tat-Seng; Yan, Shuicheng; Zhang, Hanwang
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2505.04620
author Fei, Hao
Zhou, Yuan
Li, Juncheng
Li, Xiangtai
Xu, Qingshan
Li, Bobo
Wu, Shengqiong
Wang, Yaoting
Zhou, Junbao
Meng, Jiahao
Shi, Qingyu
Zhou, Zhiyuan
Shi, Liangtao
Gao, Minghe
Zhang, Daoan
Ge, Zhiqi
Wu, Weiming
Tang, Siliang
Pan, Kaihang
Ye, Yaobo
Yuan, Haobo
Zhang, Tao
Ju, Tianjie
Meng, Zixiang
Xu, Shilin
Jia, Liyu
Hu, Wentao
Luo, Meng
Luo, Jiebo
Chua, Tat-Seng
Yan, Shuicheng
Zhang, Hanwang
contents Multimodal Large Language Models (MLLMs) are experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to both comprehend and generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: can we simply assume that higher performance across tasks indicates stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines five levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. Evaluation results covering over 100 existing state-of-the-art MLLMs reveal the capability rankings of generalists and highlight the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/
format Preprint
id arxiv_https___arxiv_org_abs_2505_04620
institution arXiv
publishDate 2025
record_format arxiv
title On Path to Multimodal Generalist: General-Level and General-Bench
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2505.04620