Bibliographic Details
Main Authors: Wang, Jiaqi; Jiang, Hanqi; Liu, Yiheng; Ma, Chong; Zhang, Xu; Pan, Yi; Liu, Mengyuan; Gu, Peiran; Xia, Sichen; Li, Wenjun; Zhang, Yutong; Wu, Zihao; Liu, Zhengliang; Zhong, Tianyang; Ge, Bao; Zhang, Tuo; Qiang, Ning; Hu, Xintao; Jiang, Xi; Zhang, Xin; Zhang, Wei; Shen, Dinggang; Liu, Tianming; Zhang, Shu
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2408.01319
_version_ 1866916344548556800
author Wang, Jiaqi
Jiang, Hanqi
Liu, Yiheng
Ma, Chong
Zhang, Xu
Pan, Yi
Liu, Mengyuan
Gu, Peiran
Xia, Sichen
Li, Wenjun
Zhang, Yutong
Wu, Zihao
Liu, Zhengliang
Zhong, Tianyang
Ge, Bao
Zhang, Tuo
Qiang, Ning
Hu, Xintao
Jiang, Xi
Zhang, Xin
Zhang, Wei
Shen, Dinggang
Liu, Tianming
Zhang, Shu
contents In an era defined by the explosive growth of data and rapid technological advancement, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to integrate diverse data types, including text, images, video, audio, and physiological sequences, MLLMs address the complexities of real-world applications that lie beyond the capabilities of single-modality systems. In this paper, we systematically review the applications of MLLMs in multimodal tasks involving natural language, vision, and audio. We also provide a comparative analysis of the focus of different MLLMs across these tasks, discuss the shortcomings of current MLLMs, and suggest potential directions for future research. Through these discussions, we hope to provide useful insights for the further development and application of MLLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2408_01319
institution arXiv
publishDate 2024
record_format arxiv
title A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
topic Artificial Intelligence
url https://arxiv.org/abs/2408.01319