Bibliographic Details
Main Authors: Li, Zhaowei; Wang, Wei; Cai, YiQing; Qi, Xu; Wang, Pengyu; Zhang, Dong; Song, Hang; Jiang, Botian; Huang, Zhida; Wang, Tao
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2408.02503
_version_ 1866929449655599104
author Li, Zhaowei
Wang, Wei
Cai, YiQing
Qi, Xu
Wang, Pengyu
Zhang, Dong
Song, Hang
Jiang, Botian
Huang, Zhida
Wang, Tao
author_facet Li, Zhaowei
Wang, Wei
Cai, YiQing
Qi, Xu
Wang, Pengyu
Zhang, Dong
Song, Hang
Jiang, Botian
Huang, Zhida
Wang, Tao
contents Significant advancements have recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and performing reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and a 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at https://github.com/lzw-lzw/UnifiedMLLM.
format Preprint
id arxiv_https___arxiv_org_abs_2408_02503
institution arXiv
publishDate 2024
record_format arxiv
title UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model
topic Computation and Language
url https://arxiv.org/abs/2408.02503