Bibliographic Details
Main Authors: Xing, Yifei, Lan, Xiangyuan, Wang, Ruiping, Jiang, Dongmei, Huang, Wenjun, Zheng, Qingfang, Wang, Yaowei
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2410.05938
author Xing, Yifei
Lan, Xiangyuan
Wang, Ruiping
Jiang, Dongmei
Huang, Wenjun
Zheng, Qingfang
Wang, Yaowei
contents Mamba-based architectures have been shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLMs) extract visual features inadequately, leading to imbalanced cross-modal alignment between visual and textual latents and degrading performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module that autoregressively optimizes the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module that combines multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Owing to better cross-modal alignment, our model exhibits less hallucination and greater sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code will be provided.
format Preprint
id arxiv_https___arxiv_org_abs_2410_05938
institution arXiv
publishDate 2024
record_format arxiv
title EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2410.05938