Staff View: EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Xing, Yifei; Lan, Xiangyuan; Wang, Ruiping; Jiang, Dongmei; Huang, Wenjun; Zheng, Qingfang; Wang, Yaowei
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2410.05938

_version_	1866913537783234560
author	Xing, Yifei; Lan, Xiangyuan; Wang, Ruiping; Jiang, Dongmei; Huang, Wenjun; Zheng, Qingfang; Wang, Yaowei
author_facet	Xing, Yifei; Lan, Xiangyuan; Wang, Ruiping; Jiang, Dongmei; Huang, Wenjun; Zheng, Qingfang; Wang, Yaowei
contents	Mamba-based architectures have been shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLM) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents, negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code will be provided.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_05938
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment Xing, Yifei Lan, Xiangyuan Wang, Ruiping Jiang, Dongmei Huang, Wenjun Zheng, Qingfang Wang, Yaowei Computer Vision and Pattern Recognition Artificial Intelligence Mamba-based architectures have been shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLM) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents, negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code will be provided.
title	EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2410.05938

Similar Items