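Staff View: M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer :: Library Catalog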

Bibliographic Details
Main Authors:	Zhang, Yanxin; He, Liang; Kang, Zeyi; Ming, Zuheng; Zhao, Kaixing
Format:	Preprint
Published:	2025
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2509.18005

_version_	1866912599668424704
author	Zhang, Yanxin; He, Liang; Kang, Zeyi; Ming, Zuheng; Zhao, Kaixing
author_facet	Zhang, Yanxin; He, Liang; Kang, Zeyi; Ming, Zuheng; Zhao, Kaixing
contents	In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality, relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly with significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multi Modal Mamba Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantic-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3 times increase in pretraining inference speed. In particular, the core VQA task accuracy of M3ET remains at 0.74, while the model's parameter count is reduced by 0.67. Although performance on the EQA task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_18005
institution	arXiv
publishDate	2025
record_format	arxiv
title	M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer
topic	Robotics
url	https://arxiv.org/abs/2509.18005
