Bibliographic Details
Main Authors: Lu, Xudong; Chen, Yinghao; Wu, Renshou; Gao, Haohao; Chen, Xi; Yang, Xue; Zhao, Xiangyu; Zhou, Aojun; Li, Fangyuan; Wen, Yafei; Chen, Xiaoxin; Ren, Shuai; Li, Hongsheng
Format: Preprint
Published: 2025
Subjects: Computation and Language; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.06019
_version_ 1866913725523427328
author Lu, Xudong
Chen, Yinghao
Wu, Renshou
Gao, Haohao
Chen, Xi
Yang, Xue
Zhao, Xiangyu
Zhou, Aojun
Li, Fangyuan
Wen, Yafei
Chen, Xiaoxin
Ren, Shuai
Li, Hongsheng
contents Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.
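The structural idea in the abstract (frozen original weights, duplicated blocks that are fully fine-tuned, lightweight LoRA modules) can be illustrated with a toy sketch. This is not the authors' implementation. Several things here are assumptions made only for illustration: the names `Block`, `LoRA`, and `ToyModel`, the choice to attach LoRA to the blocks that are not duplicated, the routing of text-only inputs through the untouched original blocks, and the use of tiny linear maps in place of real transformer blocks.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w by vector x using plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


@dataclass
class Block:
    """Toy stand-in for a transformer block: a single linear map."""
    weight: Matrix
    trainable: bool = False

    def forward(self, x: List[float]) -> List[float]:
        return matvec(self.weight, x)


@dataclass
class LoRA:
    """Low-rank update B(Ax); b starts at zero so the update starts at zero."""
    a: Matrix  # shape: rank x dim
    b: Matrix  # shape: dim x rank

    def delta(self, x: List[float]) -> List[float]:
        return matvec(self.b, matvec(self.a, x))


class ToyModel:
    def __init__(self, blocks: List[Block], duplicate_at: List[int], rank: int = 1):
        self.base = blocks  # original blocks, never modified (frozen)
        dim = len(blocks[0].weight)
        # Trainable deep copies of the selected blocks.
        self.copies: Dict[int, Block] = {
            i: Block(copy.deepcopy(blocks[i].weight), trainable=True)
            for i in duplicate_at
        }
        # Assumption: LoRA adapters sit on the blocks that were not duplicated.
        self.lora: Dict[int, LoRA] = {
            i: LoRA(a=[[0.01] * dim for _ in range(rank)],
                    b=[[0.0] * rank for _ in range(dim)])
            for i in range(len(blocks)) if i not in self.copies
        }

    def forward(self, x: List[float], multimodal: bool) -> List[float]:
        for i, block in enumerate(self.base):
            if not multimodal:
                # Text-only path: original frozen weights only (assumed routing).
                x = block.forward(x)
            elif i in self.copies:
                # Multimodal path: use the fine-tuned duplicate of this block.
                x = self.copies[i].forward(x)
            else:
                # Multimodal path: frozen block output plus the LoRA update.
                y = block.forward(x)
                x = [yi + di for yi, di in zip(y, self.lora[i].delta(x))]
        return x


blocks = [Block([[2.0, 0.0], [0.0, 2.0]]) for _ in range(4)]
model = ToyModel(blocks, duplicate_at=[1, 3])
model.copies[1].weight[0][0] = 3.0  # simulate fine-tuning one duplicated block
text_out = model.forward([1.0, 1.0], multimodal=False)
mm_out = model.forward([1.0, 1.0], multimodal=True)
```

In this sketch, changing a duplicated block affects only the multimodal path, while the text-only path keeps producing the output of the original weights. That is the property the abstract attributes to freezing the original LLM parameters.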
format Preprint
id arxiv_https___arxiv_org_abs_2503_06019
institution arXiv
publishDate 2025
record_format arxiv
title GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
topic Computation and Language
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.06019