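Staff View: MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning :: Library Catalog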

Bibliographic Details
Main Authors:	Liu, Yi; Xu, Xiao; Xu, Zeyu; Zhang, Meng; Li, Yibo; Chen, Haoyu; Zhang, Junkang; Wang, Qiang; Sun, Jifa; Lin, Siling; Cheng, Shengxun; Zhang, Lingshu; Wang, Kang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.01540

_version_	1866911088405118976
author	Liu, Yi; Xu, Xiao; Xu, Zeyu; Zhang, Meng; Li, Yibo; Chen, Haoyu; Zhang, Junkang; Wang, Qiang; Sun, Jifa; Lin, Siling; Cheng, Shengxun; Zhang, Lingshu; Wang, Kang
author_facet	Liu, Yi; Xu, Xiao; Xu, Zeyu; Zhang, Meng; Li, Yibo; Chen, Haoyu; Zhang, Junkang; Wang, Qiang; Sun, Jifa; Lin, Siling; Cheng, Shengxun; Zhang, Lingshu; Wang, Kang
contents	Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. However, the substantial computational and storage demands of VLMs pose significant challenges for their efficient deployment on mobile devices, which represent the most ubiquitous and accessible computing platforms today. In this work, we introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. MagicVL-2B leverages a lightweight visual encoder with fewer than 100M parameters and features a redesigned dynamic resolution scheme that adaptively generates image tokens without excessive modification of image dimensions. To further enhance the performance of this compact encoder within VLMs, we propose a multimodal curriculum learning strategy that incrementally increases task difficulty and data information density throughout training. This approach substantially improves the model's performance across a variety of sub-tasks. Extensive evaluations on standard VLM benchmarks demonstrate that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. These results establish MagicVL-2B as a practical and robust solution for real-world mobile vision-language applications, enabling advanced multimodal intelligence to run directly on smartphones.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_01540
institution	arXiv
publishDate	2025
record_format	arxiv
title	MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2508.01540

Similar Items