Bibliographic Details
Main Authors: Wang, Haicheng, Yu, Zhemeng, Spadaro, Gabriele, Ju, Chen, Quétu, Victor, Xiao, Shuai, Tartaglione, Enzo
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2501.02430
author Wang, Haicheng
Yu, Zhemeng
Spadaro, Gabriele
Ju, Chen
Quétu, Victor
Xiao, Shuai
Tartaglione, Enzo
contents Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their ability to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive analysis of the token reduction process, we characterize the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
format Preprint
id arxiv_https___arxiv_org_abs_2501_02430
institution arXiv
publishDate 2025
record_format arxiv
title FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2501.02430