Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Li, Jianjian; Fan, Junquan; Tang, Feng; Huang, Gang; Zhu, Shitao; Liu, Songlin; Xie, Nian; Liu, Wulong; Liao, Yong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.18512

_version_	1866917936354033664
author	Li, Jianjian; Fan, Junquan; Tang, Feng; Huang, Gang; Zhu, Shitao; Liu, Songlin; Xie, Nian; Liu, Wulong; Liao, Yong
author_facet	Li, Jianjian; Fan, Junquan; Tang, Feng; Huang, Gang; Zhu, Shitao; Liu, Songlin; Xie, Nian; Liu, Wulong; Liao, Yong
contents	The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a lightweight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_18512
institution	arXiv
publishDate	2025
record_format	arxiv
title	FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2502.18512

Similar Items