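Staff View: Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput :: Library Catalog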

Bibliographic Details
Main Authors:	Zhang, Bo; Li, Shuo; Tian, Runhe; Yang, Yang; Tang, Jixin; Zhou, Jinhao; Ma, Lin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.09498

_version_	1866915287329144832
author	Zhang, Bo; Li, Shuo; Tian, Runhe; Yang, Yang; Tang, Jixin; Zhou, Jinhao; Ma, Lin
author_facet	Zhang, Bo; Li, Shuo; Tian, Runhe; Yang, Yang; Tang, Jixin; Zhou, Jinhao; Ma, Lin
contents	In this paper, we introduce Flash-VL 2B, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_09498
institution	arXiv
publishDate	2025
record_format	arxiv
title	Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2505.09498