Bibliographic Details
Main Authors: Niu, Junbo; Zheng, Yuanhong; Miao, Ziyang; Dong, Hejun; Ge, Chunjiang; Liang, Hao; Lu, Ma; Zeng, Bohan; Zheng, Qiahao; He, Conghui; Zhang, Wentao
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2506.12776
Abstract:
  • Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
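To make the contrast between fixed-resolution and native-resolution inputs concrete, the following is a minimal sketch of the patch-grid arithmetic only. It is not taken from NativeRes-LLaVA: the function names are hypothetical, and the patch size of 14 pixels and the fixed input side of 336 pixels are illustrative assumptions typical of ViT-style encoders.

```python
import math

# Assumed values for illustration only; not taken from the record above.
PATCH = 14        # assumed ViT patch size in pixels
FIXED_SIDE = 336  # assumed fixed square input side in pixels


def fixed_resolution_grid(width, height, side=FIXED_SIDE, patch=PATCH):
    """Fixed-input encoding: every image is resized to side x side,
    so the patch grid is the same whatever the original shape was."""
    n = side // patch
    return n, n  # (rows, cols)


def native_resolution_grid(width, height, patch=PATCH):
    """Native-resolution encoding (sketch): keep the original size and
    round each dimension up to a whole number of patches."""
    return math.ceil(height / patch), math.ceil(width / patch)  # (rows, cols)


if __name__ == "__main__":
    for w, h in [(336, 336), (1920, 1080), (224, 1792)]:
        f_rows, f_cols = fixed_resolution_grid(w, h)
        n_rows, n_cols = native_resolution_grid(w, h)
        print(f"{w}x{h}: fixed {f_rows}x{f_cols} = {f_rows * f_cols} tokens, "
              f"native {n_rows}x{n_cols} = {n_rows * n_cols} tokens")
```

Under these assumptions, a fixed-input encoder always produces the same square grid, so wide or tall images are distorted or downsampled, whereas the native grid follows the aspect ratio of the image and grows with its resolution.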