Bibliographic Details
Main Authors: Carvalho, Miguel; Martins, Bruno
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2501.02584
Table of Contents:
  • Vision-Language Models (VLMs) have advanced significantly in recent years. However, they still struggle to accurately recognize fine details in high-resolution images, which limits performance on multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly on tasks that demand fine-grained image understanding and/or the handling of scene text.