| Main Authors: | Ranasinghe, Kanchana; Shukla, Satya Narayan; Poursaeed, Omid; Ryoo, Michael S.; Lin, Tsung-Yu |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2404.07449 |
| _version_ | 1866909166651572224 |
|---|---|
| author | Ranasinghe, Kanchana; Shukla, Satya Narayan; Poursaeed, Omid; Ryoo, Michael S.; Lin, Tsung-Yu |
| author_facet | Ranasinghe, Kanchana; Shukla, Satya Narayan; Poursaeed, Omid; Ryoo, Michael S.; Lin, Tsung-Yu |
| contents | Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2404_07449 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2404.07449 |