MARC record :: Library Catalog

Bibliographic details
Main authors:	Kumar, Sunil; Zhao, Bowen; Dirac, Leo; Varshavskaya, Paulina
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
Online access:	https://arxiv.org/abs/2506.14821

_version_	1866916880564879360
author	Kumar, Sunil; Zhao, Bowen; Dirac, Leo; Varshavskaya, Paulina
author_facet	Kumar, Sunil; Zhao, Bowen; Dirac, Leo; Varshavskaya, Paulina
contents	Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_14821
institution	arXiv
publishDate	2025
record_format	arxiv
title	Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints
topic	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2506.14821
