Bibliographic Details
Main Authors:	Kumar, Sunil; Zhao, Bowen; Dirac, Leo; Varshavskaya, Paulina
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
Links:	https://arxiv.org/abs/2506.14821

Contents:

Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
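
As a rough illustration of the group-relative idea behind GRPO mentioned in the abstract, the sketch below normalizes each sampled answer's reward against the mean and standard deviation of its own group. It is a minimal sketch, not the authors' training code: the function name, the `eps` term, the choice of population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the other samples in the same group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one question
# (1.0 = correct, 0.0 = incorrect); these are not results from the paper.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that beat their group's average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to provide a baseline.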