Bibliographic Details
Main Authors: Li, Zongxia, Yu, Wenhao, Huang, Chengsong, Liang, Zhenwen, Liu, Rui, Liu, Fuxiao, Che, Jingxi, Yu, Dian, Boyd-Graber, Jordan, Mi, Haitao, Yu, Dong
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2508.19652
author Li, Zongxia
Yu, Wenhao
Huang, Chengsong
Liang, Zhenwen
Liu, Rui
Liu, Fuxiao
Che, Jingxi
Yu, Dian
Boyd-Graber, Jordan
Mi, Haitao
Yu, Dong
contents Vision-Language Models (VLMs) often suffer from two problems: visual hallucinations, where they generate content that is inconsistent with the visual input, and language shortcuts, where they skip the visual input and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning: the model is first prompted to produce self-contained visual descriptions sufficient to answer the question without referring back to the input image, and both visual and language reasoning are then jointly optimized through our multi-reward loss objective. To validate this self-containment, the same VLM is re-prompted to perform language reasoning using only the generated visual reasoning as input, and the result is used to compute the visual reward. The final reward is computed through a decoupled reward-advantage framework, in which the visual reward and the language reasoning reward each have their advantages calculated separately. Our experiments show that Vision SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks. It is also more efficient than methods that rely on external visual reward models, which require additional GPUs to host; Vision SR1 introduces no extra GPU overhead beyond that of standard training.
format Preprint
id arxiv_https___arxiv_org_abs_2508_19652
institution arXiv
publishDate 2025
record_format arxiv
title Self-Rewarding Vision-Language Model via Reasoning Decomposition
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2508.19652