Staff View: Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Miao, Ziqi; Jia, Haonan; Li, Lijun; Qian, Chen; Xiong, Yuan; Yan, Wenting; Shao, Jing
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.28618

_version_	1866911577908707328
author	Miao, Ziqi; Jia, Haonan; Li, Lijun; Qian, Chen; Xiong, Yuan; Yan, Wenting; Shao, Jing
author_facet	Miao, Ziqi; Jia, Haonan; Li, Lijun; Qian, Chen; Xiong, Yuan; Yan, Wenting; Shao, Jing
contents	Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO consistently improves average accuracy across model scales, by over 7 points relative to the base model, and outperforms prior open-source RL-tuned baselines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_28618
institution	arXiv
publishDate	2026
record_format	arxiv
title	Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.28618

Similar Items