Bibliographic Details
Main Authors: Gao, Haowen, Zhang, Zhenyu, Pang, Liang, Guo, Fangda, Dou, Hongjian, Lv, Guannan, Liu, Shaoguo, Gao, Tingting, Shen, Huawei, Cheng, Xueqi
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.01106
Table of Contents:
  • Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
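The "advantage vanishing" problem mentioned in the abstract can be illustrated with the commonly used GRPO group-relative advantage, in which each reward is standardized against the mean and standard deviation of its own group. The sketch below is an illustration only, not the DIVA-GRPO method: the function name `group_relative_advantages`, the `eps` stabilizer, and the binary rewards are assumptions made for this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (hypothetical helper).

    Follows the commonly described GRPO form: (r - mean) / (std + eps).
    The eps value is an assumption chosen for this illustration.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes gives non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response fails (an overly hard problem) has zero
# reward variance, so every advantage is exactly zero and the group
# contributes no policy-gradient signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(uniform)
```

The same collapse happens when every response in the group succeeds, which is why the abstract pairs overly easy and overly hard problems as the two cases where within-group reward variance disappears.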