Bibliographic Details
Main Authors: Gao, Haowen, Zhang, Zhenyu, Pang, Liang, Guo, Fangda, Dou, Hongjian, Lv, Guannan, Liu, Shaoguo, Gao, Tingting, Shen, Huawei, Cheng, Xueqi
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.01106
author Gao, Haowen
Zhang, Zhenyu
Pang, Liang
Guo, Fangda
Dou, Hongjian
Lv, Guannan
Liu, Shaoguo
Gao, Tingting
Shen, Huawei
Cheng, Xueqi
contents Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
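The "advantage vanishing" issue named in the abstract can be illustrated with the usual group-normalized advantage used in GRPO-style training, where each reward is centered by the group mean and scaled by the group standard deviation. The sketch below is not taken from the DIVA-GRPO code; the function name `group_relative_advantages` and the `eps` value are assumptions made for illustration, and the difficulty-weighted scaling proposed in the paper is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward by the group mean and scale by the group std.

    Hypothetical helper for illustration; eps avoids division by zero
    and is an assumed value, not one from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages (a usable signal).
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Overly hard problem: every sampled response fails, so all advantages are zero.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Overly easy problem: every sampled response succeeds, with the same result.
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every response in a group receives the same reward, each centered reward is zero, so the group contributes no gradient signal. This is the situation the abstract says arises for overly easy or overly hard problems.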
format Preprint
id arxiv_https___arxiv_org_abs_2603_01106
institution arXiv
publishDate 2026
record_format arxiv
title DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
topic Artificial Intelligence
url https://arxiv.org/abs/2603.01106