Bibliographic Details
Main Authors: Huang, Wenke; Zhang, Quan; Fang, Yiyang; Liang, Jian; Rong, Xuankun; Yao, Huanjin; Wan, Guancheng; Liang, Ke; He, Wenwen; Li, Mingjun; Rutkowski, Leszek; Ye, Mang; Du, Bo; Tao, Dacheng
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2509.18849
contents Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved performance on reasoning tasks. The advantage function is the central mechanism in GRPO for ranking the importance of trajectories. However, existing approaches suffer from both advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We show that trajectories exhibit different levels of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adapting the advantage function to sample-specific characteristics. Comparisons with related state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
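The abstract names three ingredients: a group-relative advantage, an advantage percent deviation, and a certainty-dependent reweighting between them. The sketch below illustrates how such a mix could be computed for one group of rollouts. Only the z-score form of the group-relative advantage is standard GRPO practice. The percent-deviation formula, the certainty weight, and every function name here are assumptions made for illustration, since the abstract does not state them.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: z-score of each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def percent_deviation_advantages(rewards, eps=1e-8):
    """ASSUMED form of a percent deviation: deviation relative to the group mean."""
    mu = mean(rewards)
    return [(r - mu) / (mu + eps) for r in rewards]


def certainty_weight(rewards):
    """ASSUMED certainty score for binary rewards.

    Returns 0 when half the rollouts succeed (most uncertain)
    and 1 when all rollouts agree (most certain).
    """
    p = mean(rewards)  # empirical success rate of the group
    return 1.0 - 4.0 * p * (1.0 - p)


def mixed_advantages(rewards):
    """Blend the two advantages, leaning on percent deviation as certainty grows."""
    lam = certainty_weight(rewards)
    z = grpo_advantages(rewards)
    d = percent_deviation_advantages(rewards)
    return [(1.0 - lam) * zi + lam * di for zi, di in zip(z, d)]


if __name__ == "__main__":
    # Four rollouts for one query, reward 1 for a correct answer and 0 otherwise.
    print(mixed_advantages([1, 0, 1, 0]))  # uncertain group: pure z-score
    print(mixed_advantages([1, 1, 1, 0]))  # more certain group: partly percent deviation
```

With an evenly split group the assumed weight is zero and the result reduces to the plain z-score advantage. As the success rate moves toward 0 or 1, the blend shifts toward the percent-deviation term.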
format Preprint
id arxiv_https___arxiv_org_abs_2509_18849
institution arXiv
publishDate 2025
record_format arxiv
title MAPO: Mixed Advantage Policy Optimization
topic Artificial Intelligence
url https://arxiv.org/abs/2509.18849