Staff View: ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training :: Library Catalog


Bibliographic Details
Main Authors:	Ai, Rui; Pan, Yu; Simchi-Levi, David; Wang, Chonghuan
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.29871

_version_	1866918420591673344
author	Ai, Rui; Pan, Yu; Simchi-Levi, David; Wang, Chonghuan
author_facet	Ai, Rui; Pan, Yu; Simchi-Levi, David; Wang, Chonghuan
contents	In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_29871
institution	arXiv
publishDate	2026
record_format	arxiv
title	ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.29871
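
The contents field above describes decomposing a set-level reward into candidate-specific signals with the Shapley value. The sketch below illustrates only the textbook Shapley definition, not the authors' method: it averages each candidate's marginal contribution over every ordering, so it runs in factorial time, whereas the abstract claims a polynomial-time formulation. The names `shapley_values` and `max_utility`, the example scores, and the choice of "set utility = best candidate score" are all assumptions made for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(candidates, utility):
    """Exact Shapley values by brute force over all orderings.

    `utility` maps a list of candidates to a scalar set-level reward.
    Cost is O(n! * n) utility calls, so this is only for tiny sets.
    """
    n = len(candidates)
    totals = [0.0] * n
    for order in permutations(range(n)):
        chosen = []
        prev = utility(chosen)
        for i in order:
            chosen.append(candidates[i])
            cur = utility(chosen)
            # Marginal contribution of candidate i in this ordering.
            totals[i] += cur - prev
            prev = cur
    return [t / factorial(n) for t in totals]


def max_utility(scores):
    # Hypothetical set-level utility (an assumption, not from the record):
    # the set is worth as much as its best candidate; empty set is worth 0.
    return max(scores, default=0.0)


# Hypothetical per-candidate quality scores: one strong, two weak.
scores = [0.9, 0.2, 0.1]
phi = shapley_values(scores, max_utility)

# Under a shared set-level reward, every candidate would receive 0.9.
shared = [max_utility(scores)] * len(scores)
```

With these assumed scores, the shared reward credits all three candidates equally, while the Shapley allocation gives most of the credit to the strong candidate and still sums to the set-level reward (the efficiency axiom). This is the free-riding contrast the abstract describes.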
