Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Anschel, Oron, Shoshan, Alon, Botach, Adam, Hakimi, Shunit Haviv, Gendler, Asaf, Baruch, Emanuel Ben, Bhonker, Nadav, Kviatkovsky, Igor, Aggarwal, Manoj, Medioni, Gerard
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Artificial Intelligence Machine Learning
Online Access:	https://arxiv.org/abs/2511.12596
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.

Similar Items