Bibliographic Details
Main Authors: Anschel, Oron; Shoshan, Alon; Botach, Adam; Hakimi, Shunit Haviv; Gendler, Asaf; Baruch, Emanuel Ben; Bhonker, Nadav; Kviatkovsky, Igor; Aggarwal, Manoj; Medioni, Gerard
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2511.12596
Abstract:
  • Large Language Models (LLMs) often suffer from mode collapse: they repeatedly generate the same few completions even when many valid answers exist, which limits their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
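
The abstract does not give the reward formula, so the following is only an illustrative sketch of what a frequency-aware, group-level reward could look like, not the authors' implementation. The function names (`frequency_aware_rewards`, `group_relative_advantages`), the "uniform share minus observed share" scoring rule, and the fixed penalty of -1.0 for invalid completions are all assumptions made for illustration. The second function applies the group-wise mean and standard deviation normalization used in GRPO.

```python
from collections import Counter
from statistics import mean, pstdev


def frequency_aware_rewards(completions, valid_answers):
    """Hypothetical group-level reward (an assumption, not the paper's formula).

    Each valid completion is scored by how far its observed share of the
    group falls below the uniform target share, so over-sampled answers
    are penalized and under-sampled ones are rewarded.
    """
    counts = Counter(c for c in completions if c in valid_answers)
    target_share = 1.0 / len(valid_answers)
    rewards = []
    for c in completions:
        if c not in valid_answers:
            rewards.append(-1.0)  # assumed fixed penalty for invalid output
        else:
            observed_share = counts[c] / len(completions)
            rewards.append(target_share - observed_share)
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A collapsed group: "cat" dominates although three answers are valid.
    group = ["cat", "cat", "cat", "dog"]
    valid = {"cat", "dog", "bird"}
    rewards = frequency_aware_rewards(group, valid)
    advantages = group_relative_advantages(rewards)
    for completion, r, a in zip(group, rewards, advantages):
        print(f"{completion}: reward={r:+.3f} advantage={a:+.3f}")
```

The point of the sketch is that the reward for one completion cannot be computed in isolation: it depends on what the rest of the sampled group contains, which is the group-level property the abstract describes.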