Staff View: Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Shao, Yawen; Xiao, Jie; Zhu, Kai; Liu, Yu; Zhai, Wei; Cao, Yang; Zha, Zheng-Jun
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2512.12387

_version_	1866917144872091648
author	Shao, Yawen; Xiao, Jie; Zhu, Kai; Liu, Yu; Zhai, Wei; Cao, Yang; Zha, Zheng-Jun
author_facet	Shao, Yawen; Xiao, Jie; Zhu, Kai; Liu, Yu; Zhai, Wei; Cao, Yang; Zha, Zheng-Jun
contents	Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO for flow matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases from early structure formation to late-stage tuning. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to optimization stagnation when reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking. Project webpage: https://yawen-shao.github.io/VGPO/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_12387
institution	arXiv
publishDate	2025
record_format	arxiv
title	Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment
topic	Machine Learning
url	https://arxiv.org/abs/2512.12387

Similar Items