Staff View: Unifying Stable Optimization and Reference Regularization in RLHF :: Library Catalog

Bibliographic Details
Main Authors:	He, Li; Qu, Qiang; Zhao, He; Wan, Stephen; Wang, Dadong; Yao, Lina; Liu, Tongliang
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.11523

_version_	1866911442495602688
author	He, Li; Qu, Qiang; Zhao, He; Wan, Stephen; Wang, Dadong; Yao, Lina; Liu, Tongliang
author_facet	He, Li; Qu, Qiang; Zhao, He; Wan, Stephen; Wang, Dadong; Yao, Lina; Liu, Tongliang
contents	Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions address these issues independently through separate regularization strategies: a KL-divergence penalty against a supervised fine-tuned model ($\pi_0$) to mitigate reward hacking, and policy ratio clipping towards the current policy ($\pi_t$) to promote stable alignment. However, the implicit trade-off arising from simultaneously regularizing towards both $\pi_0$ and $\pi_t$ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which improves alignment results and reduces implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference learning methods, achieving enhanced alignment performance and stability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_11523
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unifying Stable Optimization and Reference Regularization in RLHF
topic	Machine Learning
url	https://arxiv.org/abs/2602.11523

Similar Items