:: Library Catalog

Bibliographic Details
Main Authors:	Xu, Wenyuan; Zuo, Xiaochen; Xin, Chao; Yue, Yu; Yan, Lin; Wu, Yonghui
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2504.04950

Similar Items

Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)

Policy Filtration for RLHF to Mitigate Noise in Reward Models
by: Zhang, Chuheng, et al.
Published: (2024)

PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
by: Jiang, Yuhua, et al.
Published: (2025)

Policy Optimization in RLHF: The Impact of Out-of-preference Data
by: Li, Ziniu, et al.
Published: (2023)

Factored Causal Representation Learning for Robust Reward Modeling in RLHF
by: Yang, Yupei, et al.
Published: (2026)

Learning a Pessimistic Reward Model in RLHF
by: Xu, Yinglun, et al.
Published: (2025)

Optimal Design for Reward Modeling in RLHF
by: Scheid, Antoine, et al.
Published: (2024)

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
by: Lu, Taiming, et al.
Published: (2024)

RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)

Reward Generalization in RLHF: A Topological Perspective
by: Qiu, Tianyi, et al.
Published: (2024)

How to Evaluate Reward Models for RLHF
by: Frick, Evan, et al.
Published: (2024)

Unifying Stable Optimization and Reference Regularization in RLHF
by: He, Li, et al.
Published: (2026)

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
by: Huang, Jiawei, et al.
Published: (2025)

Beyond RLHF: A Unified Theoretical Framework of Alignment
by: Yun, Jihun, et al.
Published: (2025)

Provably Efficient Online RLHF with One-Pass Reward Modeling
by: Li, Long-Fei, et al.
Published: (2025)

Group Robust Preference Optimization in Reward-free RLHF
by: Ramesh, Shyam Sundhar, et al.
Published: (2024)

On the Exponential Convergence for Offline RLHF with Pairwise Comparisons
by: Chen, Zhirui, et al.
Published: (2024)

A Theoretical Framework for Partially Observed Reward-States in RLHF
by: Kausik, Chinmaya, et al.
Published: (2024)

Reward Model Overoptimisation in Iterated RLHF
by: Wolf, Lorenz, et al.
Published: (2025)

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)

Dataset Reset Policy Optimization for RLHF
by: Chang, Jonathan D., et al.
Published: (2024)

Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)

Quantile Regression for Distributional Reward Models in RLHF
by: Dorka, Nicolai
Published: (2024)

Information-Theoretic Reward Decomposition for Generalizable RLHF
by: Mao, Liyuan, et al.
Published: (2025)

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models
by: Wang, Austin, et al.
Published: (2026)

Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
by: Dai, Juntao, et al.
Published: (2025)

One Framework to Rule Them All: Unifying RL-Based and RL-Free Methods in RLHF
by: Cai, Xin
Published: (2025)

UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function
by: Wang, Zhichao, et al.
Published: (2024)

Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF
by: Liu, Jing
Published: (2025)

Efficient Federated RLHF via Zeroth-Order Policy Optimization
by: Wang, Deyi, et al.
Published: (2026)

Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
by: Chidambaram, Keertana, et al.
Published: (2026)

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
by: Hu, Jian, et al.
Published: (2024)

Bias Fitting to Mitigate Length Bias of Reward Model in RLHF
by: Zhao, Kangwen, et al.
Published: (2025)

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)

Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF
by: Xiong, Nuoya, et al.
Published: (2025)

Policy Optimization Algorithms in a Unified Framework
by: Wu, Shuang
Published: (2025)

BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF
by: Duan, Kaiwen, et al.
Published: (2025)

DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
by: Li, Gang, et al.
Published: (2025)

A First-Order Logic-Based Alternative to Reward Models in RLHF
by: Jian, Chunjin, et al.
Published: (2025)

Towards a Theoretical Understanding to the Generalization of RLHF
by: Li, Zhaochun, et al.
Published: (2026)