Bibliographic Details
Main Authors:	Li, Lichen; Zhou, Hengguang; Liang, Yijun; Zhou, Tianyi; Hsieh, Cho-Jui
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2604.23488

Similar Items

Understanding Reward Hacking in Text-to-Image Reinforcement Learning
by: Hong, Yunqi, et al.
Published: (2026)

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
by: Zhou, Hengguang, et al.
Published: (2025)

ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)

MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
by: Li, Xirui, et al.
Published: (2024)

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
by: Roth, Amit, et al.
Published: (2026)

Defining and Characterizing Reward Hacking
by: Skalse, Joar, et al.
Published: (2022)

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
by: Khalifa, Muhammad, et al.
Published: (2026)

SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
by: Lian, Jiesong, et al.
Published: (2025)

Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)

Reward Hacking Mitigation using Verifiable Composite Rewards
by: Tarek, Mirza Farhan Bin, et al.
Published: (2025)

Repairing Reward Functions with Feedback to Mitigate Reward Hacking
by: Hatgis-Kessell, Stephane, et al.
Published: (2025)

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)

Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
by: Chen, Zihan, et al.
Published: (2025)

Detecting and Suppressing Reward Hacking with Gradient Fingerprints
by: Wang, Songtao, et al.
Published: (2026)

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
by: Deshpande, Darshan, et al.
Published: (2026)

EvilGenie: A Reward Hacking Benchmark
by: Gabor, Jonathan, et al.
Published: (2025)

Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
by: Liu, Zixuan, et al.
Published: (2026)

Inference-Time Reward Hacking in Large Language Models
by: Khalaf, Hadi, et al.
Published: (2025)

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)

GARDO: Reinforcing Diffusion Models without Reward Hacking
by: He, Haoran, et al.
Published: (2025)

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
by: Wang, Xiaohua, et al.
Published: (2026)

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
by: Singha, Disha
Published: (2026)

The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
by: Wang, Chaoqi, et al.
Published: (2025)

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
by: Deng, Wenlong, et al.
Published: (2026)

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
by: Helff, Lukas, et al.
Published: (2026)

Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction
by: Song, Ruike, et al.
Published: (2025)

From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
by: Yu, Zhuohao, et al.
Published: (2026)

Hacking Predictors Means Hacking Cars: Using Sensitivity Analysis to Identify Trajectory Prediction Vulnerabilities for Autonomous Driving Security
by: Gibson, Marsalis, et al.
Published: (2024)

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
by: Ono, Shinnosuke, et al.
Published: (2026)

Feedback Loops With Language Models Drive In-Context Reward Hacking
by: Pan, Alexander, et al.
Published: (2024)

UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
by: Fu, Lingling, et al.
Published: (2025)

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
by: Thaman, Kunvar
Published: (2026)

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
by: Wu, Rui, et al.
Published: (2026)

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
by: Laidlaw, Cassidy, et al.
Published: (2024)

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
by: Wu, Yusong, et al.
Published: (2025)

Hacking Task Confounder in Meta-Learning
by: Wang, Jingyao, et al.
Published: (2023)

Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
by: Rashidinejad, Paria, et al.
Published: (2024)