| Main Authors: | Li, Lichen, Zhou, Hengguang, Liang, Yijun, Zhou, Tianyi, Hsieh, Cho-Jui |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.23488 |
Similar Items
Understanding Reward Hacking in Text-to-Image Reinforcement Learning
by: Hong, Yunqi, et al.
Published: (2026)
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
by: Zhou, Hengguang, et al.
Published: (2025)
ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)
MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
by: Li, Xirui, et al.
Published: (2024)
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
by: Roth, Amit, et al.
Published: (2026)
Defining and Characterizing Reward Hacking
by: Skalse, Joar, et al.
Published: (2022)
Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
by: Khalifa, Muhammad, et al.
Published: (2026)
SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
by: Lian, Jiesong, et al.
Published: (2025)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
Reward Hacking Mitigation using Verifiable Composite Rewards
by: Tarek, Mirza Farhan Bin, et al.
Published: (2025)
Repairing Reward Functions with Feedback to Mitigate Reward Hacking
by: Hatgis-Kessell, Stephane, et al.
Published: (2025)
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
by: Chen, Zihan, et al.
Published: (2025)
Detecting and Suppressing Reward Hacking with Gradient Fingerprints
by: Wang, Songtao, et al.
Published: (2026)
Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
by: Deshpande, Darshan, et al.
Published: (2026)
EvilGenie: A Reward Hacking Benchmark
by: Gabor, Jonathan, et al.
Published: (2025)
Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
by: Liu, Zixuan, et al.
Published: (2026)
Inference-Time Reward Hacking in Large Language Models
by: Khalaf, Hadi, et al.
Published: (2025)
Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)
GARDO: Reinforcing Diffusion Models without Reward Hacking
by: He, Haoran, et al.
Published: (2025)
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
by: Wang, Xiaohua, et al.
Published: (2026)
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
by: Singha, Disha
Published: (2026)
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
by: Wang, Chaoqi, et al.
Published: (2025)
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
by: Deng, Wenlong, et al.
Published: (2026)
LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
by: Helff, Lukas, et al.
Published: (2026)
Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction
by: Song, Ruike, et al.
Published: (2025)
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
by: Yu, Zhuohao, et al.
Published: (2026)
Hacking Predictors Means Hacking Cars: Using Sensitivity Analysis to Identify Trajectory Prediction Vulnerabilities for Autonomous Driving Security
by: Gibson, Marsalis, et al.
Published: (2024)
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
by: Ono, Shinnosuke, et al.
Published: (2026)
Feedback Loops With Language Models Drive In-Context Reward Hacking
by: Pan, Alexander, et al.
Published: (2024)
UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
by: Fu, Lingling, et al.
Published: (2025)
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
by: Thaman, Kunvar
Published: (2026)
When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
by: Wu, Rui, et al.
Published: (2026)
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
by: Laidlaw, Cassidy, et al.
Published: (2024)
Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
by: Wu, Yusong, et al.
Published: (2025)
Hacking Task Confounder in Meta-Learning
by: Wang, Jingyao, et al.
Published: (2023)
Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
by: Rashidinejad, Paria, et al.
Published: (2024)