Saved in:
Bibliographic Details
Main Authors: Li, Lichen, Zhou, Hengguang, Liang, Yijun, Zhou, Tianyi, Hsieh, Cho-Jui
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2604.23488
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866918468317609984
author Li, Lichen
Zhou, Hengguang
Liang, Yijun
Zhou, Tianyi
Hsieh, Cho-Jui
author_facet Li, Lichen
Zhou, Hengguang
Liang, Yijun
Zhou, Tianyi
Hsieh, Cho-Jui
contents Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a "resampling-until-hack" mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to "in-the-wild" hacking, and (2) monitors trained on our "in-the-wild" trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at https://github.com/LichenLillc/CoTMonitoring.git
format Preprint
id arxiv_https___arxiv_org_abs_2604_23488
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Li, Lichen
Zhou, Hengguang
Liang, Yijun
Zhou, Tianyi
Hsieh, Cho-Jui
Machine Learning
Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a "resampling-until-hack" mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to "in-the-wild" hacking, and (2) monitors trained on our "in-the-wild" trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at https://github.com/LichenLillc/CoTMonitoring.git
title Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
topic Machine Learning
url https://arxiv.org/abs/2604.23488