| Main authors: | Song, Ruike; Song, Zeen; Guo, Huijie; Qiang, Wenwen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2508.04216 |
Similar Items
Reward Model Generalization for Compute-Aware Test-Time Reasoning
by: Song, Zeen, et al.
Published: (2025)
Hacking Task Confounder in Meta-Learning
by: Wang, Jingyao, et al.
Published: (2023)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
Reward Hacking Mitigation using Verifiable Composite Rewards
by: Tarek, Mirza Farhan Bin, et al.
Published: (2025)
Repairing Reward Functions with Feedback to Mitigate Reward Hacking
by: Hatgis-Kessell, Stephane, et al.
Published: (2025)
Not All Frequencies Are Created Equal: Towards a Dynamic Fusion of Frequencies in Time-Series Forecasting
by: Zhang, Xingyu, et al.
Published: (2024)
Adaptive Uncertainty-Aware Tree Search for Robust Reasoning
by: Song, Zeen, et al.
Published: (2026)
Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)
Learning to Reason without External Rewards
by: Zhao, Xuandong, et al.
Published: (2025)
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
by: Singha, Disha
Published: (2026)
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)
From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
by: Zhou, Shuyi, et al.
Published: (2026)
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
by: Wang, Chaoqi, et al.
Published: (2025)
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)
Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
by: Liu, Zixuan, et al.
Published: (2026)
Beyond All-to-All: Causal-Aligned Transformer with Dynamic Structure Learning for Multivariate Time Series Forecasting
by: Zhang, Xingyu, et al.
Published: (2025)
ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)
SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
by: Lian, Jiesong, et al.
Published: (2025)
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
by: Ono, Shinnosuke, et al.
Published: (2026)
Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
by: Wang, Jingyao, et al.
Published: (2026)
UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
by: Fu, Lingling, et al.
Published: (2025)
Group Causal Policy Optimization for Post-Training Large Language Models
by: Gu, Ziyin, et al.
Published: (2025)
Defining and Characterizing Reward Hacking
by: Skalse, Joar, et al.
Published: (2022)
On the Out-of-Distribution Generalization of Self-Supervised Learning
by: Qiang, Wenwen, et al.
Published: (2025)
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
by: Wang, Jingyao, et al.
Published: (2025)
On the Generalization and Causal Explanation in Self-Supervised Learning
by: Qiang, Wenwen, et al.
Published: (2024)
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
by: Yu, Zhuohao, et al.
Published: (2026)
When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
by: Wu, Rui, et al.
Published: (2026)
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
by: Deng, Wenlong, et al.
Published: (2026)
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
by: Laidlaw, Cassidy, et al.
Published: (2024)
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
by: Roth, Amit, et al.
Published: (2026)
Closing the Loop: A Control-Theoretic Framework for Provably Stable Time Series Forecasting with LLMs
by: Zhang, Xingyu, et al.
Published: (2026)
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
by: Wang, Xiaohua, et al.
Published: (2026)
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
by: Wang, Ye, et al.
Published: (2026)
Detecting and Suppressing Reward Hacking with Gradient Fingerprints
by: Wang, Songtao, et al.
Published: (2026)
IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)
MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
by: Zhai, Kevin, et al.
Published: (2025)
MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
by: Ichihara, Yuki, et al.
Published: (2025)