:: Library Catalog

Bibliographic Details
Main Authors:	Song, Ruike; Song, Zeen; Guo, Huijie; Qiang, Wenwen
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2508.04216

Similar Items

Reward Model Generalization for Compute-Aware Test-Time Reasoning
by: Song, Zeen, et al.
Published: (2025)

Hacking Task Confounder in Meta-Learning
by: Wang, Jingyao, et al.
Published: (2023)

Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)

Reward Hacking Mitigation using Verifiable Composite Rewards
by: Tarek, Mirza Farhan Bin, et al.
Published: (2025)

Repairing Reward Functions with Feedback to Mitigate Reward Hacking
by: Hatgis-Kessell, Stephane, et al.
Published: (2025)

Not All Frequencies Are Created Equal: Towards a Dynamic Fusion of Frequencies in Time-Series Forecasting
by: Zhang, Xingyu, et al.
Published: (2024)

Adaptive Uncertainty-Aware Tree Search for Robust Reasoning
by: Song, Zeen, et al.
Published: (2026)

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)

Learning to Reason without External Rewards
by: Zhao, Xuandong, et al.
Published: (2025)

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
by: Singha, Disha
Published: (2026)

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
by: Zhou, Shuyi, et al.
Published: (2026)

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
by: Wang, Chaoqi, et al.
Published: (2025)

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)

Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
by: Liu, Zixuan, et al.
Published: (2026)

Beyond All-to-All: Causal-Aligned Transformer with Dynamic Structure Learning for Multivariate Time Series Forecasting
by: Zhang, Xingyu, et al.
Published: (2025)

ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)

SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
by: Lian, Jiesong, et al.
Published: (2025)

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
by: Ono, Shinnosuke, et al.
Published: (2026)

Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
by: Wang, Jingyao, et al.
Published: (2026)

UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
by: Fu, Lingling, et al.
Published: (2025)

Group Causal Policy Optimization for Post-Training Large Language Models
by: Gu, Ziyin, et al.
Published: (2025)

Defining and Characterizing Reward Hacking
by: Skalse, Joar, et al.
Published: (2022)

On the Out-of-Distribution Generalization of Self-Supervised Learning
by: Qiang, Wenwen, et al.
Published: (2025)

Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
by: Wang, Jingyao, et al.
Published: (2025)

On the Generalization and Causal Explanation in Self-Supervised Learning
by: Qiang, Wenwen, et al.
Published: (2024)

From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
by: Yu, Zhuohao, et al.
Published: (2026)

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
by: Wu, Rui, et al.
Published: (2026)

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
by: Deng, Wenlong, et al.
Published: (2026)

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
by: Laidlaw, Cassidy, et al.
Published: (2024)

The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
by: Roth, Amit, et al.
Published: (2026)

Closing the Loop: A Control-Theoretic Framework for Provably Stable Time Series Forecasting with LLMs
by: Zhang, Xingyu, et al.
Published: (2026)

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
by: Wang, Xiaohua, et al.
Published: (2026)

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
by: Wang, Ye, et al.
Published: (2026)

Detecting and Suppressing Reward Hacking with Gradient Fingerprints
by: Wang, Songtao, et al.
Published: (2026)

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)

MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
by: Zhai, Kevin, et al.
Published: (2025)

MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
by: Ichihara, Yuki, et al.
Published: (2025)