| Main Authors: | Wilhelm, Alexander; Zweig, Katharina A. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.16626 |
Similar Items
Monitoring Emergent Reward Hacking During Generation via Internal Activations
by: Wilhelm, Patrick, et al.
Published: (2026)
Quantitative study about the estimated impact of the AI Act
by: Hauer, Marc P., et al.
Published: (2023)
MultiFIX: An XAI-friendly feature inducing approach to building models from multimodal data
by: Malafaia, Mafalda, et al.
Published: (2024)
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
by: Taylor, Mia, et al.
Published: (2025)
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
by: Roth, Amit, et al.
Published: (2026)
Forms of Understanding for XAI-Explanations
by: Buschmeier, Hendrik, et al.
Published: (2023)
Trustworthy XAI and Application
by: Nasim, MD Abdullah Al, et al.
Published: (2024)
Energy Guided Geometric Flow Matching
by: Zweig, Aaron, et al.
Published: (2025)
Guaranteed prediction sets for functional surrogate models
by: Gray, Ander, et al.
Published: (2025)
Hacking CTFs with Plain Agents
by: Turtayev, Rustem, et al.
Published: (2024)
Reward Hacking in Rubric-Based Reinforcement Learning
by: Mahmoud, Anas, et al.
Published: (2026)
Feedback Loops With Language Models Drive In-Context Reward Hacking
by: Pan, Alexander, et al.
Published: (2024)
Development of a graph neural network surrogate for travel demand modelling
by: Makarov, Nikita, et al.
Published: (2024)
Explanation Hacking: The perils of algorithmic recourse
by: Sullivan, Emily, et al.
Published: (2024)
A multi-model approach using XAI and anomaly detection to predict asteroid hazards
by: Mondal, Amit Kumar, et al.
Published: (2025)
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
by: Schulhoff, Sander, et al.
Published: (2023)
LLM Agents can Autonomously Hack Websites
by: Fang, Richard, et al.
Published: (2024)
Spontaneous Reward Hacking in Iterative Self-Refinement
by: Pan, Jane, et al.
Published: (2024)
Mitigating Preference Hacking in Policy Optimization with Pessimism
by: Gupta, Dhawal, et al.
Published: (2025)
Building surrogate models using trajectories of agents trained by Reinforcement Learning
by: Cestero, Julen, et al.
Published: (2025)
Deep learning surrogate models of JULES-INFERNO for wildfire prediction on a global scale
by: Cheng, Sibo, et al.
Published: (2024)
Explaining deep neural network models for electricity price forecasting with XAI
by: Pesenti, Antoine, et al.
Published: (2025)
On Teacher Hacking in Language Model Distillation
by: Tiapkin, Daniil, et al.
Published: (2025)
Is Conversational XAI All You Need? Human-AI Decision Making With a Conversational XAI Assistant
by: He, Gaole, et al.
Published: (2025)
AI Readiness in Healthcare through Storytelling XAI
by: Dubey, Akshat, et al.
Published: (2024)
The Role of XAI in Transforming Aeronautics and Aerospace Systems
by: Zorita, Francisco Javier Cantero, et al.
Published: (2024)
Exploring SAIG Methods for an Objective Evaluation of XAI
by: Miró-Nicolau, Miquel, et al.
Published: (2026)
Guidelines For The Choice Of The Baseline in XAI Attribution Methods
by: Morasso, Cristian, et al.
Published: (2025)
Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents
by: Jeurissen, Dominik, et al.
Published: (2024)
Proof-of-Use: Mitigating Tool-Call Hacking in Deep Research Agents
by: Ma, Shengjie, et al.
Published: (2025)
Reward Hacking Mitigation using Verifiable Composite Rewards
by: Tarek, Mirza Farhan Bin, et al.
Published: (2025)
PenTest++: Elevating Ethical Hacking with AI and Automation
by: Al-Sinani, Haitham S., et al.
Published: (2025)
Repairing Reward Functions with Feedback to Mitigate Reward Hacking
by: Hatgis-Kessell, Stephane, et al.
Published: (2025)
Efficient reformulations of ReLU deep neural networks for surrogate modelling in power system optimisation
by: Kumar, Yogesh Pipada Sunil, et al.
Published: (2026)
XAI for Point Cloud Data using Perturbations based on Meaningful Segmentation
by: Mulawade, Raju Ningappa, et al.
Published: (2025)
ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
Reward Hacking as Equilibrium under Finite Evaluation
by: Wang, Jiacheng, et al.
Published: (2026)
A Mechanistic Explanatory Strategy for XAI
by: Rabiza, Marcin
Published: (2024)
RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents
by: Atinafu, Yonas, et al.
Published: (2026)