:: Library Catalog

Bibliographic Details
Main Authors:	Wilhelm, Alexander; Zweig, Katharina A.
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.16626

Similar Items

Monitoring Emergent Reward Hacking During Generation via Internal Activations
by: Wilhelm, Patrick, et al.
Published: (2026)

Quantitative study about the estimated impact of the AI Act
by: Hauer, Marc P., et al.
Published: (2023)

MultiFIX: An XAI-friendly feature inducing approach to building models from multimodal data
by: Malafaia, Mafalda, et al.
Published: (2024)

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
by: Taylor, Mia, et al.
Published: (2025)

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
by: Roth, Amit, et al.
Published: (2026)

Forms of Understanding for XAI-Explanations
by: Buschmeier, Hendrik, et al.
Published: (2023)

Trustworthy XAI and Application
by: Nasim, MD Abdullah Al, et al.
Published: (2024)

Energy Guided Geometric Flow Matching
by: Zweig, Aaron, et al.
Published: (2025)

Guaranteed prediction sets for functional surrogate models
by: Gray, Ander, et al.
Published: (2025)

Hacking CTFs with Plain Agents
by: Turtayev, Rustem, et al.
Published: (2024)

Reward Hacking in Rubric-Based Reinforcement Learning
by: Mahmoud, Anas, et al.
Published: (2026)

Feedback Loops With Language Models Drive In-Context Reward Hacking
by: Pan, Alexander, et al.
Published: (2024)

Development of a graph neural network surrogate for travel demand modelling
by: Makarov, Nikita, et al.
Published: (2024)

Explanation Hacking: The perils of algorithmic recourse
by: Sullivan, Emily, et al.
Published: (2024)

A multi-model approach using XAI and anomaly detection to predict asteroid hazards
by: Mondal, Amit Kumar, et al.
Published: (2025)

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
by: Schulhoff, Sander, et al.
Published: (2023)

LLM Agents can Autonomously Hack Websites
by: Fang, Richard, et al.
Published: (2024)

Spontaneous Reward Hacking in Iterative Self-Refinement
by: Pan, Jane, et al.
Published: (2024)

Mitigating Preference Hacking in Policy Optimization with Pessimism
by: Gupta, Dhawal, et al.
Published: (2025)

Building surrogate models using trajectories of agents trained by Reinforcement Learning
by: Cestero, Julen, et al.
Published: (2025)

Deep learning surrogate models of JULES-INFERNO for wildfire prediction on a global scale
by: Cheng, Sibo, et al.
Published: (2024)

Explaining deep neural network models for electricity price forecasting with XAI
by: Pesenti, Antoine, et al.
Published: (2025)

On Teacher Hacking in Language Model Distillation
by: Tiapkin, Daniil, et al.
Published: (2025)

Is Conversational XAI All You Need? Human-AI Decision Making With a Conversational XAI Assistant
by: He, Gaole, et al.
Published: (2025)

AI Readiness in Healthcare through Storytelling XAI
by: Dubey, Akshat, et al.
Published: (2024)

The Role of XAI in Transforming Aeronautics and Aerospace Systems
by: Zorita, Francisco Javier Cantero, et al.
Published: (2024)

Exploring SAIG Methods for an Objective Evaluation of XAI
by: Miró-Nicolau, Miquel, et al.
Published: (2026)

Guidelines For The Choice Of The Baseline in XAI Attribution Methods
by: Morasso, Cristian, et al.
Published: (2025)

Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents
by: Jeurissen, Dominik, et al.
Published: (2024)

Proof-of-Use: Mitigating Tool-Call Hacking in Deep Research Agents
by: Ma, Shengjie, et al.
Published: (2025)

Reward Hacking Mitigation using Verifiable Composite Rewards
by: Tarek, Mirza Farhan Bin, et al.
Published: (2025)

PenTest++: Elevating Ethical Hacking with AI and Automation
by: Al-Sinani, Haitham S., et al.
Published: (2025)

Repairing Reward Functions with Feedback to Mitigate Reward Hacking
by: Hatgis-Kessell, Stephane, et al.
Published: (2025)

Efficient reformulations of ReLU deep neural networks for surrogate modelling in power system optimisation
by: Kumar, Yogesh Pipada Sunil, et al.
Published: (2026)

XAI for Point Cloud Data using Perturbations based on Meaningful Segmentation
by: Mulawade, Raju Ningappa, et al.
Published: (2025)

ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)

Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)

Reward Hacking as Equilibrium under Finite Evaluation
by: Wang, Jiacheng, et al.
Published: (2026)

A Mechanistic Explanatory Strategy for XAI
by: Rabiza, Marcin
Published: (2024)

RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents
by: Atinafu, Yonas, et al.
Published: (2026)