:: Library Catalog

Bibliographic Details
Main Authors:	Clymer, Joshua; Weinbaum, Jonah; Kirk, Robert; Mai, Kimberly; Zhang, Selena; Davies, Xander
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.18003

Similar Items

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
by: O'Brien, Kyle, et al.
Published: (2025)

Existing Large Language Model Unlearning Evaluations Are Inconclusive
by: Feng, Zhili, et al.
Published: (2025)

Safety Cases: How to Justify the Safety of Advanced AI Systems
by: Clymer, Joshua, et al.
Published: (2024)

TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors
by: Mo, Yichuan, et al.
Published: (2024)

STACK: Adversarial Attacks on LLM Safeguard Pipelines
by: McKenzie, Ian R., et al.
Published: (2025)

Adversaries Can Misuse Combinations of Safe Models
by: Jones, Erik, et al.
Published: (2024)

UK AISI Alignment Evaluation Case-Study
by: Souly, Alexandra, et al.
Published: (2026)

ML-On-Rails: Safeguarding Machine Learning Models in Software Systems A Case Study
by: Abdelkader, Hala, et al.
Published: (2024)

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
by: Bazinska, Julia, et al.
Published: (2025)

PFGuard: A Generative Framework with Privacy and Fairness Safeguards
by: Kim, Soyeon, et al.
Published: (2024)

Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
by: Fonseca, Joao, et al.
Published: (2025)

Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary Patterns
by: Zhang, Zheyuan, et al.
Published: (2024)

NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
by: Fang, Junfeng, et al.
Published: (2026)

Efficient Safety Retrofitting Against Jailbreaking for LLMs
by: Garcia-Gasulla, Dario, et al.
Published: (2025)

Alert-ME: An Explainability-Driven Defense Against Adversarial Examples in Transformer-Based Text Classification
by: Sabir, Bushra, et al.
Published: (2023)

CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
by: Feng, Yushi, et al.
Published: (2026)

Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment
by: Wang, Haozhong, et al.
Published: (2026)

GuardReasoner: Towards Reasoning-based LLM Safeguards
by: Liu, Yue, et al.
Published: (2025)

On Prompt-Driven Safeguarding for Large Language Models
by: Zheng, Chujie, et al.
Published: (2024)

Tamper-Resistant Safeguards for Open-Weight LLMs
by: Tamirisa, Rishub, et al.
Published: (2024)

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
by: Yan, Bingyu, et al.
Published: (2026)

SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
by: Cuadron, Alejandro, et al.
Published: (2025)

Reward Model Overoptimisation in Iterated RLHF
by: Wolf, Lorenz, et al.
Published: (2025)

Evaluating whether AI models would sabotage AI safety research
by: Kirk, Robert, et al.
Published: (2026)

How Do Large Language Monkeys Get Their Power (Laws)?
by: Schaeffer, Rylan, et al.
Published: (2025)

VGMShield: Mitigating Misuse of Video Generative Models
by: Pang, Yan, et al.
Published: (2024)

Towards a Novel Perspective on Adversarial Examples Driven by Frequency
by: Zhang, Zhun, et al.
Published: (2024)

Debiasing Machine Unlearning with Counterfactual Examples
by: Chen, Ziheng, et al.
Published: (2024)

EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
by: Qiu, Jiahao, et al.
Published: (2025)

Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
by: Li, Xiaodan, et al.
Published: (2025)

Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective
by: Zhang, Yi-Ge, et al.
Published: (2025)

Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
by: Pandey, Punya Syon, et al.
Published: (2025)

Safeguarding Autonomy: a Focus on Machine Learning Decision Systems
by: Subías-Beltrán, Paula, et al.
Published: (2025)

Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy
by: Tang, Xiangru, et al.
Published: (2024)

Predicting Performance of Symbolic and Prompt Programs with Examples
by: Zheng, Chengqi, et al.
Published: (2026)

Open Problems in Machine Unlearning for AI Safety
by: Barez, Fazl, et al.
Published: (2025)

TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation
by: Fang, Liancheng, et al.
Published: (2025)

Conformal Safety Monitoring for Flight Testing: A Case Study in Data-Driven Safety Learning
by: Feldman, Aaron O., et al.
Published: (2025)

Investigating Non-Transitivity in LLM-as-a-Judge
by: Xu, Yi, et al.
Published: (2025)

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
by: Schulz, Julian
Published: (2025)