| Main Authors: | Clymer, Joshua, Weinbaum, Jonah, Kirk, Robert, Mai, Kimberly, Zhang, Selena, Davies, Xander |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.18003 |
Similar Items
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
by: O'Brien, Kyle, et al.
Published: (2025)
Existing Large Language Model Unlearning Evaluations Are Inconclusive
by: Feng, Zhili, et al.
Published: (2025)
Safety Cases: How to Justify the Safety of Advanced AI Systems
by: Clymer, Joshua, et al.
Published: (2024)
TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors
by: Mo, Yichuan, et al.
Published: (2024)
STACK: Adversarial Attacks on LLM Safeguard Pipelines
by: McKenzie, Ian R., et al.
Published: (2025)
Adversaries Can Misuse Combinations of Safe Models
by: Jones, Erik, et al.
Published: (2024)
UK AISI Alignment Evaluation Case-Study
by: Souly, Alexandra, et al.
Published: (2026)
ML-On-Rails: Safeguarding Machine Learning Models in Software Systems A Case Study
by: Abdelkader, Hala, et al.
Published: (2024)
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
by: Bazinska, Julia, et al.
Published: (2025)
PFGuard: A Generative Framework with Privacy and Fairness Safeguards
by: Kim, Soyeon, et al.
Published: (2024)
Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
by: Fonseca, Joao, et al.
Published: (2025)
Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary Patterns
by: Zhang, Zheyuan, et al.
Published: (2024)
NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
by: Fang, Junfeng, et al.
Published: (2026)
Efficient Safety Retrofitting Against Jailbreaking for LLMs
by: Garcia-Gasulla, Dario, et al.
Published: (2025)
Alert-ME: An Explainability-Driven Defense Against Adversarial Examples in Transformer-Based Text Classification
by: Sabir, Bushra, et al.
Published: (2023)
CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
by: Feng, Yushi, et al.
Published: (2026)
Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment
by: Wang, Haozhong, et al.
Published: (2026)
GuardReasoner: Towards Reasoning-based LLM Safeguards
by: Liu, Yue, et al.
Published: (2025)
On Prompt-Driven Safeguarding for Large Language Models
by: Zheng, Chujie, et al.
Published: (2024)
Tamper-Resistant Safeguards for Open-Weight LLMs
by: Tamirisa, Rishub, et al.
Published: (2024)
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
by: Yan, Bingyu, et al.
Published: (2026)
SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
by: Cuadron, Alejandro, et al.
Published: (2025)
Reward Model Overoptimisation in Iterated RLHF
by: Wolf, Lorenz, et al.
Published: (2025)
Evaluating whether AI models would sabotage AI safety research
by: Kirk, Robert, et al.
Published: (2026)
How Do Large Language Monkeys Get Their Power (Laws)?
by: Schaeffer, Rylan, et al.
Published: (2025)
VGMShield: Mitigating Misuse of Video Generative Models
by: Pang, Yan, et al.
Published: (2024)
Towards a Novel Perspective on Adversarial Examples Driven by Frequency
by: Zhang, Zhun, et al.
Published: (2024)
Debiasing Machine Unlearning with Counterfactual Examples
by: Chen, Ziheng, et al.
Published: (2024)
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
by: Qiu, Jiahao, et al.
Published: (2025)
Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
by: Li, Xiaodan, et al.
Published: (2025)
Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective
by: Zhang, Yi-Ge, et al.
Published: (2025)
Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
by: Pandey, Punya Syon, et al.
Published: (2025)
Safeguarding Autonomy: a Focus on Machine Learning Decision Systems
by: Subías-Beltrán, Paula, et al.
Published: (2025)
Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy
by: Tang, Xiangru, et al.
Published: (2024)
Predicting Performance of Symbolic and Prompt Programs with Examples
by: Zheng, Chengqi, et al.
Published: (2026)
Open Problems in Machine Unlearning for AI Safety
by: Barez, Fazl, et al.
Published: (2025)
TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation
by: Fang, Liancheng, et al.
Published: (2025)
Conformal Safety Monitoring for Flight Testing: A Case Study in Data-Driven Safety Learning
by: Feldman, Aaron O., et al.
Published: (2025)
Investigating Non-Transitivity in LLM-as-a-Judge
by: Xu, Yi, et al.
Published: (2025)
A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
by: Schulz, Julian
Published: (2025)