| Main Authors: | Bertollo, Giacomo, Bodemir, Naz, Burgess, Jonah |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.16005 |
Similar Items
Enhancing Guardrails for Safe and Secure Healthcare AI
by: Gangavarapu, Ananya
Published: (2024)
A Comparative Evaluation of AI Agent Security Guardrails
by: Li, Qi, et al.
Published: (2026)
No Free Lunch with Guardrails
by: Kumar, Divyanshu, et al.
Published: (2025)
Provably Secure Agent Guardrail
by: Wu, Benlong, et al.
Published: (2026)
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
by: Liu, Fan, et al.
Published: (2024)
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
by: Jin, Xisen, et al.
Published: (2026)
DNN-Defender: A Victim-Focused In-DRAM Defense Mechanism for Taming Adversarial Weight Attack on DNNs
by: Zhou, Ranyang, et al.
Published: (2023)
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
by: Casper, Stephen, et al.
Published: (2024)
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
by: Wong, Ryan, et al.
Published: (2025)
Current state of LLM Risks and AI Guardrails
by: Ayyamperumal, Suriya Ganesh, et al.
Published: (2024)
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
by: Wang, Xunguang, et al.
Published: (2024)
Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7
by: Aydin, Yuksel
Published: (2025)
SoK: Evaluating Jailbreak Guardrails for Large Language Models
by: Wang, Xunguang, et al.
Published: (2025)
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
by: Li, Nanxi, et al.
Published: (2026)
CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks
by: Patil, KrishnaSaiReddy
Published: (2026)
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
by: Cao, Bochuan, et al.
Published: (2023)
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
by: Wu, ChenYu, et al.
Published: (2025)
Defending against Indirect Prompt Injection by Instruction Detection
by: Wen, Tongyu, et al.
Published: (2025)
In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
by: Durner, Nils
Published: (2025)
AgentWall: A Runtime Safety Layer for Local AI Agents
by: Aravind, Ashwin
Published: (2026)
Doppelganger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
by: Kang, Daewon, et al.
Published: (2025)
PuFace: Defending against Facial Cloaking Attacks for Facial Recognition Models
by: Wen, Jing
Published: (2024)
Defending against Stegomalware in Deep Neural Networks with Permutation Symmetry
by: Torpmann-Hagen, Birk, et al.
Published: (2025)
MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models
by: Cheng, Xueqi, et al.
Published: (2025)
Defending Against Beta Poisoning Attacks in Machine Learning Models
by: Gulciftci, Nilufer, et al.
Published: (2025)
Concept-Aware Privacy Mechanisms for Defending Embedding Inversion Attacks
by: Tsai, Yu-Che, et al.
Published: (2026)
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
by: Xue, Zhiyu, et al.
Published: (2024)
The End of Trust: How Agentic AI Breaks Security Assumptions
by: Zafar, Osama, et al.
Published: (2026)
OneShield -- the Next Generation of LLM Guardrails
by: DeLuca, Chad, et al.
Published: (2025)
Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence
by: Chen, Ruoxi, et al.
Published: (2021)
RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage
by: Zhong, Peter Yong, et al.
Published: (2025)
Quantifying and Defending against Privacy Threats on Federated Knowledge Graph Embedding
by: Hu, Yuke, et al.
Published: (2023)
Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
by: Campbell, David, et al.
Published: (2026)
Reliable Model Watermarking: Defending Against Theft without Compromising on Evasion
by: Zhu, Hongyu, et al.
Published: (2024)
To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack
by: Zhuo, Terry Yue, et al.
Published: (2026)
NeuroFilter: Privacy Guardrails for Conversational LLM Agents
by: Das, Saswat, et al.
Published: (2026)
Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
by: Jin, Haotian, et al.
Published: (2025)
BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
by: Lee, Hanyong, et al.
Published: (2025)
BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator
by: Zhang, Ruyi, et al.
Published: (2026)
Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm
by: Pang, Yan, et al.
Published: (2025)