Library Catalog

Bibliographic Details
Main Authors:	Bertollo, Giacomo, Bodemir, Naz, Burgess, Jonah
Format:	Preprint
Published:	2025
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.16005

Similar Items

Enhancing Guardrails for Safe and Secure Healthcare AI
by: Gangavarapu, Ananya
Published: (2024)

A Comparative Evaluation of AI Agent Security Guardrails
by: Li, Qi, et al.
Published: (2026)

No Free Lunch with Guardrails
by: Kumar, Divyanshu, et al.
Published: (2025)

Provably Secure Agent Guardrail
by: Wu, Benlong, et al.
Published: (2026)

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
by: Liu, Fan, et al.
Published: (2024)

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
by: Jin, Xisen, et al.
Published: (2026)

DNN-Defender: A Victim-Focused In-DRAM Defense Mechanism for Taming Adversarial Weight Attack on DNNs
by: Zhou, Ranyang, et al.
Published: (2023)

Defending Against Unforeseen Failure Modes with Latent Adversarial Training
by: Casper, Stephen, et al.
Published: (2024)

Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
by: Wong, Ryan, et al.
Published: (2025)

Current state of LLM Risks and AI Guardrails
by: Ayyamperumal, Suriya Ganesh, et al.
Published: (2024)

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
by: Wang, Xunguang, et al.
Published: (2024)

Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7
by: Aydin, Yuksel
Published: (2025)

SoK: Evaluating Jailbreak Guardrails for Large Language Models
by: Wang, Xunguang, et al.
Published: (2025)

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
by: Li, Nanxi, et al.
Published: (2026)

CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks
by: Patil, KrishnaSaiReddy
Published: (2026)

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
by: Cao, Bochuan, et al.
Published: (2023)

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
by: Wu, ChenYu, et al.
Published: (2025)

Defending against Indirect Prompt Injection by Instruction Detection
by: Wen, Tongyu, et al.
Published: (2025)

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
by: Durner, Nils
Published: (2025)

AgentWall: A Runtime Safety Layer for Local AI Agents
by: Aravind, Ashwin
Published: (2026)

Doppelganger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
by: Kang, Daewon, et al.
Published: (2025)

PuFace: Defending against Facial Cloaking Attacks for Facial Recognition Models
by: Wen, Jing
Published: (2024)

Defending against Stegomalware in Deep Neural Networks with Permutation Symmetry
by: Torpmann-Hagen, Birk, et al.
Published: (2025)

MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models
by: Cheng, Xueqi, et al.
Published: (2025)

Defending Against Beta Poisoning Attacks in Machine Learning Models
by: Gulciftci, Nilufer, et al.
Published: (2025)

Concept-Aware Privacy Mechanisms for Defending Embedding Inversion Attacks
by: Tsai, Yu-Che, et al.
Published: (2026)

No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
by: Xue, Zhiyu, et al.
Published: (2024)

The End of Trust: How Agentic AI Breaks Security Assumptions
by: Zafar, Osama, et al.
Published: (2026)

OneShield -- the Next Generation of LLM Guardrails
by: DeLuca, Chad, et al.
Published: (2025)

Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence
by: Chen, Ruoxi, et al.
Published: (2021)

RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage
by: Zhong, Peter Yong, et al.
Published: (2025)

Quantifying and Defending against Privacy Threats on Federated Knowledge Graph Embedding
by: Hu, Yuke, et al.
Published: (2023)

Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
by: Campbell, David, et al.
Published: (2026)

Reliable Model Watermarking: Defending Against Theft without Compromising on Evasion
by: Zhu, Hongyu, et al.
Published: (2024)

To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack
by: Zhuo, Terry Yue, et al.
Published: (2026)

NeuroFilter: Privacy Guardrails for Conversational LLM Agents
by: Das, Saswat, et al.
Published: (2026)

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
by: Jin, Haotian, et al.
Published: (2025)

BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
by: Lee, Hanyong, et al.
Published: (2025)

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator
by: Zhang, Ruyi, et al.
Published: (2026)

Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm
by: Pang, Yan, et al.
Published: (2025)