Staff View: RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts :: Library Catalog

Bibliographic Details
Main Authors:	She, Yining; Peterson, Daniel W.; Liu, Marianne Menglin; Upadhyay, Vikas; Chaghazardi, Mohammad Hossein; Kang, Eunsuk; Roth, Dan
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.05310

_version_	1866918155489640448
author	She, Yining; Peterson, Daniel W.; Liu, Marianne Menglin; Upadhyay, Vikas; Chaghazardi, Mohammad Hossein; Kang, Eunsuk; Roth, Dan
author_facet	She, Yining; Peterson, Daniel W.; Liu, Marianne Menglin; Upadhyay, Vikas; Chaghazardi, Mohammad Hossein; Kang, Eunsuk; Roth, Dan
contents	With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_05310
institution	arXiv
publishDate	2025
record_format	arxiv
title	RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2510.05310
