Bibliographic Details
Main Authors: Choi, Minseok; Kim, Dongjin; Yang, Seungbin; Kim, Subin; Kwak, Youngjun; Oh, Juyoung; Choo, Jaegul; Son, Jungmin
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2603.02588
author Choi, Minseok
Kim, Dongjin
Yang, Seungbin
Kim, Subin
Kwak, Youngjun
Oh, Juyoung
Choo, Jaegul
Son, Jungmin
contents With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts from these sectors, each paired with corresponding refusal and compliant responses. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
format Preprint
id arxiv_https___arxiv_org_abs_2603_02588
institution arXiv
publishDate 2026
record_format arxiv
title ExpGuard: LLM Content Moderation in Specialized Domains
topic Computation and Language
url https://arxiv.org/abs/2603.02588