MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Choi, Minseok; Kim, Dongjin; Yang, Seungbin; Kim, Subin; Kwak, Youngjun; Oh, Juyoung; Choo, Jaegul; Son, Jungmin
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.02588

_version_	1866914364893691904
author	Choi, Minseok; Kim, Dongjin; Yang, Seungbin; Kim, Subin; Kwak, Youngjun; Oh, Juyoung; Choo, Jaegul; Son, Jungmin
author_facet	Choi, Minseok; Kim, Dongjin; Yang, Seungbin; Kim, Subin; Kwak, Youngjun; Oh, Juyoung; Choo, Jaegul; Son, Jungmin
contents	With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_02588
institution	arXiv
publishDate	2026
record_format	arxiv
title	ExpGuard: LLM Content Moderation in Specialized Domains
topic	Computation and Language
url	https://arxiv.org/abs/2603.02588
