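Staff View: A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection :: Library Catalog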

Bibliographic Details
Main Authors:	Chua, Gabriel; Chan, Shing Yee; Khoo, Shaun
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning; 68T50; I.2.7
Online Access:	https://arxiv.org/abs/2411.12946

_version_	1866915234422194176
author	Chua, Gabriel; Chan, Shing Yee; Khoo, Shaun
author_facet	Chua, Gabriel; Chan, Shing Yee; Khoo, Shaun
contents	Large Language Models (LLMs) are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_12946
institution	arXiv
publishDate	2024
record_format	arxiv
title	A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
topic	Computation and Language; Machine Learning; 68T50; I.2.7
url	https://arxiv.org/abs/2411.12946
