MARC View: :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Zhexin; Lu, Yida; Ma, Jingyuan; Zhang, Di; Li, Rui; Ke, Pei; Sun, Hao; Sha, Lei; Sui, Zhifang; Wang, Hongning; Huang, Minlie
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2402.16444

_version_	1866917828394745856
author	Zhang, Zhexin; Lu, Yida; Ma, Jingyuan; Zhang, Di; Li, Rui; Ke, Pei; Sun, Hao; Sha, Lei; Sui, Zhifang; Wang, Hongning; Huang, Minlie
author_facet	Zhang, Zhexin; Lu, Yida; Ma, Jingyuan; Zhang, Di; Li, Rui; Ke, Pei; Sun, Hao; Sha, Lei; Sui, Zhifang; Wang, Hongning; Huang, Minlie
contents	The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but the field still lacks a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with common safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective as a safety evaluator for advanced LLMs. ShieldLM is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_16444
institution	arXiv
publishDate	2024
record_format	arxiv
title	ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
topic	Computation and Language
url	https://arxiv.org/abs/2402.16444

Similar Items