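Staff View: A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content :: Library Catalog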

Bibliographic Details
Main Authors:	Njeh, Chaima; Nakouri, Haïfa; Jaafar, Fehmi
Format:	Preprint
Published:	2025
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.16120

_version_	1866908333001146368
author	Njeh, Chaima; Nakouri, Haïfa; Jaafar, Fehmi
author_facet	Njeh, Chaima; Nakouri, Haïfa; Jaafar, Fehmi
contents	Large Language Models (LLMs) have made remarkable progress, but concerns about potential biases and harmful content persist. To address these concerns, we introduce a practical solution for ensuring the safe and ethical use of LLMs. Our novel approach focuses on a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and security. Rather than relying solely on model fine-tuning or prompt engineering, our method provides a robust data-centric alternative for mitigating harmful content. We demonstrate the effectiveness of our approach through experiments on multiple toxic datasets, which show a significant reduction in mean toxicity and jail-breaking scores after integration. Specifically, our results show reductions in mean toxicity and jail-breaking scores, respectively, of 15% and 21% with GPT-4, 28% and 5% with PaLM2, approximately 26% and 23% with Mistral-7B, and 11.1% and 19% with Gemma-2b-it. These results demonstrate the potential of our approach to improve the safety and security of LLMs, making them more suitable for real-world applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_16120
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2504.16120

Similar Items