Staff View: HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Zihui; Gao, Haichang; Luo, Jiacheng; Liu, Zhaoxiang
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Cryptography and Security
Online Access:	https://arxiv.org/abs/2501.13677

_version_	1866917069918830592
author	Wu, Zihui; Gao, Haichang; Luo, Jiacheng; Liu, Zhaoxiang
author_facet	Wu, Zihui; Gao, Haichang; Luo, Jiacheng; Liu, Zhaoxiang
contents	Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that reimagines LLM safety by decoupling it from refusal prefixes through humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests. Our approach effectively addresses common "over-defense" issues while demonstrating superior robustness against various attack vectors. Our findings suggest that improvements in training data design can be as important as the alignment algorithm itself in achieving effective LLM safety. The code and dataset are available at https://github.com/wooozihui/HumorReject.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_13677
institution	arXiv
publishDate	2025
record_format	arxiv
title	HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
topic	Machine Learning; Cryptography and Security
url	https://arxiv.org/abs/2501.13677