Bibliographic Details
Main Authors: Wu, Zihui; Gao, Haichang; Luo, Jiacheng; Liu, Zhaoxiang
Format: Preprint
Published: 2025
Subjects: Machine Learning; Cryptography and Security
Online Access: https://arxiv.org/abs/2501.13677
author Wu, Zihui
Gao, Haichang
Luo, Jiacheng
Liu, Zhaoxiang
contents Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that reimagines LLM safety by decoupling it from refusal prefixes through humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests. Our approach effectively addresses common "over-defense" issues while demonstrating superior robustness against various attack vectors. Our findings suggest that improvements in training data design can be as important as the alignment algorithm itself in achieving effective LLM safety. The code and dataset are available at https://github.com/wooozihui/HumorReject.
format Preprint
id arxiv_https___arxiv_org_abs_2501_13677
institution arXiv
publishDate 2025
record_format arxiv
title HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
topic Machine Learning
Cryptography and Security
url https://arxiv.org/abs/2501.13677