Staff View: Self-Destructive Language Model

Bibliographic Details
Main Authors:	Wang, Yuhui; Zhu, Rongyi; Wang, Ting
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Cryptography and Security
Online Access:	https://arxiv.org/abs/2505.12186

_version_	1866917300941094912
author	Wang, Yuhui; Zhu, Rongyi; Wang, Ting
author_facet	Wang, Yuhui; Zhu, Rongyi; Wang, Ting
contents	Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. The code is available: https://github.com/ZJUWYH/seam. (Warning: this paper contains potentially harmful content generated by LLMs.)
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_12186
institution	arXiv
publishDate	2025
record_format	arxiv
title	Self-Destructive Language Model
topic	Machine Learning; Artificial Intelligence; Cryptography and Security
url	https://arxiv.org/abs/2505.12186
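
Note on the abstract: the "Hessian-free gradient estimate" can be illustrated with a generic finite-difference sketch. This is not taken from the paper or its repository. It assumes, purely for illustration, a coupling term equal to the inner product of the benign and harmful gradients, c(theta) = g_b(theta) . g_h(theta), whose exact gradient is H_b g_h + H_h g_b. Each Hessian-vector product is approximated from two gradient evaluations, so no Hessian is ever formed. The names matvec, hvp_fd, coupling_grad_fd, A, and B are hypothetical, and the toy quadratic losses stand in for real model losses.

```python
def matvec(M, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def hvp_fd(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v as (grad(theta + eps*v) - grad(theta)) / eps."""
    shifted = [t + eps * v_i for t, v_i in zip(theta, v)]
    g_shift = grad_fn(shifted)
    g_base = grad_fn(theta)
    return [(a - b) / eps for a, b in zip(g_shift, g_base)]

def coupling_grad_fd(grad_b, grad_h, theta, eps=1e-4):
    """Hessian-free estimate of the gradient of g_b(theta) . g_h(theta).

    Assumed coupling term (illustrative only): the exact gradient is
    H_b @ g_h + H_h @ g_b, and each product is replaced by a finite difference.
    """
    g_b = grad_b(theta)
    g_h = grad_h(theta)
    term_b = hvp_fd(grad_b, theta, g_h, eps)
    term_h = hvp_fd(grad_h, theta, g_b, eps)
    return [x + y for x, y in zip(term_b, term_h)]

# Toy quadratic losses L(theta) = 0.5 * theta^T M theta with symmetric M,
# so the gradient is M @ theta and the Hessian is M itself.
A = [[3.0, 1.0], [1.0, 2.0]]    # stand-in for the benign loss
B = [[1.0, -0.5], [-0.5, 4.0]]  # stand-in for the harmful loss
grad_b = lambda t: matvec(A, t)
grad_h = lambda t: matvec(B, t)

theta = [0.5, -1.0]
estimate = coupling_grad_fd(grad_b, grad_h, theta)

# Closed-form reference for the toy case: A @ (B @ theta) + B @ (A @ theta).
exact = [x + y for x, y in zip(matvec(A, matvec(B, theta)),
                               matvec(B, matvec(A, theta)))]
```

For quadratic losses the finite difference matches the closed form up to floating-point rounding. For a real network the approximation error depends on eps and on the local curvature, which is the kind of error the abstract says the authors bound theoretically.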

Similar Items