MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Song, Dinghong; Xu, Zhiwei; Wan, Hai; Zhao, Xibin; Su, Pengfei; Li, Dong
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Machine Learning
Online Access:	https://arxiv.org/abs/2601.02680

_version_	1866908749604585472
author	Song, Dinghong; Xu, Zhiwei; Wan, Hai; Zhao, Xibin; Su, Pengfei; Li, Dong
author_facet	Song, Dinghong; Xu, Zhiwei; Wan, Hai; Zhao, Xibin; Su, Pengfei; Li, Dong
contents	Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: LLMs that are benign in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss and integrates it with a two-stage distributed fine-tuning strategy with projected gradient descent to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's effectiveness: it achieves attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_02680
institution	arXiv
publishDate	2026
record_format	arxiv
title	Adversarial Contrastive Learning for LLM Quantization Attacks
topic	Cryptography and Security; Machine Learning
url	https://arxiv.org/abs/2601.02680