Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Yu, Mingyu; Wang, Wei; Wei, Yanjie; Qin, Sujuan; Gao, Fei; Li, Wenmin
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2505.23404

_version_	1866911516612100096
author	Yu, Mingyu; Wang, Wei; Wei, Yanjie; Qin, Sujuan; Gao, Fei; Li, Wenmin
author_facet	Yu, Mingyu; Wang, Wei; Wei, Yanjie; Qin, Sujuan; Gao, Fei; Li, Wenmin
contents	Recent advancements in adversarial jailbreak attacks have exposed critical vulnerabilities in Large Language Models (LLMs), enabling the circumvention of alignment safeguards through increasingly sophisticated prompt manipulations. Our experiments find that the effectiveness of jailbreak strategies is influenced by the comprehension ability of the target LLM. Building on this insight, we propose an Adaptive Jailbreak Framework (AJF) based on the comprehension ability of black-box large language models. Specifically, AJF first categorizes the comprehension ability of the LLM and then applies different strategies accordingly. For models with limited comprehension ability (Type-I LLMs), AJF integrates layered semantic mutations with an encryption technique (MuEn strategy) to more effectively evade the LLM's defenses during the input and inference stages. For models with strong comprehension ability (Type-II LLMs), AJF employs a more complex strategy that builds upon the MuEn strategy by adding a further layer: inducing the LLM to generate an encrypted response. This forms a dual-end encryption scheme (MuDeEn strategy), further bypassing the LLM's defenses during the output stage. Experimental results demonstrate the effectiveness of our approach, achieving attack success rates of 98.9% on GPT-4o (29 May 2025 release) and 99.8% on GPT-4.1 (8 July 2025 release). Our work contributes to a deeper understanding of the vulnerabilities in current LLMs' alignment mechanisms.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_23404
institution	arXiv
publishDate	2025
record_format	arxiv
title	AJF: Adaptive Jailbreak Framework Based on the Comprehension Ability of Black-Box Large Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2505.23404
