Staff View: Multi-Turn Jailbreaks Are Simpler Than They Seem :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Xiaoxue; Lee, Jaeha; Dick, Anna-Katharina; Timm, Jasper; Xie, Fei; Cruz, Diogo
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2508.07646

_version_	1866911100858007552
author	Yang, Xiaoxue; Lee, Jaeha; Dick, Anna-Katharina; Timm, Jasper; Xie, Fei; Cruz, Diogo
author_facet	Yang, Xiaoxue; Lee, Jaeha; Dick, Anna-Katharina; Timm, Jasper; Xie, Fei; Cruz, Diogo
contents	While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find surprisingly that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at https://github.com/diogo-cruz/multi_turn_simpler
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_07646
institution	arXiv
publishDate	2025
record_format	arxiv
title	Multi-Turn Jailbreaks Are Simpler Than They Seem
topic	Machine Learning
url	https://arxiv.org/abs/2508.07646
