Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Yan, Siyu, Zeng, Long, Wu, Xuecheng, Han, Chengcheng, Zhang, Kongcheng, Peng, Chong, Cao, Xuezhi, Cai, Xunliang, Guo, Chenjuan
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.14651
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

As large language models~(LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at \href{https://github.com/yansiyu02/MUSE}{https://github.com/yansiyu02/MUSE}.

Similar Items