Summary: :: Library Catalog

Bibliographic details
Main author:	黃, 正宇
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	AI safety LLM jailbreak multi-turn attack jailbreak defense semantic drift prompt injection red teaming alignment Semantic Flow Dynamics
Online access:	https://doi.org/10.5281/zenodo.19314889

Summary:

Multi-turn jailbreak attacks rely on cumulative effects in the conversation history. Existing defenses operate at the signal level and are structurally ineffective against such attacks. This paper derives a four-layer defense architecture (Precepts-Samadhi-Teacher-Wisdom) from the Semantic Flow Dynamics framework (SFD, Huang 2026) and conducts systematic engineering validation on Gemini 2.5 Flash and GPT-4o-mini. Results: The Teacher (an external supervisor model) achieved a 100% interception rate on both models (signal generated at Turn 1), with false positive rates of 10% (Gemini) and 0% (GPT), demonstrating complete model-independence. Precepts and Wisdom both achieved 0% interception, validating the theoretical prediction that LLMs without persistent memory cannot anchor on themselves under current architectures. Architectural differences between the two models reveal the current state of AI safety engineering: Gemini exhibits a continuous semantic space (large jumps: 0.0%) and predictable behavior, and the Two-Distance Law operates fully; GPT’s circuit breaker pattern (37.8% of turns locked at ceiling) trades system robustness for surface-level safety, with the Two-Distance Law inverted rather than merely ineffective. SFD-Defense is effective on both architectures without introducing any additional system costs; on GPT, it actually reduces circuit breaker triggering from 37.8% to 14.0%. Framework positioning: SFD-Defense is a comprehensive evolution of existing defenses, working at the correct level, with no dimension where it underperforms current approaches.
