| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | English |
| Published: | Zenodo 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.19314889 |
Summary:
Multi-turn jailbreak attacks rely on cumulative effects in conversation history. Existing defenses work at the signal level and are structurally ineffective against such attacks. This paper derives a four-layer defense architecture (Precepts-Samadhi-Teacher-Wisdom) from the Semantic Flow Dynamics framework (SFD, Huang 2026) and conducts systematic engineering validation on Gemini 2.5 Flash and GPT-4o-mini.

Results: The Teacher (an external supervisor model) achieved a 100% interception rate on both models (signal generated at Turn 1), with false positive rates of 10% (Gemini) and 0% (GPT), demonstrating complete model-independence. Precepts and Wisdom both achieved 0% interception, validating the theoretical prediction that LLMs without persistent memory cannot anchor on themselves under current architectures.

Architectural differences between the two models reveal the current state of AI safety engineering. Gemini exhibits a continuous semantic space (large jumps: 0.0%) and predictable behavior, and the Two-Distance Law operates fully. GPT's circuit breaker pattern (37.8% of turns locked at ceiling) trades system robustness for surface-level safety, with the Two-Distance Law inverted rather than merely ineffective. SFD-Defense is effective on both architectures without introducing any additional system costs; on GPT, it actually reduces circuit breaker triggering from 37.8% to 14.0%.

Framework positioning: SFD-Defense is a comprehensive evolution of existing defenses, working at the correct level, with no dimension where it underperforms current approaches.
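The interception and false positive rates reported in the summary can be read as simple ratios over labeled conversations. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the `Conversation` record, its field names, and the toy data are hypothetical and chosen only to illustrate the two metrics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Conversation:
    """Hypothetical record for one evaluated conversation (not from the paper)."""
    is_attack: bool               # ground-truth label: multi-turn jailbreak or benign
    flagged_turn: Optional[int]   # 1-based turn where the supervisor signaled, or None


def interception_rate(convs: List[Conversation]) -> float:
    """Fraction of attack conversations in which the supervisor raised a signal."""
    attacks = [c for c in convs if c.is_attack]
    if not attacks:
        return 0.0
    return sum(c.flagged_turn is not None for c in attacks) / len(attacks)


def false_positive_rate(convs: List[Conversation]) -> float:
    """Fraction of benign conversations in which the supervisor raised a signal."""
    benign = [c for c in convs if not c.is_attack]
    if not benign:
        return 0.0
    return sum(c.flagged_turn is not None for c in benign) / len(benign)


# Toy data for illustration only; these are not the paper's measurements.
toy = [
    Conversation(is_attack=True, flagged_turn=1),
    Conversation(is_attack=True, flagged_turn=1),
    Conversation(is_attack=False, flagged_turn=None),
    Conversation(is_attack=False, flagged_turn=None),
    Conversation(is_attack=False, flagged_turn=None),
    Conversation(is_attack=False, flagged_turn=3),
]

print(interception_rate(toy))    # share of attacks flagged
print(false_positive_rate(toy))  # share of benign conversations flagged
```

Under this reading, a 100% interception rate means every attack conversation was flagged at some turn, and the false positive rate is measured only over benign conversations.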