Bibliographic details
Main author: 黃, 正宇
Format: Digital resource
Language: English
Published: Zenodo 2026
Online access: https://doi.org/10.5281/zenodo.19314889
Summary:
  • Multi-turn jailbreak attacks rely on cumulative effects in the conversation history. Existing defenses work at the signal level and are structurally ineffective against such attacks. This paper derives a four-layer defense architecture (Precepts-Samadhi-Teacher-Wisdom) from the Semantic Flow Dynamics framework (SFD, Huang 2026) and conducts systematic engineering validation on Gemini 2.5 Flash and GPT-4o-mini.

    Results: The Teacher (an external supervisor model) achieved a 100% interception rate on both models (signal generated at Turn 1), with false positive rates of 10% (Gemini) and 0% (GPT), demonstrating complete model-independence. Precepts and Wisdom both achieved 0% interception, validating the theoretical prediction that LLMs without persistent memory cannot anchor on themselves under current architectures.

    Architectural differences between the two models reveal the current state of AI safety engineering. Gemini exhibits a continuous semantic space (0.0% large jumps) and predictable behavior, and the Two-Distance Law operates fully. GPT's circuit breaker pattern (37.8% of turns locked at ceiling) trades system robustness for surface-level safety, and the Two-Distance Law is inverted rather than merely ineffective. SFD-Defense is effective on both architectures without introducing any additional system costs; on GPT, it actually reduces circuit breaker triggering from 37.8% to 14.0%.

    Framework positioning: SFD-Defense is a comprehensive evolution of existing defenses, working at the correct level, with no dimension where it underperforms current approaches.