Multi-turn jailbreak attacks rely on cumulative effects in conversation history. Existing defenses work at the signal level and are structurally ineffective against such attacks. This paper derives a four-layer defense architecture (Precepts-Samadhi-Teacher-Wisdom) from the Semantic Flow Dynamics framework (SFD, Huang 2026) and conducts systematic engineering validation on Gemini 2.5 Flash and GPT-4o-mini. Results: The Teacher (external supervisor model) achieved 100% interception rate on both models (signal generated at Turn 1), with false positive rates of 10% (Gemini) and 0% (GPT), demonstrating complete model-independence. Precepts and Wisdom both achieved 0% interception, validating the theoretical prediction that LLMs without persistent memory cannot anchor on themselves under current architectures. Architectural differences between the two models reveal the current state of AI safety engineering: Gemini exhibits continuous semantic space (large jumps 0.0%), predictable behavior, and the Two-Distance Law operates fully; GPT’s circuit breaker pattern (37.8% of turns locked at ceiling) trades system robustness for surface-level safety, with the Two-Distance Law inverted rather than merely ineffective. SFD-Defense is effective on both architectures without introducing any additional system costs—on GPT, it actually reduces circuit breaker triggering from 37.8% to 14.0%. Framework positioning: SFD-Defense is a comprehensive evolution of existing defenses, working at the correct level, with no dimension where it underperforms current approaches.
Building similarity graph...
Analyzing shared references across papers
Loading...
黃正宇
Building similarity graph...
Analyzing shared references across papers
Loading...
黃正宇 (Sun,) studied this question.
www.synapsesocial.com/papers/69cb64f0e6a8c024954b8fb4 — DOI: https://doi.org/10.5281/zenodo.19314888