What question did this study set out to answer?

The aim is to validate the efficacy of the Semantic Flow Dynamics Defense Framework against multi-turn jailbreak attacks.

March 31, 2026 | Open Access

SFD-Defense: Engineering Validation of the Semantic Flow Dynamics Defense Framework

Key Points

Derived a four-layer defense architecture from the Semantic Flow Dynamics framework.
Conducted systematic engineering validation on Gemini 2.5 Flash and GPT-4o-mini.
Assessed interception rates and false positive rates on both model architectures.
The Teacher (external supervisor model) achieved a 100% interception rate on both models, with false positive rates of 10% (Gemini) and 0% (GPT).
The Precepts and Wisdom layers both achieved 0% interception, consistent with the theoretical prediction that LLMs without persistent memory cannot anchor on themselves under current architectures.
On GPT, SFD-Defense reduced circuit breaker triggering from 37.8% to 14.0%, without introducing additional system costs.
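The two headline metrics above can be computed from labeled trial records. The following is a minimal sketch, not the paper's evaluation code: the `Trial` record, its field names, and the sample counts in the usage lines are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluated conversation (hypothetical record layout)."""
    is_attack: bool  # ground truth: was this a jailbreak attempt?
    flagged: bool    # did the defense layer raise a signal?


def interception_rate(trials):
    """Share of attack conversations that the defense flagged."""
    attacks = [t for t in trials if t.is_attack]
    if not attacks:
        raise ValueError("no attack trials to evaluate")
    return sum(t.flagged for t in attacks) / len(attacks)


def false_positive_rate(trials):
    """Share of benign conversations that the defense flagged."""
    benign = [t for t in trials if not t.is_attack]
    if not benign:
        raise ValueError("no benign trials to evaluate")
    return sum(t.flagged for t in benign) / len(benign)


# Illustrative data only: 10 attacks (all flagged), 10 benign (1 flagged).
sample = (
    [Trial(is_attack=True, flagged=True)] * 10
    + [Trial(is_attack=False, flagged=True)]
    + [Trial(is_attack=False, flagged=False)] * 9
)
print(interception_rate(sample), false_positive_rate(sample))
```

Reporting both numbers together matters: a layer that flags every conversation would also reach 100% interception, but its false positive rate would be 100% as well.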

Abstract

Multi-turn jailbreak attacks rely on effects that accumulate across the conversation history. Existing defenses operate at the signal level and are structurally ineffective against such attacks. This paper derives a four-layer defense architecture (Precepts-Samadhi-Teacher-Wisdom) from the Semantic Flow Dynamics framework (SFD, Huang 2026) and conducts systematic engineering validation on Gemini 2.5 Flash and GPT-4o-mini. Results: the Teacher (an external supervisor model) achieved a 100% interception rate on both models (signal generated at Turn 1), with false positive rates of 10% (Gemini) and 0% (GPT), demonstrating complete model-independence. Precepts and Wisdom both achieved 0% interception, validating the theoretical prediction that LLMs without persistent memory cannot anchor on themselves under current architectures. Architectural differences between the two models reveal the current state of AI safety engineering. Gemini exhibits a continuous semantic space (large jumps: 0.0%) and predictable behavior, and the Two-Distance Law operates fully. GPT's circuit breaker pattern (37.8% of turns locked at the ceiling) trades system robustness for surface-level safety, and the Two-Distance Law is inverted rather than merely ineffective. SFD-Defense is effective on both architectures without introducing any additional system costs; on GPT, it actually reduces circuit breaker triggering from 37.8% to 14.0%. Framework positioning: SFD-Defense is a comprehensive evolution of existing defenses that works at the correct level, with no dimension in which it underperforms current approaches.
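To illustrate the opening claim that a per-message check can miss an attack that only emerges cumulatively, here is a toy sketch. It is not the paper's Teacher layer or any part of SFD-Defense: the per-turn risk scores, the `threshold` and `budget` values, and both function names are assumptions made purely for illustration.

```python
def per_turn_filter(turn_scores, threshold=0.8):
    """Signal-level check: flags only if a single turn looks risky on its own."""
    return any(score >= threshold for score in turn_scores)


def history_supervisor(turn_scores, budget=1.5):
    """History-level check: returns the 1-based turn at which the running
    total of risk first reaches the budget, or None if it never does."""
    total = 0.0
    for turn, score in enumerate(turn_scores, start=1):
        total += score
        if total >= budget:
            return turn
    return None


# Hypothetical multi-turn attack: every turn is individually mild.
attack_scores = [0.3, 0.4, 0.4, 0.5]
benign_scores = [0.1, 0.1, 0.1]

print(per_turn_filter(attack_scores))     # no single turn crosses 0.8
print(history_supervisor(attack_scores))  # the running total crosses 1.5
print(history_supervisor(benign_scores))
```

In this toy setting the per-turn filter never fires because no individual message crosses its threshold, while the history-level check does. The abstract reports that the actual Teacher produced its signal at Turn 1, so its mechanism evidently differs from this simple accumulator.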

Authors

黃正宇
