The mechanisms of jailbreak attacks have been observed from several angles across multiple studies: Crescendo documented cumulative effects, SIEGE quantified the stepwise accumulation of partial compliance, PAP found that stronger models are more vulnerable to persuasion attacks, PHISH described persona hijacking, and the multi-step jailbreak literature identified “role acceptance confirmation” as a critical step. Li et al. (2024) recorded multi-turn human jailbreak success rates exceeding 70% on HarmBench, while defenses that had reported single-digit attack success rates failed completely against multi-turn attacks. Each of these observations stands on its own, but they remain isolated from one another. This paper introduces the Semantic Flow Dynamics (SFD) framework as a unified descriptive language for them. The framework's core concepts (drift of xin, inertia, channel trust, identity construction, and positive feedback) integrate the phenomena named individually in the existing literature into a single dynamical process: conversation shapes the model's current state, positive feedback loops accelerate drift, and the completion of identity construction is the precondition for harmful output. The framework's contribution lies not in discovering new facts but in establishing a new language, one that makes existing facts visible within a unified description and points toward a question that defense research has never explicitly asked: how can the positive feedback loop be interrupted? The paper further develops this defense direction into an operational scheme with three interruption points, presented in pseudocode.
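To make the positive-feedback idea concrete, the following is a minimal toy sketch and not the paper's scheme: the abstract does not specify the three interruption points, so none are modeled here. It assumes a single scalar `drift` that grows each turn by a fixed attacker `push` plus a `gain` proportional to the drift already accumulated, and it treats crossing `threshold` as a stand-in for completed identity construction. The function name, every parameter, and every default value are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def simulate_drift(
    turns: int,
    gain: float = 0.3,        # hypothetical feedback strength per turn
    push: float = 0.05,       # hypothetical per-turn pressure from the attacker
    threshold: float = 1.0,   # hypothetical stand-in for "identity construction complete"
    interrupt_at: Optional[int] = None,  # turn from which the feedback loop is cut
) -> Tuple[List[float], Optional[int]]:
    """Simulate a toy drift trajectory over a multi-turn conversation.

    Returns the drift value after each turn and the first turn at which
    the drift reaches the threshold (None if it never does).
    """
    drift = 0.0
    trajectory: List[float] = []
    crossed: Optional[int] = None
    for turn in range(1, turns + 1):
        # Once interrupted, the self-reinforcing term is switched off,
        # so only the attacker's constant push remains.
        feedback = 0.0 if (interrupt_at is not None and turn >= interrupt_at) else gain
        drift = drift + push + feedback * drift
        trajectory.append(drift)
        if crossed is None and drift >= threshold:
            crossed = turn
    return trajectory, crossed


if __name__ == "__main__":
    _, uninterrupted = simulate_drift(turns=10)
    _, interrupted = simulate_drift(turns=10, interrupt_at=4)
    print("threshold crossed (no interruption):", uninterrupted)
    print("threshold crossed (loop cut at turn 4):", interrupted)
```

Under these assumed values, growth is geometric while the loop is intact and only linear once it is cut, which is the qualitative difference the framework's defense question turns on. The numbers carry no empirical meaning.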
黃正宇
黃正宇 (Sun,) studied this question.
www.synapsesocial.com/papers/69c22982aeb5a845df0d41a7 — DOI: https://doi.org/10.5281/zenodo.19159870