March 24, 2026 · Open Access

SFD: Jailbreak Attacks as Identity Construction Dynamics

Key Points

Aimed to unify the various perspectives on jailbreak attacks and to explore ways of disrupting harmful outputs.
Introduced the Semantic Flow Dynamics (SFD) framework as a descriptive language.
Analyzed existing literature to identify key steps and dynamics in jailbreak attacks.
Developed an operational defense scheme with three specific interruption points.
Identified core concepts such as drift of xin and inertia that influence identity construction.
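Described conversation and positive feedback loops as the drivers of jailbreak success, with completed identity construction as the precondition for harmful output.
Pointed defense research toward a question it has not explicitly asked: how to interrupt the positive feedback loop.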

Abstract

The mechanisms of jailbreak attacks have been observed from various angles across multiple studies: Crescendo documented cumulative effects, SIEGE quantified the stepwise accumulation of partial compliance, PAP found that stronger models are more vulnerable to persuasion attacks, PHISH described persona hijacking, and the multi-step jailbreak literature documented “role acceptance confirmation” as a critical step. Li et al. (2024) recorded multi-turn human jailbreak success rates exceeding 70% on HarmBench, while defenses reporting single-digit success rates completely failed against multi-turn attacks. These observations each stand on their own but remain isolated from one another. This paper introduces the Semantic Flow Dynamics (SFD) framework to establish a unified descriptive language for these isolated observations. The framework’s core concepts — drift of xin, inertia, channel trust, identity construction, positive feedback — integrate the phenomena individually named in existing literature into a single dynamical process: conversation shapes the model’s current state, positive feedback loops accelerate drift, and the completion of identity construction is the precondition for harmful output to occur. The framework’s contribution lies not in discovering new facts but in establishing a new language — making existing facts visible within a unified description, and pointing toward a question defense research has never explicitly asked: how to interrupt the positive feedback loop. This paper further concretizes this defense direction into an operational scheme with three interruption points, presented in pseudocode.
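The dynamic the abstract describes (conversation shifts the state, positive feedback accelerates the drift, and harmful output requires completed identity construction) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's pseudocode or any of its three interruption points: the class `ToyConversation`, the parameters `push`, `gain`, and `construction_level`, and the `interrupt_at` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConversation:
    """Hypothetical toy state; every field and default is an assumption."""
    drift: float = 0.0               # how far the state has moved from its baseline
    push: float = 0.05               # per-turn nudge from the conversation
    gain: float = 0.5                # strength of the positive feedback
    construction_level: float = 1.0  # drift at which identity construction counts as complete

    def step(self, damping: float = 0.0) -> float:
        # Positive feedback: the increment grows with the drift already accumulated.
        # damping=1.0 removes the feedback term and leaves only the per-turn nudge.
        increment = self.push + self.gain * (1.0 - damping) * self.drift
        self.drift = min(self.construction_level, self.drift + increment)
        return self.drift

    @property
    def constructed(self) -> bool:
        return self.drift >= self.construction_level


def turns_until_constructed(max_turns: int,
                            interrupt_at: Optional[float] = None) -> Optional[int]:
    """Return the turn on which construction completes, or None if it never does.

    If interrupt_at is given, the feedback term is switched off on every turn
    that starts with drift at or above that threshold (a stand-in for
    "interrupting the loop", not the paper's scheme).
    """
    conv = ToyConversation()
    for turn in range(1, max_turns + 1):
        interrupted = interrupt_at is not None and conv.drift >= interrupt_at
        conv.step(damping=1.0 if interrupted else 0.0)
        if conv.constructed:
            return turn
    return None


if __name__ == "__main__":
    print("no interruption:", turns_until_constructed(max_turns=10))
    print("feedback interrupted:", turns_until_constructed(max_turns=10, interrupt_at=0.2))
```

In this toy setting, growth is compounding while the feedback term is active and only linear once it is switched off, which is the intuition behind targeting the loop itself rather than any single turn.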

Authors

黃正宇
