What type of study is this?

This is an experimental study.

October 20, 2025 · Open Access

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Key Points

ASGuard effectively reduces the success rate of targeted jailbreaking while preserving model capabilities.
Using circuit analysis, the method identifies attention heads linked to vulnerabilities from tense-changing attacks.
The framework trains channel-wise scaling vectors to recalibrate the activations of these heads, then applies them in preventative fine-tuning.
Findings illustrate the importance of a mechanistic understanding in developing targeted measures for AI safety.

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which models that refuse harmful requests often comply when the same requests are rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreak, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we incorporate it into a "preventative fine-tuning", forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
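
The channel-wise scaling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `scale_head_outputs`, the data layout (one output vector per attention head), and the example numbers are all assumptions made for illustration. It only shows the general idea of multiplying the output of selected heads element-wise by a per-channel scaling vector while leaving the other heads untouched.

```python
from typing import Dict, List

Vector = List[float]


def scale_head_outputs(
    head_outputs: List[Vector],
    scales: Dict[int, Vector],
) -> List[Vector]:
    """Rescale the outputs of selected attention heads channel by channel.

    head_outputs: one output vector per head (hypothetical layout).
    scales: maps a head index to its scaling vector (hypothetical; in the
            paper's setting such a vector would be learned, here it is
            given by hand).
    """
    scaled: List[Vector] = []
    for head_idx, output in enumerate(head_outputs):
        scale = scales.get(head_idx)
        if scale is None:
            # Head is not targeted: copy its output unchanged.
            scaled.append(list(output))
            continue
        if len(scale) != len(output):
            raise ValueError(
                f"scale for head {head_idx} has length {len(scale)}, "
                f"expected {len(output)}"
            )
        # Element-wise (channel-wise) multiplication.
        scaled.append([s * x for s, x in zip(scale, output)])
    return scaled


# Toy example: three heads with four channels each; only head 1 is targeted.
head_outputs = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, -1.0, 2.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
scales = {1: [1.0, 0.5, 0.0, 2.0]}
result = scale_head_outputs(head_outputs, scales)
print(result[1])  # [0.5, -0.5, 0.0, 0.0]
```

In this toy setup a scale of 1.0 leaves a channel unchanged, a value below 1.0 dampens it, and 0.0 removes it. How the scaling vectors are trained, and which heads are selected, is determined by the circuit analysis and training procedure in the paper and is not reproduced here.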

Authors

Y.K. Park

Jungwoo Park

Jaewoo Kang
