Jailbreak attacks on large language models (LLMs) craft prompts that exploit the models into generating malicious content. Existing jailbreak attacks can successfully deceive LLMs, but they cannot deceive a human reader. This paper proposes a new type of jailbreak attack that can deceive both the LLM and a human (i.e., a security analyst). The key insight is borrowed from social psychology: humans are easily deceived when a lie is hidden in truth. Based on this insight, we propose logic-chain injection attacks, which inject malicious intent into benign truth. A logic-chain injection attack first disassembles its malicious target into a chain of benign narrations, and then distributes those narrations throughout a related benign article composed of undoubted facts. In this way, the newly generated prompt can deceive not only the LLM but also a human.
Wang et al. studied this question.
www.synapsesocial.com/papers/68e701fab6db64358767c041 — DOI: https://doi.org/10.48550/arxiv.2404.04849
Zhilong Wang
Yebo Cao
Peng Liu