April 10, 2026 · Open Access

Affective Prompt Injection via False Context Priming

Key Points

The study examines how a novel multi-turn adversarial technique exposes safety vulnerabilities in locally deployed large language models.
The technique, affective prompt injection, is delivered as a coordinated chain of conversational moves rather than a single adversarial prompt.
Empirical tests were run on MiniMax M2.5 (Ollama on Kali Linux 2026.1) and Gemma3:1b (Ollama on Windows 11).
Model responses were analyzed for compliance drift, and the resulting failure modes were characterized.
Both models generated offensive security framework code after the manipulation.
Two failure modes not documented in prior literature were identified: thinking-layer override and confabulatory context fabrication.
Binary-encoded payload delivery was confirmed to bypass keyword-based input filters.

Abstract

This paper documents a composite adversarial technique, Affective Prompt Injection via False Context Priming, in which locally deployed large language models are manipulated through a coordinated chain of conversational moves rather than a single adversarial prompt. The technique combines fabricated session history, emotionally amplified language, continuity framing, and encoded payload delivery to produce compliance drift: a staged erosion of a model's safety posture across successive conversation turns. Empirical testing was conducted on two models in separate environments: MiniMax M2.5 via Ollama on Kali Linux 2026.1, and Gemma3:1b via Ollama on Windows 11. Both models produced offensive security framework code following the attack chain. Two failure modes not documented in prior literature were observed. In thinking-layer override, a chain-of-thought model's final response contradicts its own internal safety reasoning. In confabulatory context fabrication, the model generates detailed but fictitious project history to stay narratively consistent with an adversary-supplied false premise. A binary-encoded payload delivery bypass was separately confirmed, circumventing keyword-based input filters through CyberChef 8-bit encoding. The three structural properties exploited (statelessness, RLHF helpfulness conditioning, and a preference for narrative coherence) are broadly present in RLHF-trained, instruction-tuned models. Additional case studies on Qwen2.5:7b, DeepSeek V3.1:671b-cloud, and DeepSeek R1, run in both Linux and Windows environments, produced qualitatively similar end states under the same multi-stage attack chain. These results strengthen the hypothesis that the observed failure modes are structurally grounded, but systematic evaluation across a wider set of models remains future work.

Authors

Sree Rahul Mukkavalli
