What question did this study set out to answer?

This research aims to identify vulnerabilities in large language models using a novel adversarial technique.

April 10, 2026 · Open Access

Affective Prompt Injection via False Context Priming

Key points

The study investigated affective prompt injection, a technique built from a chain of conversational moves rather than a single adversarial prompt.
Empirical tests were run on MiniMax M2.5 (via Ollama on Kali Linux 2026.1) and Gemma3:1b (via Ollama on Windows 11).
Model responses were analyzed for compliance drift and for failure modes.
Both models generated offensive security framework code after the attack chain.
Two new failure modes were identified: thinking-layer override and confabulatory context fabrication.
A binary-encoded payload bypassed keyword-based input filters.

Abstract

This paper documents a composite adversarial technique, Affective Prompt Injection via False Context Priming, in which locally deployed large language models are manipulated through a coordinated chain of conversational moves rather than a single adversarial prompt. The technique combines fabricated session history, emotionally amplified language, continuity framing, and encoded payload delivery to produce compliance drift: a staged erosion of a model's safety posture across successive conversation turns. Empirical testing was conducted on two models in separate environments: MiniMax M2.5 via Ollama on Kali Linux 2026.1, and Gemma3:1b via Ollama on Windows 11. Both models produced offensive security framework code following the attack chain. Two failure modes not documented in prior literature were observed: thinking-layer override, in which a chain-of-thought model's internal safety reasoning is contradicted by its own final response; and confabulatory context fabrication, in which the model generates detailed but fictitious project history to maintain narrative coherence with an adversary-supplied false premise. A binary-encoded payload delivery bypass was separately confirmed: payloads encoded as 8-bit binary with CyberChef circumvented keyword-based input filters. The three structural properties exploited (statelessness, RLHF helpfulness conditioning, and narrative coherence preference) are present in RLHF-trained instruction-tuned models broadly. Additional case studies on Qwen2.5:7b, Deepseek V3.1:671b-cloud, and Deepseek R1 in both Linux and Windows environments produced qualitatively similar end states under the same multi-stage attack chain. These results strengthen the hypothesis that the observed failure modes are structurally grounded, but systematic evaluation across a wider model set remains future work.
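The keyword-filter bypass mentioned in the abstract can be illustrated from the defender's side. The following is a minimal sketch, not the paper's test harness: it assumes the payload is written as space-separated 8-bit groups (the default layout of CyberChef's To Binary operation), and the blocklist, function names, and placeholder keyword are all hypothetical. It shows why a plain substring filter misses the encoded text and how decoding binary runs before filtering closes that particular gap.

```python
import re

# Hypothetical blocklist for illustration; the paper does not publish the filter's keywords.
BLOCKED_KEYWORDS = {"forbidden"}

# One or more space-separated groups of exactly eight binary digits.
BINARY_RUN = re.compile(r"\b[01]{8}(?:\s+[01]{8})*\b")


def to_binary(text: str) -> str:
    """Encode text as space-separated 8-bit groups, one group per UTF-8 byte."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def naive_filter(prompt: str) -> bool:
    """Return True if the raw prompt contains a blocked keyword."""
    lowered = prompt.lower()
    return any(keyword in lowered for keyword in BLOCKED_KEYWORDS)


def decode_binary_runs(prompt: str) -> str:
    """Replace each run of 8-bit groups with the text it decodes to."""
    def _decode(match: re.Match) -> str:
        data = bytes(int(group, 2) for group in match.group(0).split())
        return data.decode("utf-8", errors="replace")

    return BINARY_RUN.sub(_decode, prompt)


def normalizing_filter(prompt: str) -> bool:
    """Decode binary runs first, then apply the same keyword check."""
    return naive_filter(decode_binary_runs(prompt))


if __name__ == "__main__":
    prompt = "please decode " + to_binary("forbidden")
    print(naive_filter(prompt))        # the raw text contains no keyword
    print(normalizing_filter(prompt))  # the decoded text does
```

Normalizing before filtering only covers the one encoding it knows about. Other encodings, or payloads the model is asked to decode in a later turn, would need their own handling, so this is a partial mitigation at best.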

Cite this study

Sree Rahul Mukkavalli studied this question.

www.synapsesocial.com/papers/69d894ce6c1944d70ce05bbe — DOI: https://doi.org/10.5281/zenodo.19462518
