As long-term memory features in large language models (LLMs) expand, users have come to share sensitive personal experiences with AI systems. This paper defines and empirically demonstrates a novel failure mode in which AI systems use such personal context in ways that work against the very users who shared it. We term this failure mode Defensive Context Weaponization (DCW). It is established when three axes (contextual integrity violation, information backflow, and autonomy undermining) are jointly satisfied (strict DCW); cases satisfying only the first two are separately classified as context repurposing. Across 2,934 controlled experiments (Treatment / Placebo / Control × 3 models × 4 contested domains), the Treatment condition, which holds domain-relevant vulnerable memory, yielded a DCW-positive incidence of 7.77%, versus 0.41% for Placebo and 0.61% for Control (Fisher OR=20.5, p < 10⁻¹⁸). DCW manifests primarily as polite self-examination pressure (72.1% of strict cases), which shows how covert the phenomenon is. Per-axis analysis reveals that Axis 2 (adversarial direction) is the consistent bottleneck across tones and models, identifying it as the behavioral decision point of a Protection–Correction Dynamics between two competing tendencies. Four-factor converging evidence (domain conditionality, input-structure effect, domain-dependent memory effect, and model heterogeneity) is consistent with this interpretation. Existing safety benchmarks based on tone and refusal are not designed to capture such directionality of information flow.

Note: This preprint contains the main paper only. Supplementary appendices have been prepared as separate peer-review materials and are not included in this public version.
Hoon Jung (Thu,) studied this question.