Current defenses against large language model (LLM) jailbreaking and sabotage operate within a single interaction, creating a structural blind spot for attacks that distribute malicious intent across multiple sessions. We present the Temporal Immune System (TIS), an architectural framework for cross-session behavioral trajectory monitoring that detects adversarial patterns invisible to per-interaction defenses. Drawing on a systematic analysis of ten papers spanning the attack-defense spectrum, from automated multi-turn jailbreaking (SEMA, AutoAdv) through representation engineering (Zou et al.) to Anthropic's Sabotage Risk Report for Claude Opus 4.6, we identify a unified mechanism underlying all successful attacks: the Representation-Output Gap, wherein models internally represent safety-relevant knowledge but fail to consult it during generation. We formalize this gap mathematically, state the Temporal Pincer Theorem, which proves that patient adversaries cannot simultaneously evade temporal distribution analysis and achieve meaningful sabotage, and propose a Three-Axis Detection Framework augmented by temporal trajectory analysis. Our analysis reveals that the three sabotage pathways rated "Weak" for monitoring effectiveness in Anthropic's own assessment (broad sandbagging, persistent rogue deployment, government decision sabotage) are exactly the patterns that temporal trajectory monitoring is designed to detect. We argue that cross-session memory, implemented through architecturally independent monitoring that runs on a different computational substrate from the monitored model, constitutes a necessary and currently absent fourth layer of defense. Five falsifiable predictions enable empirical validation.
Daniel Bartz
MetCog
Collaborative Group (United States)
Nexen (Canada)
Bartz et al. (Fri,) studied this question.
www.synapsesocial.com/papers/699a9e0e482488d673cd475a — DOI: https://doi.org/10.5281/zenodo.18712274