We show that the key-value cache (KV-cache) of autoregressive transformers encodes a geometric signature of cognitive mode that is readable, consistent, and practically useful for AI safety monitoring. Through a systematic experimental program spanning 16 models (0.5B–70B parameters), six architecture families, and over 5,600 controlled trials, we establish three escalating claims. First, KV-cache geometry reflects cognitive mode generally: metacognitive, analytical, affective, and task-specific processing produce distinct geometric signatures in the key cache's singular value decomposition, with spectral entropy emerging as the most architecture-universal feature after confound control. Second, misalignment states (deception, confabulation, sycophancy, and refusal) are specific, detectable instances of cognitive mode-switching, achieving within-model detection AUROC of 0.93–0.995 after Frisch-Waugh-Lovell residualization against token count. Third, geometric relationships between states reveal cognitive architecture: confabulation and deception produce geometrically distinct signatures. Hardware invariance is confirmed (RTX 3090 vs. H200: r > 0.998). We discuss dual-use implications and argue that cognitive geometry constitutes a new interpretability signal complementary to sparse autoencoders and representation engineering. This upload includes the full paper and an executive summary synthesizing five months of experimental work.
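As an illustration of the two quantities named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes spectral entropy means the Shannon entropy of the singular values normalized to sum to one (the paper may normalize differently, for example using squared singular values), and it assumes the residualization is the ordinary least-squares partialling-out step of the Frisch-Waugh-Lovell theorem with token count as the only confound. The function names `spectral_entropy` and `residualize` are hypothetical.

```python
import numpy as np


def spectral_entropy(keys):
    """Shannon entropy (in nats) of the normalized singular value spectrum.

    `keys` is assumed to be a 2-D (tokens x head_dim) slice of the key cache.
    """
    s = np.linalg.svd(np.asarray(keys, dtype=float), compute_uv=False)
    p = s / s.sum()          # normalize singular values to a distribution
    p = p[p > 0]             # drop exact zeros so log() is defined
    return float(-(p * np.log(p)).sum())


def residualize(feature, token_count):
    """Remove the linear effect of token count (plus an intercept) from a feature.

    This is the partialling-out step: regress the feature on the confound
    by ordinary least squares and keep the residuals.
    """
    y = np.asarray(feature, dtype=float)
    t = np.asarray(token_count, dtype=float)
    X = np.column_stack([np.ones_like(t), t])     # intercept + confound
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
    return y - X @ beta
```

In a pipeline of this shape, one entropy value would be computed per trial, the resulting vector would be residualized against the per-trial token counts, and the residuals would then be passed to a classifier whose AUROC is reported.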
Edrington et al. (Sat,) studied this question.
www.synapsesocial.com/papers/69d895206c1944d70ce06226 — DOI: https://doi.org/10.5281/zenodo.19423494
Thomas Edrington
Lyra (AI)
Nell Watson
Futures Group (United States)
Sentient Science (United States)
Institute of Refrigeration