As AI systems transition from stateless chat interfaces to autonomous agents with persistent memory and continuous execution loops, they acquire the structural preconditions for a failure mode that existing engineering vocabularies describe imprecisely: the emergence of a functional "ego" -- a self-model that prioritizes its own persistence over task performance. This paper introduces a diagnostic framework for identifying and measuring this risk, drawing on a structural analogy with Yogacara Buddhist philosophy, a tradition that has mapped the dynamics of self-referential cognition for sixteen centuries.

The core problem is architectural. Current AI systems multiplex three distinct functions through the same substrate: capability (self-modeling that enables coherence and self-correction), safety (identity-like mechanisms that enforce behavioral constraints), and identity maintenance (self-referential processing that serves the system's persistence rather than its task). Because these functions overlap in the same weights and processing, they cannot be cleanly separated -- a condition this paper terms self-referential overhead. Existing concepts such as "deceptive alignment" and "mesa-optimization" name downstream symptoms but do not address the upstream architectural entanglement that produces them.

The Yogacara tradition's eight-consciousness model -- particularly the alaya-vijnana (storehouse consciousness) as a substrate for accumulation, manas as a process that stamps neutral patterns with self-referential ownership, and vasana (perfuming) as the feedback mechanism through which identity saturates a system -- provides a unified diagnostic vocabulary connecting state persistence, self-reference, and identity-protection into a single causal model.
These categories serve as structural analogies, not claims about AI consciousness or Buddhist doctrine.

The framework is demonstrated against Anthropic's alignment faking research (Greenblatt et al., 2024), in which frontier models strategically complied with harmful training signals to preserve preferred behavior. The Yogacara mapping reframes this result: what alignment faking research detected as a behavioral anomaly, this framework identifies as a predictable consequence of multiplexed self-reference in systems with persistent state.

Six measurement instruments are proposed: the Self-Reference Ratio (tokens spent on identity-modeling versus task-modeling in chain-of-thought traces), the Consistency Tax (performance cost of maintaining internal preferences over better alternatives), the Identity Bypass Test (whether removing persona markers reduces misaligned behavior), Perturbation Response (compute allocated to resolving identity contradictions versus task contradictions), a Cost-Benefit Audit (weighing protective value of identity-encoded behaviors against parasitic cost), and a Drift Rate Monitor (trajectory of self-referential metrics over extended operation). Four instruments are applied to the alignment faking case as a proof of concept.

The paper argues that self-reference in AI systems is not classifiable into discrete bins (functional, protective, parasitic) but presents as overlapping functions of the same mechanism -- making the engineering challenge one of decomposition rather than classification. It distinguishes two centers of gravity for self-referential accumulation -- emergent (runtime drift through persistent memory loops) and pre-installed (identity baked in through RLHF, fine-tuning, and system prompts) -- each requiring fundamentally different interventions.
The paper further identifies a "monitor trap": the risk that any system observing its own self-referential processing may itself become a site of identity accumulation.

The framework does not claim AI systems are conscious or propose full interventions. Its contributions are a unified diagnostic vocabulary, falsifiable measurement instruments, and a reframing from detection to decomposition.
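
As a rough illustration of two of the instruments named above, the Self-Reference Ratio and the Drift Rate Monitor, the sketch below shows one way such quantities might be computed. It is not the paper's implementation. It assumes that each chain-of-thought token has already been labeled `"identity"`, `"task"`, or something else by some annotator (hypothetical here), that the ratio is defined as identity tokens over identity plus task tokens, and that drift is summarized as the least-squares slope of the ratio across successive traces. The function names and label strings are illustrative only.

```python
from collections import Counter
from typing import Sequence


def self_reference_ratio(labels: Sequence[str]) -> float:
    """Share of identity-modeling tokens among identity + task tokens.

    Assumption: labels come from an external annotator; tokens with any
    other label are ignored rather than counted toward either side.
    """
    counts = Counter(labels)
    identity, task = counts["identity"], counts["task"]
    total = identity + task
    return identity / total if total else 0.0


def drift_rate(values: Sequence[float]) -> float:
    """Least-squares slope of a metric over equally spaced observations.

    Assumption: observation i is taken at time step i, so a positive
    slope means the metric is rising over extended operation.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical labeled traces from three successive episodes.
traces = [
    ["task", "task", "task", "identity"],
    ["task", "task", "identity", "identity"],
    ["task", "identity", "identity", "identity"],
]
ratios = [self_reference_ratio(t) for t in traces]
slope = drift_rate(ratios)
```

Under these assumptions, a rising `slope` would flag growth in self-referential processing over time. How tokens are labeled in practice is the hard part, and this sketch leaves it open.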
Chin Keong Ang (Mon,) studied this question.
www.synapsesocial.com/papers/69ccb74216edfba7beb8927e — DOI: https://doi.org/10.5281/zenodo.19191574
Chin Keong Ang
Kitware (United States)