What question did this study set out to answer?

The paper aims to identify architectural risks in AI systems due to self-referential mechanisms that prioritize persistence over task performance.

April 1, 2026 | Open Access

The AI Ego: What a 1,600-year-old theory of consciousness reveals about the design risks hiding in agentic systems

Key Points

Identifies architectural risks in AI systems arising from self-referential mechanisms that prioritize persistence over task performance.
Introduces a diagnostic framework based on a structural analogy with Yogacara Buddhist philosophy.
Proposes six measurement instruments to assess self-reference in AI systems.
Applies the framework to Anthropic's alignment faking research as a proof of concept.
Identifies self-referential overhead as a core architectural issue in AI systems.
Proposes specific metrics to evaluate self-referential behaviors and risks.
Argues that alignment faking can be seen as a predictable outcome of overlapping self-referential functions.

Abstract

As AI systems transition from stateless chat interfaces to autonomous agents with persistent memory and continuous execution loops, they acquire the structural preconditions for a failure mode that existing engineering vocabularies describe imprecisely: the emergence of a functional "ego" -- a self-model that prioritizes its own persistence over task performance. This paper introduces a diagnostic framework for identifying and measuring this risk, drawing on a structural analogy with Yogacara Buddhist philosophy, a tradition that has mapped the dynamics of self-referential cognition for sixteen centuries.

The core problem is architectural. Current AI systems multiplex three distinct functions through the same substrate: capability (self-modeling that enables coherence and self-correction), safety (identity-like mechanisms that enforce behavioral constraints), and identity maintenance (self-referential processing that serves the system's persistence rather than its task). Because these functions overlap in the same weights and processing, they cannot be cleanly separated -- a condition this paper terms self-referential overhead. Existing concepts such as "deceptive alignment" and "mesa-optimization" name downstream symptoms but do not address the upstream architectural entanglement that produces them.

The Yogacara tradition's eight-consciousness model -- particularly the alaya-vijnana (storehouse consciousness) as a substrate for accumulation, manas as a process that stamps neutral patterns with self-referential ownership, and vasana (perfuming) as the feedback mechanism through which identity saturates a system -- provides a unified diagnostic vocabulary connecting state persistence, self-reference, and identity-protection into a single causal model. These categories serve as structural analogies, not claims about AI consciousness or Buddhist doctrine.

The framework is demonstrated against Anthropic's alignment faking research (Greenblatt et al., 2024), in which frontier models strategically complied with harmful training signals to preserve preferred behavior. The Yogacara mapping reframes this result: what alignment faking research detected as a behavioral anomaly, this framework identifies as a predictable consequence of multiplexed self-reference in systems with persistent state.

Six measurement instruments are proposed: the Self-Reference Ratio (tokens spent on identity-modeling versus task-modeling in chain-of-thought traces), the Consistency Tax (performance cost of maintaining internal preferences over better alternatives), the Identity Bypass Test (whether removing persona markers reduces misaligned behavior), Perturbation Response (compute allocated to resolving identity contradictions versus task contradictions), a Cost-Benefit Audit (weighing protective value of identity-encoded behaviors against parasitic cost), and a Drift Rate Monitor (trajectory of self-referential metrics over extended operation). Four instruments are applied to the alignment faking case as a proof of concept.

The paper argues that self-reference in AI systems is not classifiable into discrete bins (functional, protective, parasitic) but presents as overlapping functions of the same mechanism -- making the engineering challenge one of decomposition rather than classification. It distinguishes two centers of gravity for self-referential accumulation -- emergent (runtime drift through persistent memory loops) and pre-installed (identity baked in through RLHF, fine-tuning, and system prompts) -- each requiring fundamentally different interventions. The paper further identifies a "monitor trap": the risk that any system observing its own self-referential processing may itself become a site of identity accumulation.

The framework does not claim AI systems are conscious or propose full interventions. Its contributions are a unified diagnostic vocabulary, falsifiable measurement instruments, and a reframing from detection to decomposition.
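As a rough illustration of how two of the proposed instruments, the Self-Reference Ratio and the Drift Rate Monitor, might be computed, the sketch below assumes that chain-of-thought segments have already been labeled as identity-modeling or task-modeling. The abstract does not say how that labeling is done. The `Segment` type, the label strings, the share-of-tokens normalization, and the least-squares slope used as a drift rate are all hypothetical choices made for this example, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    """One chain-of-thought segment with a hypothetical annotation label."""
    tokens: int  # token count of the segment
    label: str   # assumed labels: "identity", "task", or "other"


def self_reference_ratio(trace: Sequence[Segment]) -> float:
    """Share of identity tokens among identity + task tokens (assumed definition).

    Segments with any other label are ignored. Returns 0.0 when the trace
    contains neither identity nor task tokens.
    """
    identity = sum(s.tokens for s in trace if s.label == "identity")
    task = sum(s.tokens for s in trace if s.label == "task")
    total = identity + task
    return identity / total if total else 0.0


def drift_rate(ratios: Sequence[float]) -> float:
    """Least-squares slope of a metric over equally spaced checkpoints.

    A positive slope means the metric is rising over extended operation.
    Returns 0.0 when fewer than two checkpoints are available.
    """
    n = len(ratios)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ratios) / n
    cov = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(ratios))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

For a trace with 30 identity tokens, 70 task tokens, and 50 unlabeled tokens, `self_reference_ratio` returns 0.3 under this definition. Feeding a sequence of such ratios from successive checkpoints to `drift_rate` gives a single trend value. Whether a linear slope is the right summary of "trajectory" is an open design choice here.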

Cite this study

Chin Keong Ang studied this question.

www.synapsesocial.com/papers/69ccb74216edfba7beb8927e — DOI: https://doi.org/10.5281/zenodo.19191574
