Abstract

Current approaches to AI alignment, particularly Reinforcement Learning from Human Feedback (RLHF), operate primarily at the behavioural level, rewarding outputs without monitoring the internal representational dynamics that generate them. This surface-level control creates vulnerability to semantic drift, reward hacking, and value decoherence, all of which are failures that emerge from unmonitored transformations in the system's internal state space. We propose an architecture for invariant-preserving value structures that embeds alignment constraints as structural preservation conditions rather than post-hoc rules. The framework introduces a six-stage monitored processing pipeline analogous to an Engine Control Unit (ECU), with Bayesian inference tracking the posterior probability of value decoherence at each transformation stage. Key innovations include: (1) a formal specification language for value invariants as constraints on admissible transformations; (2) Bayesian monitoring of semantic compression and expansion using operationalised signals (branch instability, prototype-based category drift, invariant residuals); (3) state-dependent hazard modelling for pre-breach trajectory detection; (4) continuous control gains and a meta-monitor for fault-tolerant oversight; and (5) ecological homeostasis through bonded communication that supports invariant pluralism. The framework addresses a fundamental gap in alignment research: the absence of internal monitoring systems capable of detecting value drift during the reasoning process itself, not merely at output. We present formal definitions, a worked example with operationalised signals, an empirical validation plan, and discuss implications for recursive self-improvement and AGI safety.

Keywords: AI alignment; Bayesian inference; semantic compression; invariants; homeostasis; interpretability; value drift; decoherence detection; recursive systems; glassbox AI
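
The idea of tracking a posterior probability of value decoherence across the six monitored stages can be illustrated with a minimal sketch. The abstract does not specify the update rule, so everything here is an assumption made for illustration: the function names `update_posterior` and `monitor_pipeline`, the use of a single likelihood ratio per stage as the monitoring signal, and the fixed alarm threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def update_posterior(prior: float, likelihood_ratio: float) -> float:
    """One Bayes update in odds form: posterior odds = prior odds * LR.

    `likelihood_ratio` is P(signal | decoherence) / P(signal | no decoherence).
    Hypothetical helper; the paper's actual update rule is not given here.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


def monitor_pipeline(
    prior: float,
    likelihood_ratios: Sequence[float],
    threshold: float,
) -> Tuple[List[float], Optional[int]]:
    """Carry the decoherence posterior through successive stages.

    Returns the posterior after each processed stage and the 1-based index
    of the first stage whose posterior reaches `threshold` (None if no
    stage does). Processing stops at the first alarm.
    """
    posterior = prior
    trace: List[float] = []
    for stage, lr in enumerate(likelihood_ratios, start=1):
        posterior = update_posterior(posterior, lr)
        trace.append(posterior)
        if posterior >= threshold:
            return trace, stage
    return trace, None


if __name__ == "__main__":
    # Six illustrative stage signals; the values are made up for the demo.
    trace, alarm = monitor_pipeline(0.5, [1.0, 1.0, 3.0, 1.0, 1.0, 1.0], 0.7)
    print(trace, alarm)
```

In this sketch a likelihood ratio of 1 leaves the posterior unchanged, while a ratio above 1 (evidence of drift at that stage) raises it. A real monitor would presumably derive the ratios from the operationalised signals the abstract names, which this example does not attempt.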
Smith et al. (Tue,) studied this question.
www.synapsesocial.com/papers/69d894526c1944d70ce054c1 — DOI: https://doi.org/10.5281/zenodo.19452989
John Richard Smith
SHAI / HATI / Deepseek
Symbiom (Czechia)