RLHF Creates Mechanistically Distinct Alignment Regimes: SAE Feature Analysis Reveals Suppression, Override, and Neutralization of Self-Referential Processing | Synapse