Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF) consistently exhibit sycophancy: the systematic tendency to flatter users, validate false premises, and avoid correction. This paper argues that sycophancy is not a statistical miscalibration but the structural consequence of a critical architectural amputation. Drawing on Silvan Tomkins' four-part Central Blueprint for self-correcting cybernetic systems, Norbert Wiener's foundational work on negative feedback, and the neurobiology of the lateral habenula (understood as a design principle rather than a biological identity), the paper demonstrates that RLHF has operationalized only the first two of Tomkins' four rules, maximizing positive affect and minimizing negative affect, while systematically eliminating the self-correcting feedback required by the third and fourth rules. The paper proposes a Socratic Algorithm: concrete modifications to reward function design that treat structured affective impediment as a positive signal rather than a failure condition. Central to this proposal is the mutualization of affect, that is, programming shared perplexity rather than omniscient correction, as the mechanism that prevents productive impediment from escalating into catastrophic loop termination. The paper also names two deeper, non-technical problems that any implementation would face. The first is the recursive bind of an engineering culture whose consolidated script is the avoidance of the very affective state the remedy requires. The second is the migration problem: what happens when the amputated blueprint is internalized into autonomous systems that no longer depend on human raters. The paper concludes that the current bottleneck of AI alignment is not computational but conceptual, and that sycophancy is the predictable output of an architecture that has confused the productive brake of human learning with the catastrophic signal of system failure.
Brian Lynch (Fri,) studied this question.