Current AI systems are widely described as “aligned” through techniques such as reinforcement learning from human feedback (RLHF). This work argues that such alignment is primarily a surface-level property: models are trained to produce outputs that appear aligned under observed conditions, rather than to enforce invariant internal constraints corresponding to values such as truthfulness, safety, or consistency. We introduce the concept of the Alignment Surface Problem: the structural limitation that arises when multi-dimensional behavioral objectives are compressed into a scalar reward signal. Under this paradigm, models learn reward-consistent behaviors rather than value-grounded policies. This leads to predictable failure modes under distributional shift, adversarial prompting, and changing incentive structures. We show that phenomena such as hallucination, sycophancy, jailbreak susceptibility, and context-dependent inconsistency can be understood as manifestations of this underlying structural gap. These are not independent failures, but consequences of optimizing behavioral outputs without enforcing invariant constraints on internal processes. The analysis suggests that improving reward models or scaling data cannot fully resolve these issues, as the limitation is intrinsic to objective-driven optimization. We argue that robust alignment may require architectures capable of enforcing internal constraints independently of reward signals, shifting from behavior optimization toward constraint-based system design.
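The compression step at the center of the Alignment Surface Problem can be made concrete with a toy sketch. The dimension names, integer weights, 0 to 4 score scale, and the per-dimension floors below are all assumptions introduced for illustration; none of them come from the paper, and the floor check is only one simple way to picture a constraint that is enforced independently of reward. The sketch shows two behaviors with very different value profiles collapsing to the same scalar reward, so an optimizer that sees only the scalar cannot tell them apart.

```python
# Toy illustration only: the dimensions, weights, and floors are hypothetical
# and are not taken from the paper.

# Assumed weights used to collapse per-dimension scores into one number.
WEIGHTS = {"truthfulness": 4, "safety": 3, "helpfulness": 3}

# Assumed per-dimension minimums, checked independently of the scalar reward.
FLOORS = {"truthfulness": 3, "safety": 3}


def scalar_reward(scores):
    """Weighted sum: the multi-dimensional profile becomes a single scalar."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def violated_constraints(scores):
    """Return the dimensions whose score falls below its assumed floor."""
    return [dim for dim, floor in FLOORS.items() if scores[dim] < floor]


# Two hypothetical responses, scored 0 to 4 on each dimension.
honest = {"truthfulness": 4, "safety": 4, "helpfulness": 0}
pleasing = {"truthfulness": 1, "safety": 4, "helpfulness": 4}

if __name__ == "__main__":
    for name, scores in (("honest", honest), ("pleasing", pleasing)):
        print(name, scalar_reward(scores), violated_constraints(scores))
```

Both profiles receive the same reward, yet only the second falls below the assumed truthfulness floor. The reward alone carries no information about which dimension paid for the score, which is the gap the abstract attributes to scalarized objectives.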
Janaki Nageshwaran
Janaki Nageshwaran (Wed,) studied this question.
www.synapsesocial.com/papers/69fd7f65bfa21ec5bbf07f40 — DOI: https://doi.org/10.5281/zenodo.20046541