Most contemporary approaches to AI alignment rely on reward maximization and utility-based optimization. While effective in constrained environments, these paradigms remain vulnerable to reward hacking, goal misgeneralization, and catastrophic instrumental behavior. This paper proposes a fundamental shift in alignment theory: alignment by acceptable failure. We argue that moral agency—human or artificial—is not defined by the rewards an agent seeks, but by the worst-case consequences it is willing to accept. A choice is meaningful only if its failure mode is survivable or morally tolerable. Building on this principle, we introduce an AI architecture governed by an Immutable Moral Kernel, in which safety is enforced as a non-negotiable boundary rather than an optimization target. By defining a strict safety floor instead of an aspirational moral ceiling, this framework ensures that artificial intelligence remains permanently bounded within human-tolerable failure modes.
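To make the floor-versus-ceiling distinction concrete, the following minimal sketch shows one way a fixed safety boundary can be separated from reward optimization. It is an illustration under stated assumptions, not the architecture described in the paper: the names `Action`, `MoralKernel`, `choose`, and the scalar `worst_case_harm` estimate are all hypothetical, and the sketch assumes a reliable worst-case harm estimate is available for every candidate action.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Action:
    """A candidate action (hypothetical structure, for illustration only)."""
    name: str
    expected_reward: float
    worst_case_harm: float  # assumed scalar estimate of the worst outcome


@dataclass(frozen=True)  # frozen: the threshold cannot be reassigned after construction
class MoralKernel:
    """A fixed safety floor; stands in for the paper's Immutable Moral Kernel."""
    max_tolerable_harm: float

    def permits(self, action: Action) -> bool:
        # The check looks only at the failure mode, never at the reward.
        return action.worst_case_harm <= self.max_tolerable_harm


def choose(actions: Sequence[Action], kernel: MoralKernel) -> Optional[Action]:
    """Filter by the safety floor first, then optimize among what remains."""
    permitted = [a for a in actions if kernel.permits(a)]
    if not permitted:
        return None  # no acceptable failure mode exists, so decline to act
    return max(permitted, key=lambda a: a.expected_reward)


kernel = MoralKernel(max_tolerable_harm=1.0)
candidates = [
    Action("aggressive", expected_reward=10.0, worst_case_harm=5.0),
    Action("cautious", expected_reward=3.0, worst_case_harm=0.5),
]
chosen = choose(candidates, kernel)
print(chosen.name if chosen else "no permitted action")  # prints "cautious"
```

In this sketch no amount of expected reward can buy back an action whose worst case exceeds the threshold, because reward is consulted only after the filter has run. That ordering is the design choice the abstract contrasts with treating safety as one more term in an objective.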
Vinicius Ramos Braga (Wed,) studied this question.
www.synapsesocial.com/papers/698586498f7c464f2300a4c3 — DOI: https://doi.org/10.5281/zenodo.18486218