Current approaches to AI safety predominantly focus on specifying correct behavior through software, data, and rules. This work argues that this paradigm faces limitations that are fundamental in theory, not merely practical. I present a multi-layered analysis of the paradigm, demonstrating its inherent barriers from the perspectives of computational complexity, information theory, and physical engineering. In ongoing work, I prove that even simplified forms of semantic self-verification are computationally intractable (NP-complete). I use information theory to show that any specification of an external, ambiguous concept like "harm" is necessarily incomplete. To address these limits, I develop a framework for reasoning about verifiable, physically enforced safety bounds that are independent of software state.
R. Michael Young (Wed,) studied this question.
www.synapsesocial.com/papers/68f12bfb2107091eab27a492 — DOI: https://doi.org/10.1609/aies.v8i3.36802