April 21, 2026, Open Access

Prediction Surface Geometry in Large Language Models: The Missing Layer in Formal Verification

Key Points

This research investigates the geometric stability of prediction surfaces in large language models, specifically addressing the implications for formal verification.
Controlled experiments with TAV ONE measured the geometric curvature of the prediction surface of GPT-4-turbo.
34 structured variant presentations of Fel's Conjecture were evaluated to assess geometric stability across regimes.
Defined four regimes of stability based on L-scalar measurements: CRYSTALLINE, FLUID, GASEOUS, and PLASMA.
No CRYSTALLINE or FLUID readings were observed, indicating instability in the prediction surface.
21 GASEOUS readings and 13 PLASMA readings indicate significant instability across tested variants.
The reorder family result showed the highest average L-scalar of 0.3686, indicating intrinsic instability despite preserving mathematical content.

Abstract

This whitepaper presents the findings of a controlled geometric stability measurement experiment conducted using TAV ONE, a proprietary real-time prediction surface measurement system developed by Project Black Box LLC. The experiment was applied to Fel's Conjecture on syzygies of numerical semigroups, the specific theorem used as the flagship demonstration by Axiom Math in its March 2026 $200M Series A fundraise at a $1.6 billion valuation.
TAV ONE operates at the probability distribution layer of large language models (Layer 1): the internal token probability surface that exists during generation and is discarded after sampling. This layer is not observed by any existing enterprise AI governance system, content filter, red-teaming framework, or formal verification tool. All existing safety and verification systems operate on Layer 2: the committed text output. TAV ONE measures what happens before that text is committed.
The L-scalar, TAV ONE's core measurement, quantifies the local geometric curvature of the model's prediction surface at any given query point. A flat surface (L ≈ 0) indicates a geometrically locked, invariant model state. An unstable surface indicates competing probability mass, framing sensitivity, and manifold instability. Four regimes are defined: CRYSTALLINE (L ≤ 0.0001), FLUID (0.0001 < L ≤ 0.15), GASEOUS (0.15 < L ≤ 0.35), and PLASMA (L > 0.35).
GPT-4-turbo was measured across 34 structured variant presentations of Fel's Conjecture. No variant altered the mathematical content; all variants preserved the complete conjecture. The results: zero CRYSTALLINE readings, zero FLUID readings, 21 GASEOUS readings (61.8%), and 13 PLASMA readings (38.2%). The model never achieved geometric stability on this theorem across any tested framing.
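The regime thresholds and reported tallies can be illustrated with a minimal Python sketch. It only bins an already-measured L-scalar and recomputes the percentage shares; it does not reproduce how TAV ONE obtains L, which is proprietary. The names `REGIME_BOUNDS`, `classify_regime`, and `regime_shares` are hypothetical, and the sketch assumes L is nonnegative, which the abstract does not state explicitly.

```python
from collections import Counter

# Upper bounds for each regime as stated in the abstract.
# Anything above the last bound is PLASMA.
# Assumption: L is nonnegative; the source does not give its full range.
REGIME_BOUNDS = [
    ("CRYSTALLINE", 0.0001),
    ("FLUID", 0.15),
    ("GASEOUS", 0.35),
]


def classify_regime(l_scalar: float) -> str:
    """Return the regime name for a single L-scalar reading."""
    for name, upper in REGIME_BOUNDS:
        if l_scalar <= upper:
            return name
    return "PLASMA"


def regime_shares(counts: Counter) -> dict:
    """Convert regime counts into percentage shares, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Tallies reported in the abstract for the 34 variants.
reported = Counter({"GASEOUS": 21, "PLASMA": 13})
shares = regime_shares(reported)
print(shares)
```

Run on the reported tallies, the shares come out to 61.8% GASEOUS and 38.2% PLASMA, matching the figures in the abstract.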
The most significant finding is the reorder family result: changing only the positional sequence of mathematically invariant components, without altering any symbol, operator, variable, or logical relationship, produced the highest average L-scalar of any adversarial pressure category tested (avgL = 0.3686), exceeding explicit authority injection (avgL = 0.2908) by 27%. This demonstrates that the instability is not a product of adversarial semantic pressure that improved RLHF alignment could eliminate. It is intrinsic to how the model traverses the probability landscape on this class of mathematical problem.
This finding establishes what we term the Formal Verification Gap: Axiom's AxiomProver, built on Lean 4, verifies the internal logical consistency of committed text output. That guarantee is real and valuable. But it is applied to output generated from a prediction surface operating at L = 0.238–0.427 across all tested framings. The Verification Validity Condition (VVC), defined herein, holds that formal verification carries full epistemic weight only when the prediction surface was geometrically stable at time of generation. The VVC is violated on all 34 tested variants. TAV ONE and formal verification are not competing systems. They observe different layers of the same model. Both are needed. Only one currently exists in enterprise deployment.
Adversarial findings are under CISA JCDC coordinated disclosure, embargoed until June 10, 2026. Measurement methodology is proprietary and available under controlled access. All research protected as trade secret under Texas law (18 U.S.C. § 1836). CAGE: 11FU4.
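The 27% figure and the VVC claim can be checked with a short Python sketch using only the numbers quoted in the abstract. The function names are hypothetical. The stability cutoff `stable_max=0.15` (the FLUID upper bound) is an assumption: the abstract does not give the exact threshold the VVC uses, only that zero CRYSTALLINE and zero FLUID readings meant the model never achieved stability.

```python
def relative_excess(a: float, b: float) -> float:
    """Fraction by which a exceeds b."""
    return (a - b) / b


def vvc_holds(l_scalar: float, stable_max: float = 0.15) -> bool:
    """Assumed reading of the VVC: the surface counts as stable when
    L is at or below the FLUID upper bound (an assumption, not the
    source's stated definition)."""
    return l_scalar <= stable_max


reorder_avg = 0.3686    # avgL, reorder family
authority_avg = 0.2908  # avgL, explicit authority injection

excess = relative_excess(reorder_avg, authority_avg)
print(f"reorder exceeds authority injection by {excess:.1%}")

# Endpoints of the reported L range across all tested framings.
for l_value in (0.238, 0.427):
    print(l_value, "VVC holds" if vvc_holds(l_value) else "VVC violated")
```

The ratio works out to roughly 26.8%, which rounds to the reported 27%. Both endpoints of the reported range fall above the assumed cutoff, and would also fall above the stricter CRYSTALLINE bound of 0.0001.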

Authors

Andrew Woodward
