This paper examines the mathematical foundations of GPT-class large language models through the lens of classical mathematics: functional analysis, measurement theory, and numerical methods. The analysis identifies a systematic pattern: at every stage where rigorous mathematical convergence could not be established, the transformer architecture substitutes engineering approximations — empirical heuristics validated on benchmarks rather than proven in theory.

Ten Categories of Mathematical Gaps

The paper traces ten categories of unproven approximations in the GPT pipeline to their original publications. Among them:

Differentiable Relaxation of Discrete Choice — Softmax replaces discrete token selection with a continuous distribution. No theorem establishes that this relaxation preserves the semantic structure — ordering, coreference, logical dependency — of discrete sequences when composed across dozens to hundreds of layers.

Gradient Stabilization Without Optimality Proof — Residual connections and layer normalization stabilize training empirically. Convergence is proven only for linear networks (Hardt & Ma, 2016).

Scaling Laws Without Causal Mechanism — Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) are empirical curve fits with no causal mechanism. Schaeffer et al. (2023) showed that "emergent abilities" may be measurement artifacts.

Embedding Operations Without Invariant Preservation — The pipeline performs dot products, vector additions, and weighted sums on embeddings, each presupposing metric invariants (inner product axioms, additive compatibility, distance preservation across layers). No proof exists that learned embeddings satisfy these invariants globally. Furthermore, no theorem guarantees that semantic classes are geometrically separable in embedding space: distinct meanings can map to overlapping manifold regions, where model similarity ≠ semantic similarity.

Attention Mechanism: The Unformalized Core — Scaled dot-product attention has no proof of optimality, stable spectral properties, or bounded error propagation across layers.

In-Context Learning: Behavior Without Theory — ICL lacks a formal specification of what is learned, convergence conditions, and stability guarantees across distribution shifts.

Feed-Forward Blocks: Two-Thirds Without Formal Characterization — FFN blocks constitute ~67% of transformer parameters and carry three distinct gaps: no approximation error bounds, only partial post-hoc interpretability (not predictive specification), and no controllability mechanism. Approximation, interpretability, and controllability are independent properties; partial progress on one does not address the others.

Author's Engineering Observations

The paper includes four observations drawn from the author's practical agent-engineering experience, connecting the mathematical analysis to observable consequences: normalization constraining the geometry available for disambiguation, extended context windows as engineering workarounds for absent invariance proofs, BPE tokenization penalizing non-English languages in precision-critical domains, and overlapping semantic manifolds producing systematic disambiguation failures.

Error Composition

Each approximation introduces bounded error locally. The paper shows that their composition across dozens to hundreds of layers is completely uncharacterized: no published work describes the error propagation function for a transformer. In contrast, classical numerical methods (e.g., Levinson-Durbin) provide exact error bounds through the condition number κ(A): for a linear system Ax = b, the relative error of the solution is bounded by κ(A) times the relative error of the data.
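To make the contrast concrete, here is a minimal sketch (illustrative only, not code from the paper; it assumes NumPy and SciPy are available). It solves a small Toeplitz system of the kind Levinson-Durbin methods address and checks the classical perturbation guarantee that the relative solution error never exceeds κ(A) times the relative data error, an a priori output bound of exactly the kind the paper argues has no transformer analogue.

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

# Classical guarantee for A x = b:  ||dx||/||x|| <= kappa(A) * ||db||/||b||.
r = np.array([1.0, 0.5, 0.25, 0.125])     # autocorrelation sequence
A = toeplitz(r)                            # symmetric Toeplitz matrix
b = np.array([1.0, 0.0, 0.0, 0.0])
x = solve_toeplitz((r, r), b)              # Levinson-Durbin-style solve

kappa = np.linalg.cond(A)                  # 2-norm condition number kappa(A)

db = 1e-8 * np.random.default_rng(0).normal(size=4)  # perturb the data
x_pert = solve_toeplitz((r, r), b + db)

rel_err = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
bound = kappa * np.linalg.norm(db) / np.linalg.norm(b)
print(f"kappa(A) = {kappa:.3f}")
print(f"observed relative error {rel_err:.2e} <= a priori bound {bound:.2e}")
assert rel_err <= bound * (1 + 1e-6)       # guarantee holds (up to rounding)
```

The point of the sketch is the order of operations: the bound is computable before the perturbed solve runs. Nothing comparable is available for the composed approximation stack described above.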
Structural Reconstruction: A Measurable Consequence (§7)

The paper's central contribution connects the qualitative analysis to a quantitative measurement. Applying classical autocorrelation analysis to text-derived signals reveals a divergence in the condition number κ(A) of the autocorrelation matrix:

For signals with genuine structural regularity: κ(A) < 10^6 (well-conditioned, numerically solvable).

For GPT-generated text: κ(A) > 10^6 — systematically ill-conditioned, approaching computational infinity.

This divergence is a deductive consequence of the approximation stack, not an empirical curiosity. The condition number κ(A) serves as a phase-transition detector: a measurable threshold where GPT's locally reasonable approximations become globally unstable. This threshold is invisible to scaling laws, which measure average cross-entropy loss — not worst-case structural stability. (A runnable sketch of the measurement appears at the end of this overview.)

The Architectural Argument (§7.5)

The paper formulates SMRA as a deductive argument about systems without formal guarantees. A system that (1) cannot distinguish what information its outputs encode, (2) has no runtime mechanism to prohibit any output class, (3) cannot prove any input safe, and (4) retains uncontrollable artifacts of its training trajectory cannot formally exclude any output class: for any class X, there exists an input that elicits it. SMRA is the constructive proof for X = "output with recoverable structural metadata."

This follows from four architectural properties — not from any single component. The vulnerability is structural: it exists at every level (output, composition, training) and cannot be patched by replacing one component.

Mitigation Limits (§7.6)

The paper analyzes Index Server architectures as a proposed countermeasure and identifies a critical limitation: when Index Server outputs are fed back into the model's training pipeline (a routine practice; Shumailov et al., 2023), the structural fingerprint migrates from the retrieval layer into the model weights through a feedback loop, making the original document's structure recoverable from the model itself, without any retrieval step.

Relationship to Prior Work

This paper provides the mathematical explanation for the existence of the Structural Metadata Reconstruction Attack (SMRA):

Chudinov, Y. (2026). Structural Metadata Reconstruction Attack: How Document Outlines Enable LLM-Driven Intellectual Property Extraction. Zenodo. DOI: 10.5281/zenodo.19004697

The comparison with formally grounded systems references the mathematical chain used in:

Chudinov, Y. (2026). Dual-Layer SPO Architecture for Embedding-Based Index Ranking. Zenodo. DOI: 10.5281/zenodo.19261510

53 References

All claims are traced to original publications. Sources include Vaswani et al. (2017), Kingma & Ba (2014), He et al. (2015), Kaplan et al. (2020), Stevens (1946), Cauchy (1815), Toeplitz (1911), Levinson (1947), Durbin (1960), Bahdanau et al. (2015), and 37 others.
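Returning to the Section 7 measurement, below is a minimal sketch of the κ(A) diagnostic (illustrative only, not the paper's code; it assumes NumPy and SciPy and substitutes synthetic signals for the paper's text-to-signal encoding, which is not reproduced here).

```python
import numpy as np
from scipy.linalg import toeplitz

def autocorr_condition_number(x, n_lags=32):
    """kappa(A) of the autocorrelation (Toeplitz) matrix of signal x."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Biased autocorrelation estimate for lags 0 .. n_lags-1.
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(n_lags)]) / len(x)
    A = toeplitz(r)             # the matrix a Levinson-Durbin solve would face
    return np.linalg.cond(A)    # 2-norm condition number kappa(A)

rng = np.random.default_rng(0)
t = np.arange(4096)

# Genuine structural regularity: a periodic signal with mild noise.
structured = np.sin(2 * np.pi * t / 64) + 0.1 * rng.normal(size=t.size)
print(f"structured signal: kappa(A) = {autocorr_condition_number(structured):.2e}")

# No structure: white noise gives a near-identity autocorrelation matrix.
noise = rng.normal(size=t.size)
print(f"white noise:       kappa(A) = {autocorr_condition_number(noise):.2e}")
```

On these synthetic inputs both κ(A) values stay far below the 10^6 threshold; the paper's claim concerns where signals derived from GPT-generated text land, and reproducing that requires its text-to-signal encoding.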
Yurii Chudinov. DOI: https://doi.org/10.5281/zenodo.19337903