April 10, 2026 · Open Access

VKB-Training: Epistemic Metadata, Ontology Attention, and Cultural Compilers for Hallucination Reduction in Large Language Models

Key Points

The study aims to reduce hallucinations in large language models by integrating epistemic metadata and ontological frameworks.
Introduces VKB-Training, which assigns each training sample one of six epistemic tags (Fact, Model, Value, Hypothesis, BlindSpot, Ontology).
Proposes a four-stage hybrid annotation pipeline that combines AI and human input.
Proposes eight training mechanisms intended to improve model reliability, including confidence-weighted loss and ontology attention.
Reports no experimental results; all quantitative claims about improved accuracy are hypothetical.
Identifies seven open problems that must be addressed before VKB-Training's effectiveness can be validated.

Abstract

Large language models hallucinate because their training data carries no epistemic metadata: facts, hypotheses, value judgments, and acknowledged unknowns occupy the same embedding space with identical weight. A deeper problem compounds this: every claim presupposes an ontology, that is, an axiomatic framework equipped with a metric, and as Bertrand's paradox demonstrates, probability itself is ill-defined until the measure is specified. Deeper still, the same ethical truth can be expressed in culturally distinct "coordinate systems," and collapsing these into a single representation introduces systematic bias.

We propose VKB-Training (Verified Knowledge Base Training), a data-centric approach that assigns each training sample a six-category epistemic tag (Fact, Model, Value, Hypothesis, BlindSpot, Ontology), a calibrated confidence score, a provenance chain, and an ontology identifier specifying the axiomatic framework under which the claim is asserted.

We introduce a four-stage hybrid annotation pipeline:
(1) AI triangulation: multiple LLMs classify independently; inter-model disagreement signals normative content (the "Caesar/God boundary").
(2) Human sampling with axiom extraction: domain annotators resolve high-disagreement cases; recurrent decision principles are extracted as reusable rules.
(3) Expert calibration with reputation weighting: a formalization of Galton's ox-weighing insight (per S.V.E. XI, DOI: 10.5281/zenodo.18109198).
(4) Logical consistency filters: contradiction detection and symmetry verification via the CGS Method (DOI: 10.5281/zenodo.18776172).
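
As a rough illustration of the per-sample metadata and the triangulation stage described above, the following sketch shows one possible record layout and a disagreement score. The abstract does not specify a schema or a disagreement measure, so the class `AnnotatedSample`, the functions `vote_entropy` and `needs_human_review`, and the threshold value are all hypothetical; Shannon entropy of the votes is used here only as one plausible choice.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from math import log2


class EpistemicTag(Enum):
    """The six categories named in the abstract."""
    FACT = "Fact"
    MODEL = "Model"
    VALUE = "Value"
    HYPOTHESIS = "Hypothesis"
    BLIND_SPOT = "BlindSpot"
    ONTOLOGY = "Ontology"


@dataclass
class AnnotatedSample:
    """Hypothetical record layout; field names are assumptions."""
    text: str
    tag: EpistemicTag
    confidence: float                      # calibrated score in [0, 1]
    provenance: list[str] = field(default_factory=list)
    ontology_id: str = "unspecified"       # framework the claim is asserted under


def vote_entropy(votes: list[EpistemicTag]) -> float:
    """Shannon entropy (in bits) of the label votes; zero means full agreement."""
    if not votes:
        raise ValueError("at least one vote is required")
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def needs_human_review(votes: list[EpistemicTag], threshold: float = 0.5) -> bool:
    """Route a sample to stage (2) when classifiers disagree enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return vote_entropy(votes) > threshold
```

Under this sketch, three classifiers voting Fact, Fact, Value would be routed to human annotators, while three unanimous votes would not.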
Eight training mechanisms are proposed:
(1) confidence-weighted loss;
(2) provenance-aware attention;
(3) BlindSpot training, which maximizes output entropy at known knowledge gaps;
(4) confidence propagation through DAG-structured knowledge dependencies;
(5) temporal embeddings for version-aware knowledge;
(6) ontology attention: switching between axiomatic frameworks with an entropy-based selection cost;
(7) cultural compilers: orthonormal transformations preserving distance to an ethical kernel, with universal archetypal bases discovered via joint diagonalization of cross-cultural covariance matrices (S.V.E. VIII);
(8) CogOS integration: recursive ontology refinement and Lyapunov-stable ethical dynamics (per CogOS, DOI: 10.5281/zenodo.18109244).

Meta-ontological transparency. VKB itself operates within the S.V.E. ontological hypothesis (defined in S.V.E. IV, VIII, XII). We make this dependency explicit: the six epistemic categories are postulated, not derived; confidence scores presuppose a probabilistic interpretation; the ethical kernel Φ and the δ-dehumanization metric depend on choices we acknowledge but do not resolve. VKB's categories are hypotheses subject to revision through empirical contact with reality, following the S.V.E. feedback loop (Reality → Ontology → Language → Models → Verification → Feedback → Ontology).

Honest limitations. The paper reports no experimental results. All quantitative claims are hypothetical. We enumerate seven open problems explicitly: scalability of annotation (the required fraction is unknown); reductionism risk in the δ-metric (a useful heuristic, not a theory of ethics); potential collapse of ontology attention; idealized orthonormality in cultural compilers; absence of experiments (the most important next step); dependency on unpublished S.V.E. preprints (a provisional foundation, made explicit); and the "first computable metric" claim (which may be incorrect; we welcome corrections).
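
Mechanisms (1) and (3) above lend themselves to a small numerical sketch. The abstract gives no formulas, so both functions below are assumptions: `confidence_weighted_nll` scales each sample's negative log-likelihood by its confidence score and normalizes by the total weight, and `entropy_gap` measures how far a predicted distribution falls short of maximum entropy, a quantity that a BlindSpot objective could minimize. The function names and the clamping constant are hypothetical.

```python
from math import log

EPS = 1e-12  # clamp to avoid log(0); an illustrative choice


def confidence_weighted_nll(target_probs: list[float], confidences: list[float]) -> float:
    """Confidence-weighted mean negative log-likelihood.

    target_probs[i] is the probability the model assigned to the correct
    token of sample i; confidences[i] is that sample's score in [0, 1].
    """
    if len(target_probs) != len(confidences):
        raise ValueError("inputs must have the same length")
    total_weight = sum(confidences)
    if total_weight == 0:
        return 0.0
    weighted = sum(c * -log(max(p, EPS)) for p, c in zip(target_probs, confidences))
    return weighted / total_weight


def entropy_gap(probs: list[float]) -> float:
    """log(K) minus the entropy of probs (in nats).

    Zero for a uniform distribution over K outcomes, log(K) for a one-hot one.
    """
    entropy = -sum(p * log(p) for p in probs if p > 0)
    return log(len(probs)) - entropy
```

With this weighting, a sample whose confidence is zero contributes nothing to the loss, and a low-confidence sample pulls the parameters less than a well-attested fact does.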
The mathematical argument for non-discriminatory deployment is structural: joint diagonalization requires input from all cultures, and excluding cultures violates orthonormality, so the mathematics itself enforces non-discrimination.

VKB-Training was first described as part of the CogOS framework. Cultural compilers and joint diagonalization originate from S.V.E. VIII (Divine Mathematics). This paper integrates these components into a standalone proposal with a falsifiable experimental protocol and pre-specified success thresholds. Section 7 (Ethical Data Sourcing: Author Revenue Sharing, 10–50%) is included in the preprint but will be omitted from the workshop submission.

NOTE: ILLUSTRATIVE NUMBERS (WIP). Prepared for submission to NeurIPS 2025 Workshops.
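
The distance-preservation property that the abstract attributes to cultural compilers can be illustrated with the simplest orthonormal transformation, a 2D rotation. This is only a toy sketch: it assumes the ethical kernel can be represented as a point in the same vector space as the claim, which the abstract does not state, and it does not attempt the joint diagonalization step. The helper names `rotation` and `apply` are hypothetical.

```python
from math import cos, dist, pi, sin


def rotation(theta: float) -> list[list[float]]:
    """2x2 rotation matrix, a standard example of an orthonormal transformation."""
    return [[cos(theta), -sin(theta)],
            [sin(theta), cos(theta)]]


def apply(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Matrix-vector product for small dense matrices."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


# Hypothetical embedding of a claim and of the kernel in one "coordinate system".
claim = [3.0, 4.0]
kernel = [1.0, 0.0]

# Re-express both in another coordinate system via an orthonormal map.
compiler = rotation(pi / 3)
before = dist(claim, kernel)
after = dist(apply(compiler, claim), apply(compiler, kernel))

# A non-orthonormal map (anisotropic scaling) for contrast.
scaling = [[2.0, 0.0], [0.0, 1.0]]
distorted = dist(apply(scaling, claim), apply(scaling, kernel))
```

Here `before` and `after` agree up to floating-point rounding, while `distorted` differs, which is the sense in which an orthonormal map leaves the distance to the kernel unchanged and a non-orthonormal one does not.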

Authors

Artiom Kovnatsky

Institutions

Laboratoire Spécification et Vérification
