May 3, 2026 · Open Access

The Stationary Sea (Part 1: Substrate Construction): Measurement Instrument Validation for External Assessment of AI Governance in Regulated Financial Institutions

Key Points

The paper validates an external measurement instrument for assessing AI governance in regulated financial institutions.
Empirical validation covers 543 regulated institutions, 95,876 AI agents, and 626,390 governance edges across 66 countries.
A four-state evidence taxonomy (Observed, Derived, Inferred, Unknown) labels each edge, and an Empirical Prediction Architecture anchors validation in pre-specified falsifiable predictions.
A capture-replay audit at v12.1.1 on two reference institutions produced bit-identical outputs across runs; v13.1.0 inherits the determinism-relevant architecture unchanged.
Total edges per agent held at 6.53 in the founding cohort and 6.41 in the Tier 2 expansion, consistent with architectural envelope stability.
Sector-level mean governance scores span 8.82 points between sectors, exceeding within-sector dispersion (standard deviations 4.34 to 7.87) and supporting known-groups validity.
On the rescanned subset, the v13.1.0 scanner recovered more observed evidence than v11.1 in absolute terms, indicating expanded coverage rather than a regression in observed-edge recovery.

Abstract

This paper reports the validation of an external measurement instrument for AI governance across regulated financial institutions and documents three architectural commitments that have emerged through the v13.1.0 production cycle: a four-state evidence taxonomy (Observed, Derived, Inferred, Unknown; the four states abbreviate as ODIU) [1] that extends the three-tier classification of earlier scanner versions, an Empirical Prediction Architecture (EPA) that anchors validation in pre-specified falsifiable predictions, and a three-score family at the composite level whose substrate-level implications are articulated here while the rating methodology that operates on top of the substrate is documented in the companion publication. The instrument extracts governance topology from public evidence using large language models supplemented by web-scale search and the four-state evidence classification. The paper documents a three-era scanner lineage (v11.1 baseline, v11.1.1 attribution overlay, v12.1.1 production scanner) extended by a v13 corrective iteration locked at SHA256 e5250de8e9de07d6 and sealed as v13.1.0-production-day-zero-full on 29 April 2026, with cryptographic hashes at every pipeline stage, a four-source variance attribution framework, a capture-replay validation protocol, and a three-level cross-version comparability framework grounded in classical measurement theory (Cronbach and Meehl, 1955; Messick, 1995; Cronbach, Gleser, Nanda and Rajaratnam, 1972).

Empirical validation is drawn from a population-scale campaign of 543 regulated institutions covering 95,876 AI agents and 626,390 governance edges across 66 countries, scanned 20 to 29 April 2026 under the v13.1.0 scanner. Three findings bear on instrument validity. First, total edges per agent (E/A) remain stable at 6.53 across the founding cohort scanned 20 to 24 April 2026 and at 6.41 across the Tier 2 expansion scanned 27 and 28 April 2026, confirming the architectural envelope stability that was reported in the v8 publication of this paper. Second, sector-level mean governance scores track the theoretical gradient of prudential supervisory maturity, from pension funds (12.10) and sovereign wealth funds (12.54) through reinsurers (14.31), investment managers (16.27), exchanges (18.29), fintechs (18.60), and insurers (18.84) to banks (20.92), with a between-sector mean range of 8.82 points exceeding within-sector dispersion (standard deviations 4.34 to 7.87); the scanner has no sector input during scoring, providing known-groups validity per Cronbach and Meehl (1955). The sector ordering replicates the v8 result with refined estimates from the larger founding cohort. Third, head-to-head comparison of v11.1 and v13.1.0 on the rescanned subset shows that v13.1.0 recovers more observed evidence in absolute terms while additionally capturing structural inference that v11.1 did not recover, confirming that the proportional shift in evidence tier composition reflects expanded coverage rather than a regression in observed-edge recovery.

The evidence taxonomy (now four-state) is an epistemic-weight labelling system applied at the edge level, preserving interpretive discretion at the point of analysis. The empirical contribution introduced in the present version of this paper formalises the Empirical Prediction Architecture as a methodology contribution. EPA operationalises the discipline that pre-specified predictions are stated before substrate version sealing and either confirmed or falsified by the resulting empirical data. The architecture distinguishes three classes of prediction: architectural envelope predictions that follow structurally from methodology choices and must hold (e.g., strict monotonic ordering across the three-score family at the composite level for every institution), pre-specified predictions that constitute the methodology’s empirical bets at substrate sealing time (e.g., distribution moments, sector gradient ordering), and post-validation observations that surface patterns after sealing and may inform pre-specifications for future substrate versions. The v13.1.0 EPA evidence is reported in Section 9: all architectural envelope predictions held, all pre-specified predictions confirmed.

The four-state evidence taxonomy formalises a distinction that was implicit in earlier versions of the scanner architecture but not articulated as a fourth state. The three-tier classification (Observed, Derived, Inferred) treated the absence of evidence and the absence of state-knowledge as the same edge property: an edge that was not produced. The four-state taxonomy distinguishes the two cases. Unknown is operationally significant for measurement validity because it allows the instrument to report explicitly that a regulatory or governance dimension has not been ascertained for an institution rather than that the dimension is absent. The taxonomy is documented in Section 6.

The three-score family architectural commitment introduced in this paper specifies that the substrate supports three views of any composite score: a full-evidence view operating on the complete edge population, an evidence-stripped view operating on Observed and Derived edges only, and a hard-evidence view operating on Observed edges only. The three views are produced from the same underlying substrate by stripping evidence tiers; they are not independent measurements. The substrate-level implication of the three-score family is the architectural commitment that evidence-tier stripping at the edge level produces a structurally well-defined alternative substrate.
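The tier-stripping idea can be illustrated with a minimal Python sketch. Everything named here (`EvidenceState`, `Edge`, `VIEWS`, `strip_view`, the sample edges) is hypothetical and chosen for illustration; the paper does not publish its data model, and the sketch assumes the full-evidence view simply retains every edge. It shows only that the three views are nested filters over one edge population, not how any score is computed.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceState(Enum):
    """The four ODIU states named in the abstract."""
    OBSERVED = "observed"
    DERIVED = "derived"
    INFERRED = "inferred"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Edge:
    """Hypothetical governance edge carrying an edge-level evidence label."""
    source: str
    target: str
    state: EvidenceState


# Assumed mapping from view name to the evidence states that view keeps.
VIEWS = {
    "full_evidence": set(EvidenceState),  # complete edge population
    "evidence_stripped": {EvidenceState.OBSERVED, EvidenceState.DERIVED},
    "hard_evidence": {EvidenceState.OBSERVED},
}


def strip_view(edges, view):
    """Return the edges that survive evidence-tier stripping for one view."""
    keep = VIEWS[view]
    return [e for e in edges if e.state in keep]


# Illustrative edges only; not data from the study.
sample = [
    Edge("agent-1", "policy-A", EvidenceState.OBSERVED),
    Edge("agent-1", "committee-B", EvidenceState.DERIVED),
    Edge("agent-2", "policy-A", EvidenceState.INFERRED),
    Edge("agent-2", "regulator-C", EvidenceState.OBSERVED),
]

counts = {view: len(strip_view(sample, view)) for view in VIEWS}
```

Because each view keeps a subset of the states kept by the previous one, the surviving edge sets are nested, which is why the views are alternative readings of one substrate rather than independent measurements.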
The composite-level operationalisation of this commitment, including the specific calibration of the rating methodology under each view, is articulated in the companion publication: Stationary Sea Part 2 (Rating Methodology). This paper does not articulate the rating methodology; it articulates the substrate that the rating methodology operates on.

Three hypotheses are tested. H1: the production scanner version has reached diminishing returns in evidence recovery (confirmed by structural exhaustion of context stratification and by E/A stability). H2: aggregate metrics demonstrate population-scale stability under the default sampling regime (confirmed by E/A at 6.53 across the founding cohort, by sector gradient stability, and by the v13.1.0-production-day-zero-full sealed snapshot fingerprinted with SHA256 e5250de8e9de07d6). H3: the instrument satisfies known-groups validity in the classical measurement sense, reproducing the theoretically predicted sector-level governance gradient from public evidence alone, without sector-level input during scoring (confirmed by sector ordering pension funds < sovereign wealth funds < reinsurers < investment managers < exchanges < fintechs < insurers < banks, with a between-sector mean range of 8.82 points exceeding within-sector dispersion).

The paper distinguishes instrument-level determinism from per-scan determinism receipts. Determinism as a property of the instrument was certified at v12.1.1 through a capture-replay audit on two reference institutions (anonymised as European G-SIB B and a second G-SIB) producing bit-identical outputs across runs. The v13.1.0 scanner inherits the determinism-relevant architecture (model, prompts, temperature, seeding) unchanged from v12.1.1, and the property carries forward by architectural inheritance. The DETERMINISTICMODE flag controls whether full per-call receipts and run-metadata sidecars are written to the audit trail; it does not alter scanner behaviour.
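A bit-identical comparison of the kind a capture-replay audit relies on can be sketched in Python with the standard `hashlib` module. The function names and the sample byte strings are hypothetical, and this is an assumption about how such a check might look rather than the audit procedure the paper used: the captured output and the replayed output are each fingerprinted with SHA-256, and the run counts as deterministic only if the digests agree.

```python
import hashlib


def fingerprint(raw: bytes) -> str:
    """Hex SHA-256 digest of the raw bytes of one scanner output."""
    return hashlib.sha256(raw).hexdigest()


def replay_matches(captured: bytes, replayed: bytes) -> bool:
    """True only when the replayed output has the same digest as the capture."""
    return fingerprint(captured) == fingerprint(replayed)


# Illustrative payloads only; not outputs from the scanner.
captured_run = b'{"institution": "example", "edges": 3}'
identical_replay = b'{"institution": "example", "edges": 3}'
drifted_replay = b'{"institution": "example", "edges": 4}'

same = replay_matches(captured_run, identical_replay)
different = replay_matches(captured_run, drifted_replay)
```

Hashing raw bytes rather than parsed structures keeps the check strict: any difference in serialisation, ordering, or whitespace counts as a mismatch.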
Within the 543-institution sealed snapshot, a subset was scanned with the flag on (full receipts captured) and the remainder with the flag off (architectural-inheritance regime), with both subsets using the same release_hash. From the May 2026 monthly snapshot onward, full per-call receipt capture becomes the operational standard for all production runs.

JEL Classification: G28, G38, K22, O33, C63, C81.

[1] “Unknown” denotes “we have not established whether the dimension applies”; it is operationally distinct from “no” (observed absence) and from the three known states (Observed, Derived, Inferred). The Unknown state surfaces at the agent level for regulatory text dimensions where the value set is yes, likely, no, unknown, and at the institution level as an explicitly-reported population for any analysis. ODIU is used throughout the paper as the unified four-state mnemonic.
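The sector gradient quoted in the abstract can be checked arithmetically. The short Python sketch below uses only the sector means and the upper within-sector standard deviation reported above; the variable names are illustrative, and the check covers the summary statistics only, not the underlying institution-level data.

```python
# Sector means as quoted in the abstract, listed in the predicted order.
SECTOR_MEANS = {
    "pension funds": 12.10,
    "sovereign wealth funds": 12.54,
    "reinsurers": 14.31,
    "investment managers": 16.27,
    "exchanges": 18.29,
    "fintechs": 18.60,
    "insurers": 18.84,
    "banks": 20.92,
}

# Largest within-sector standard deviation quoted in the abstract.
MAX_WITHIN_SD = 7.87

means = list(SECTOR_MEANS.values())

# Strictly increasing means reproduce the stated sector ordering.
is_strictly_increasing = all(a < b for a, b in zip(means, means[1:]))

# Between-sector range: highest sector mean minus lowest sector mean.
mean_range = round(max(means) - min(means), 2)

exceeds_dispersion = mean_range > MAX_WITHIN_SD
```

The range works out to 20.92 minus 12.10, which is 8.82 points, matching the reported figure and exceeding the largest quoted within-sector standard deviation of 7.87.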

Authors

William M. Collins
