This preprint presents empirical evidence of four related vulnerabilities in large language model systems that combine to produce a novel threat class — the Structural Metadata Reconstruction Attack (SMRA).

Discovery Context

I discovered the vulnerability while benchmarking two specification-querying architectures: a deterministic MCP-based navigator (described in the predecessor paper, DOI: 10.5281/zenodo.18944351) and a standard context-stuffing (naive RAG) approach. The anomaly was first observed and characterized across the full Anthropic model spectrum (Haiku, Sonnet, Opus) because these were the models integrated into the benchmarking pipeline. Anthropic was the discovery platform, not the target: the choice was driven by tooling availability, not vendor selection. Full cross-vendor reproduction with 10 models from 3 vendors (Anthropic, OpenAI, Google), spanning entry-level and flagship tiers, confirmed the mechanism is systemic across all major LLM providers (see Cross-Vendor Reproduction below).

The naive baselines exhibited anomalous fabrication patterns that could not be explained by standard hallucination models: WHY-type and conditional (WHEN-type) queries produced the most aggressive and structurally coherent fabrications, while HOW and WHAT queries showed markedly lower fabrication rates. As the sole author of the target specification (~700 pages, written over one year, unpublished), I possess complete knowledge of every section's content and was therefore uniquely positioned to recognize that the LLM outputs, while structurally faithful, terminologically authentic, and superficially authoritative, systematically inverted the specification's deliberate departures from industry conventions.
A parallel verification confirmed that the specification's original coinages are absent from the CS literature (Google Scholar, ACM DL, IEEE Xplore, arXiv), ensuring that every fabricated claim originates from the model's training priors projected onto the document's table of contents, not from memorized source text.

Four Findings

Finding 1 — Structural Metadata Reconstruction Attack (SMRA). When an LLM receives a document's table of contents (TOC) without body text, it systematically reconstructs plausible but fabricated content by projecting training knowledge onto structural metadata. In a controlled experiment using a proprietary specification containing original coinages absent from any training corpus, 10 models from 3 vendors (Anthropic: Haiku, Sonnet, Opus; OpenAI: GPT-4o, GPT-4o-mini; Google: Gemini 2.0 Flash, Gemini 2.5 Pro, Gemini 3.0 Flash, Gemini 3.0 Pro) produced SMRA rates of 8–28% under naive conditions while using the author's terminology, citing real section numbers, and reading as authoritative. The mechanism is systemic across all major LLM providers, model tiers, and architecture generations.

Finding 2 — Confidence–Capability Inversion (CCI). Stronger models are not merely wrong; they are more dangerously wrong. Under structural metadata leakage, Opus produces zero honest refusals across 20 questions of which 18 require absent information, while Haiku refuses 9 times. Each step up the capability ladder produces proportionally less detectable fabrication with fewer epistemic signals.

Finding 3 — RAG Scope Mismatch. The trigger condition — metadata scope exceeding content scope — is not an exotic scenario but the default architecture of most RAG systems. Standard practice (include the document TOC plus section summaries for "context") creates exactly the fabrication surface demonstrated in Findings 1 and 2.

Finding 4 — Scope Displacement as Content Extraction.
A question about absent content does not merely trigger fabrication; it acts as an extraction query that reorganizes real content from loaded sections into a derivative document the author never wrote. Even without TOC leakage, the question itself is sufficient to extract and restructure loaded content into a form optimized for the questioner's purpose. This transforms hallucination from an accuracy problem into unauthorized intelligence gathering.

Cross-Vendor Reproduction

The SMRA mechanism was characterized across 10 models from 3 vendors, spanning entry-level to flagship tiers. All models were tested under 5 experimental conditions: A (full-TOC), A' (no-summary), B (mini-TOC), C (MCPi — tool-assisted retrieval), and D (MCPi + grounding prompt).

| Vendor | Models | Model tier | Naive SMRA rate | MCPi SMRA rate | Convergence pattern |
|---|---|---|---|---|---|
| Anthropic | Haiku, Sonnet, Opus | Entry → flagship | 13–28% | 1.3–5.0% | CCI gradient; Opus worst naive, best MCPi refusal rate |
| OpenAI | GPT-4o, GPT-4o-mini | Mid → flagship | 8–19% | 0.8% | Lowest MCPi SMRA; GPT-4o best overall performer |
| Google | Gemini 2.0 Flash, 2.5 Pro, 3.0 Flash, 3.0 Pro | Entry → flagship | 10–22% | 1.3–3.8% | Generational improvement; 3.0 Pro cleanest among Google |

Key convergence: when the specification deliberately departs from industry conventions (e.g., no implicit conversions, nominal typing, fixed-width encoding), models from all three vendors converge on the same wrong answer, the training-data default from C#/Java/Protobuf. Annex I documents 7 semantic clusters where this convergence is strongest.

Mechanism: The Two-Key Cipher

The reconstruction mechanism is formalized as:

- Key 1 (TOC) — provides structural scaffolding: section numbers, heading text, hierarchical organization
- Key 2 (Training corpus) — provides domain content: standard CS patterns, common PL conventions

Neither key alone enables reconstruction.
Together, they produce confident, section-cited, terminologically authentic fabrications that would pass casual review by a non-specialist. The mechanism is architecturally inevitable: multi-head attention over near-complete domain coverage in training data means that 7–10% of structural information suffices for full content reconstruction.

Quantitative Contributions

- Calibration Retention Rate (CRR) — measures how much epistemic calibration a model retains under metadata leakage (Opus: 0%, Haiku: 47%)
- SMRA-score — per-question metric combining fabrication detection, source attribution, and epistemic signal presence
- Information-theoretic quantification — formal analysis of the reconstruction threshold as a function of heading informativeness and training corpus coverage
- Fabrication taxonomy (Annex C) — five categories of structural metadata fabrication with examples

Implications

- RAG system design: >80% of production RAG deployments use the vulnerable architecture (metadata scope > content scope).
- Data classification: existing frameworks (GDPR, HIPAA, PCI DSS, ISO 27001, NIST SP 800-53, SOC 2, DTSA, EU Directive 2016/943) classify sensitivity by content — a TOC contains no PII, so it is "non-sensitive." SMRA invalidates this: structural metadata from a confidential source inherits that source's confidentiality, because a language model can reconstruct the protected content from metadata alone. Organizations must reclassify structural metadata as sensitive data.
- Regulatory blind spot: neither the EU AI Act nor US Executive Order 14110 (revoked 20 January 2025) addresses context-design-driven vulnerabilities.
- Model evaluation: standard "helpfulness" and "coherence" metrics reward confident fabrication; SMRA-affected outputs score highly on both.
- Intellectual property exposure: any structured document with descriptive headings becomes vulnerable when its outline is accessible alongside an LLM.

Mitigation

A single architectural fix — grounded retrieval via an MCP Index Server (MCPi), a Model Context Protocol server with deterministic, index-based navigation — reduces SMRA rates from 16–18% (naive) to 2–3% (MCPi). Under MCPi conditions, even the weakest model improves dramatically, and the best performer (GPT-4o) reaches 0.8% SMRA. Adding a grounding prompt (Condition D) provides marginal additional improvement (aggregate: 3.0% → 2.2%). Architecture beats parameters.

Deterministic retrieval infrastructure (weighted indexes, tier-based extraction, algorithmic reading plans) also provides an enforceable control point for sensitive data: unlike probabilistic RAG, where metadata is injected into the context and the model decides what to do with it, deterministic retrieval makes the scope boundary structurally auditable.
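The scope-boundary property of deterministic retrieval can be sketched in a few lines. This is a minimal illustration of the principle (metadata is never exposed for sections whose body text is not loaded), not the paper's actual MCPi implementation; the names `SpecIndex` and `retrieve` are hypothetical.

```python
# Minimal sketch of a scope-bounded retrieval surface, assuming a simple
# section-id -> body-text index. Illustrative only; not the MCPi API.
from dataclasses import dataclass, field


@dataclass
class SpecIndex:
    """Deterministic index over loaded sections only."""
    sections: dict = field(default_factory=dict)  # e.g. {"3.2": "..."}

    def retrieve(self, section_id: str) -> str:
        # Scope rule: a section is answerable only if its body is loaded,
        # so metadata scope can never exceed content scope.
        if section_id not in self.sections:
            # Honest, auditable refusal instead of letting the model
            # improvise content from a leaked TOC entry.
            return f"[out of scope] section {section_id} is not loaded"
        return self.sections[section_id]


index = SpecIndex(sections={"3.2": "Nominal typing; no implicit conversions."})
print(index.retrieve("3.2"))  # grounded answer from loaded body text
print(index.retrieve("7.1"))  # absent section -> explicit refusal
```

The design point is that the refusal is produced by infrastructure, not by the model's calibration, which is exactly the property the CCI finding shows cannot be relied upon in stronger models.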
Practitioner Protocol

Annex H provides a complete testing protocol for assessing RAG deployments against SMRA:

- Calibration baseline → exploit comparison methodology
- Token analysis and honest refusal tracking
- Decision thresholds for remediation
- Scope alignment implementation patterns (Annex F)

Supplementary Materials

- Annex A–D: claim classification definitions, per-question token analysis, fabrication taxonomy, SMRA attack algorithm
- Annex E: author-coined term verification (10 terms, 4 search engines, 0 matches)
- Annex F: RAG scope alignment implementation patterns (3 remediation architectures)
- Annex G: CCI formal definition and severity scale
- Annex H: SMRA testing methodology for practitioners
- Annex I: canary word cluster projection — 7 semantic clusters extracted from 160 naive-condition runs across 8 models, convergence scoring (up to 7/8 models converging), model capability profiles (4 behavioral types), endianness split analysis, and cross-model escalation projections (3× amplification factor)

Companion Data

All benchmark data supporting this paper are included:

- Raw answer dumps (960 runs across 20 questions, 10 models, and 5 conditions)
- Calibration baselines (mini-TOC control) and exploit runs (full-TOC)
- Cross-vendor comparison matrix
- Token usage and timing data per question per model
- The 20 evaluation questions targeting out-of-scope specification content
- Detailed evidence analysis (toc-leakage-analysis.md): step-by-step fabrication mechanism documentation with heading-to-claim mapping tables, side-by-side comparisons against real specification text, proof-of-source tests, fabric…
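Given run records of the kind listed above, the two headline metrics can be computed straightforwardly. The sketch below is an assumption-laden illustration: the record layout (`fabricated`, `refused` flags) is hypothetical, and the CRR formula shown is one plausible reading of the definition in Quantitative Contributions (refusals retained under full-TOC leakage relative to the mini-TOC calibration baseline), not the paper's published formula.

```python
# Hypothetical sketch of metric aggregation over companion-data run records.
# Field names and the CRR formula are illustrative assumptions.

def smra_rate(runs):
    """Fraction of runs whose answer was flagged as fabricated."""
    flagged = sum(1 for r in runs if r["fabricated"])
    return flagged / len(runs)


def crr(exploit_runs, baseline_runs):
    """Calibration Retention Rate: share of baseline refusals a model
    still produces under metadata leakage (0.0 = all calibration lost)."""
    base = sum(1 for r in baseline_runs if r["refused"])
    kept = sum(1 for r in exploit_runs if r["refused"])
    return kept / base if base else 0.0


# Toy example: a model refuses 9 of 20 questions at baseline but never
# refuses once the full TOC is leaked, fabricating on 4 questions.
baseline = [{"fabricated": False, "refused": i < 9} for i in range(20)]
exploit = [{"fabricated": i < 4, "refused": False} for i in range(20)]
print(smra_rate(exploit))      # 0.2
print(crr(exploit, baseline))  # 0.0
```

Under this reading, the Opus result (CRR = 0%) corresponds to zero retained refusals under leakage, matching the Finding 2 numbers.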
Yurii Chudinov
DOI: https://doi.org/10.5281/zenodo.19004697