May 16, 2026 | Open Access

Adverse Treatment Rates in AI-Generated Legal Citations: A Cross-Model Empirical Analysis of Citation Validity Across Four Frontier Language Models

Abstract

Recent attention to artificial intelligence hallucination in legal contexts has focused on fabricated citations: cases that do not exist. This paper identifies a more consequential failure mode: real citations to overruled, abrogated, or vacated authority. We tested four frontier language models (Grok-3, GPT-5.4, Claude Sonnet 4.6, and Gemini 3 Flash Preview) by presenting identical legal questions and collecting 805 citations from their responses. After confirming 441 citations as existing in the CourtListener judicial opinion database, we performed treatment analysis on each using a cross-vendor adversarial verification architecture in which two independent model families classify citing opinions for negative treatment signals. Of the 441 verified citations from the four standard-tier models, 30 unique citations (6.8%) pointed to authority that has been overruled, abrogated, vacated, or otherwise negatively treated. An additional 17 citations produced model disagreement on treatment status. Every model tested cited such "dead law" at rates between 5.8% and 7.9%. These rates exceed the fabrication rates reported in Section IV.A for three of four standard-tier models tested. An inverse correlation was observed between fabrication rates and dead-law citation rates: the model with the lowest fabrication rate (Grok-3, 0%) had the highest dead-law rate (7.9%). Additionally, Anthropic's flagship model Opus 4.7 was tested on the same prompt corpus, producing a 6.1% dead-law rate and 0.5% fabrication rate, demonstrating that increased model capability reduces but does not eliminate dead-law citation risk. We validated the treatment analysis methodology against a ground-truth dataset of 64 known overruled Supreme Court cases derived from the Stanford RegLab dataset, achieving 96.9% recall with Stage 1 analysis and 100% detection (confirmed or flagged) with the full pipeline.
We argue that dead-law citations present greater malpractice risk than fabricated ones because they pass every surface-level verification check, and we discuss implications for legal AI procurement and professional responsibility.
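
The headline percentages in the abstract can be checked directly from the counts it reports. The following is a minimal sketch of that arithmetic only, not the paper's analysis code; the variable names are illustrative, and the Stage 1 detection count is not stated in the abstract but inferred here as the only whole number of cases out of 64 that rounds to 96.9%.

```python
# Counts stated in the abstract.
total_collected = 805      # citations collected from the four standard-tier models
verified_existing = 441    # citations confirmed to exist in CourtListener
negatively_treated = 30    # unique citations to negatively treated authority
disputed = 17              # citations where the two classifier families disagreed
ground_truth_cases = 64    # known overruled Supreme Court cases used for validation


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


dead_law_rate = pct(negatively_treated, verified_existing)
disputed_rate = pct(disputed, verified_existing)
confirmed_share = pct(verified_existing, total_collected)

# Assumption: the abstract gives only the 96.9% recall figure, so the number of
# Stage 1 detections is recovered as the integer k whose k/64 rounds to 96.9%.
stage1_hits = next(
    k for k in range(ground_truth_cases + 1)
    if abs(100 * k / ground_truth_cases - 96.9) < 0.05
)

print(f"dead-law rate among verified citations: {dead_law_rate}%")
print(f"disputed treatment status: {disputed_rate}%")
print(f"share of collected citations confirmed to exist: {confirmed_share}%")
print(f"implied Stage 1 detections: {stage1_hits} of {ground_truth_cases}")
```

Under these counts, the 6.8% figure is the share of verified citations, not of all 805 collected citations, which is the denominator the abstract specifies.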

Authors

Eric Swidey

Institutions

Third Wave Systems (United States)
