When a language model confidently states that the capital of Germany is Paris, something has gone wrong inside the model before that word ever appears. This paper investigates what. We ask whether hallucinations follow consistent, detectable patterns in the internal activations of transformer models prior to the generation of an incorrect token, and find that they do. Through experiments on a custom 806K-parameter transformer and GPT-2 (124M parameters), tested across 20,000 factual prompts in 7 knowledge categories, we identify two named phenomena. Relation Dropout: attention to the semantic relation token collapses in the final transformer block of small models before a hallucination occurs. Last-Layer Suppression: factual knowledge emerges correctly in blocks 10–11 of GPT-2 but is systematically overridden by block 12. We propose a three-type hallucination taxonomy, release HallScan (pip install hallscan), an open-source detection tool, and HallBench, a labeled benchmark of 20,000 annotated examples.
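The two phenomena named above can be read as simple checks over per-block measurements. The sketch below is illustrative only: it is not HallScan's API, and the function names, the input layout, and the 0.5 ratio threshold are all assumptions made here, not values from the paper. It assumes you already have, for one prompt, the attention weight from the final position to the relation token in each block, and the top-1 token decoded from each block's hidden state by whatever method you prefer.

```python
from typing import Sequence


def relation_attention_collapsed(attn_to_relation: Sequence[float],
                                 ratio: float = 0.5) -> bool:
    """Hypothetical Relation Dropout check.

    attn_to_relation[i] is the attention weight on the relation token
    in block i + 1. The check flags a collapse when the final block's
    weight drops below `ratio` times the mean of the earlier blocks.
    The threshold is an assumption, not a value from the paper.
    """
    if len(attn_to_relation) < 2:
        raise ValueError("need at least two blocks to compare")
    earlier = attn_to_relation[:-1]
    baseline = sum(earlier) / len(earlier)
    return attn_to_relation[-1] < ratio * baseline


def last_layer_suppressed(top1_per_block: Sequence[str],
                          correct: str,
                          emerge_blocks: Sequence[int] = (10, 11)) -> bool:
    """Hypothetical Last-Layer Suppression check.

    top1_per_block[i] is the top-1 token decoded from block i + 1.
    Returns True when the correct token is top-1 in at least one of
    `emerge_blocks` (1-indexed) but the final block predicts something else.
    """
    emerged = any(
        top1_per_block[b - 1] == correct
        for b in emerge_blocks
        if 1 <= b <= len(top1_per_block)
    )
    return emerged and top1_per_block[-1] != correct


# Made-up illustrative inputs, not data from the paper.
small_model_attn = [0.30, 0.28, 0.31, 0.05]
gpt2_top1 = ["the"] * 9 + ["Berlin", "Berlin", "Paris"]

collapse_flag = relation_attention_collapsed(small_model_attn)
suppression_flag = last_layer_suppressed(gpt2_top1, correct="Berlin")
```

With these made-up inputs both flags come out true: the final-block attention is well under half the earlier average, and "Berlin" appears at blocks 10 and 11 before being replaced at block 12.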
Nikhil Upadhyay
Nikhil Upadhyay (Mon,) studied this question.
www.synapsesocial.com/papers/69f6e62e8071d4f1bdfc6cb6 — DOI: https://doi.org/10.5281/zenodo.19934537