This paper argues that heavy use of synthetic (model-generated) data in training and fine-tuning large language models creates a structural risk of “hall of illusions” collapse: models become increasingly well-adapted to their own outputs while drifting away from real-world data, especially in the long tail. I formalize a simple synthetic feedback loop in which a model is first trained on real data and then repeatedly retrained on mixtures of real and synthetic data. Using two transparent toy experiments (a 2D Gaussian mixture model and a tiny character-level n-gram language model), I show that as the synthetic fraction α and the number of generations increase, performance on held-out real data degrades and eventually collapses. In both cases, metrics on real test sets stay stable with no synthetic data, degrade under moderate synthetic use, and fail sharply when synthetic data dominates. The paper introduces the metaphor of a “hall of illusions” and a mirror-cavity analogy to explain why this behaviour is structurally expected rather than an anomaly. Beyond the toy setups, the paper discusses implications for real LLM pipelines, surveys partial mitigations (self-critique, preference models, process supervision, diversification), and argues that they do not remove the underlying risk at high synthetic fractions. I propose concrete tests and disclosure requirements, including reporting approximate synthetic fractions, running multi-generation collapse tests, and stress-testing long-tail performance, as a minimum standard before synthetic data can safely become a central pillar of scaling. Figures and example code for the toy experiments are included to make the results easy to reproduce.
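The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's code: it assumes a single 2D Gaussian in place of the paper's Gaussian mixture, and the function names (`fit_gaussian`, `avg_nll`, `run_loop`) and the default values for sample size and number of generations are hypothetical choices made for illustration. At each generation the model is refit on a mixture containing a fraction `alpha` of its own samples, and the average negative log-likelihood on a fixed real test set is recorded.

```python
import numpy as np


def fit_gaussian(x):
    """Maximum-likelihood mean and covariance of the rows of x."""
    mean = x.mean(axis=0)
    # Small jitter keeps the covariance positive definite once it shrinks.
    cov = np.cov(x, rowvar=False, bias=True) + 1e-9 * np.eye(x.shape[1])
    return mean, cov


def sample(rng, mean, cov, n):
    """Draw n rows from N(mean, cov) via a Cholesky factor (n may be 0)."""
    chol = np.linalg.cholesky(cov)
    return rng.standard_normal((n, mean.shape[0])) @ chol.T + mean


def avg_nll(x, mean, cov):
    """Average negative log-likelihood of the rows of x under N(mean, cov)."""
    d = x.shape[1]
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * float(np.mean(maha) + logdet + d * np.log(2 * np.pi))


def run_loop(alpha, generations=300, n_train=50, seed=0):
    """Retrain on (1 - alpha) real + alpha synthetic data each generation.

    Returns the held-out real-data NLL after the initial fit and after
    every generation. The "real" distribution is an assumed standard
    2D Gaussian, not the paper's mixture.
    """
    rng = np.random.default_rng(seed)
    real_mean, real_cov = np.zeros(2), np.eye(2)
    test = sample(rng, real_mean, real_cov, 5000)

    # Generation 0: train on real data only.
    mean, cov = fit_gaussian(sample(rng, real_mean, real_cov, n_train))
    history = [avg_nll(test, mean, cov)]

    n_syn = int(round(alpha * n_train))
    for _ in range(generations):
        synthetic = sample(rng, mean, cov, n_syn)            # model's own output
        real = sample(rng, real_mean, real_cov, n_train - n_syn)
        mean, cov = fit_gaussian(np.vstack([real, synthetic]))
        history.append(avg_nll(test, mean, cov))
    return history


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        nll = run_loop(alpha)
        print(f"alpha={alpha:.1f}  first NLL={nll[0]:.3f}  last NLL={nll[-1]:.3f}")
```

In this simplified setting, the fully synthetic case (`alpha=1.0`) never sees fresh real data after generation 0, so estimation error compounds across generations instead of being corrected. How closely the intermediate values of `alpha` match the paper's reported curves is not something this sketch can establish.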
Lei Yu (Mon,) studied this question.
www.synapsesocial.com/papers/69402c4d2d562116f29029db — DOI: https://doi.org/10.5281/zenodo.17782033