December 1, 2025 · Open Access

The Hall of Illusions: How Heavy Synthetic Data Training Erodes Real-World Performance

Key Points

Heavy training on synthetic (model-generated) data degrades performance on held-out real-world data and can end in collapse.
As the synthetic fraction α and the number of retraining generations increase, metrics on real test sets worsen.
The analysis formalizes a synthetic feedback loop and uses two toy models, which fail sharply when synthetic data dominates.
It calls for tests and disclosure standards to mitigate the risks of synthetic data in model training.

Abstract

This paper argues that heavy use of synthetic (model-generated) data in training and fine-tuning large language models creates a structural risk of “hall of illusions” collapse: models become increasingly well-adapted to their own outputs while drifting away from real-world data, especially in the long tail. I formalize a simple synthetic feedback loop in which a model is first trained on real data and then repeatedly retrained on mixtures of real and synthetic data. Using two transparent toy experiments (a 2D Gaussian mixture model and a tiny character-level n-gram language model), I show that as the synthetic fraction α and the number of generations increase, performance on held-out real data degrades and eventually collapses. In both cases, metrics on real test sets stay stable with no synthetic data, degrade under moderate synthetic use, and fail sharply when synthetic data dominates. The paper introduces the metaphor of a “hall of illusions” and a mirror-cavity analogy to explain why this behaviour is structurally expected, not an anomaly. Beyond the toy setups, the paper discusses implications for real LLM pipelines, surveys partial mitigations (self-critique, preference models, process supervision, diversification), and argues that they do not remove the underlying risk at high synthetic fractions. I propose concrete tests and disclosure requirements, including reporting approximate synthetic fractions, running multi-generation collapse tests, and stress-testing long-tail performance, as a minimum standard before synthetic data can safely become a central pillar of scaling. Figures and example code for the toy experiments are included to make the results easy to reproduce.
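The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's code: it swaps the 2D Gaussian mixture for a single 1D Gaussian fitted by maximum likelihood, and the names `run_loop`, `fit_gaussian`, `mean_nll`, `n_train`, and the default parameter values are all assumptions chosen for illustration. Each generation refits the model on a mixture of fresh real samples and samples drawn from the previous model, with `alpha` as the synthetic fraction, and the final model is scored by negative log-likelihood on held-out real data.

```python
import math
import random


def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a 1D sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    # Floor the variance so a fully collapsed model cannot divide by zero.
    return mu, max(var, 1e-12)


def mean_nll(xs, mu, var):
    """Average negative log-likelihood of xs under N(mu, var)."""
    total = sum(
        0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
        for x in xs
    )
    return total / len(xs)


def run_loop(alpha, generations=500, n_train=50, seed=0):
    """Retrain on (1 - alpha) real + alpha synthetic data for many generations.

    Assumption: the real distribution is a standard normal N(0, 1).
    Returns the final model's mean NLL on held-out real data.
    """
    rng = random.Random(seed)
    real_test = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    # Generation 0 is fitted on real data only.
    mu, var = fit_gaussian([rng.gauss(0.0, 1.0) for _ in range(n_train)])
    n_syn = round(alpha * n_train)
    for _ in range(generations):
        # Synthetic samples come from the previous generation's model.
        synthetic = [rng.gauss(mu, math.sqrt(var)) for _ in range(n_syn)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_train - n_syn)]
        mu, var = fit_gaussian(synthetic + real)
    return mean_nll(real_test, mu, var)


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(f"alpha={alpha:.1f}  real-data NLL={run_loop(alpha):.3f}")
```

In this simplified setting, a purely synthetic loop (`alpha=1.0`) lets the fitted variance drift with nothing to anchor it, so the score on real data worsens, while `alpha=0.0` stays close to the real distribution. The toy is only meant to show the mechanism and should not be expected to reproduce the paper's figures.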

Cite this study

Lei Yu studied this question.

www.synapsesocial.com/papers/69402c4d2d562116f29029db — DOI: https://doi.org/10.5281/zenodo.17782033

Also consider

Synapse has enriched closely related papers on similar questions. Consider them for comparative context:

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse· 2024 · 3 citations
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification· 2024 · 6 citations
