March 3, 2026 · Open Access

Evaluating the Influence of Synthetic Data on LLM Performance

Key Points

Using synthetic data appears to degrade model capabilities, which raises concerns about data quality.
Models trained on a smaller, human-only dataset outperformed those trained on datasets that include synthetic content.
The analysis used the Fineweb dataset and a generated-content detection model from UncovAI to assess pre-training behaviour.
The results highlight the importance of human-generated content for the performance of large language models.

Abstract

An overwhelming amount of generated content can be found online and in large language model (LLM) training datasets. This raises the question of how such data, generated in uncontrolled environments, affects the pre-training of these models. In this paper, we use an open-source dataset called Fineweb and the generated-content detection model provided by UncovAI to analyze the pre-training behaviour of LLMs on three datasets: one containing synthetic data, one from which the synthetic data has been removed, and one from which part of the human data has been removed. We show that using synthetic data seems to deteriorate the model's capabilities and that the model trained on less data, all of it human, performs better.
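The three-way comparison described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `build_variants` is a hypothetical helper, `is_synthetic` is a stand-in callable and not UncovAI's actual API, and the choice to drop as many human documents as there are synthetic ones (so that the reduced mixed set matches the size of the human-only set) is a guess, since the abstract does not say how much human data was removed.

```python
import random
from typing import Callable, Dict, Iterable, List


def build_variants(
    docs: Iterable[str],
    is_synthetic: Callable[[str], bool],  # hypothetical detector stand-in
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split a corpus into the three training sets named in the abstract."""
    docs = list(docs)
    flags = [is_synthetic(d) for d in docs]

    human = [d for d, f in zip(docs, flags) if not f]
    synthetic = [d for d, f in zip(docs, flags) if f]

    # Assumption: remove as many human documents as there are synthetic
    # ones, so the reduced mixed set has the same size as the human-only set.
    keep = max(len(human) - len(synthetic), 0)
    rng = random.Random(seed)
    kept_human = rng.sample(human, keep)

    return {
        "full": docs,                              # human + synthetic
        "human_only": human,                       # synthetic removed
        "mixed_reduced": synthetic + kept_human,   # part of human removed
    }


# Toy usage with a trivial prefix-based "detector".
toy_docs = ["h1", "h2", "h3", "s1"]
variants = build_variants(toy_docs, lambda d: d.startswith("s"))
```

Matching the sizes of `human_only` and `mixed_reduced` is one way to separate the effect of data origin from the effect of data volume; other removal ratios would fit the abstract's description equally well.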

Cite this study

Barbaro et al. studied this question.

www.synapsesocial.com/papers/69a75faac6e9836116a2b419

Authors

Florian Barbaro

Anna Dyka

Fabio Palumbo
