An overwhelming amount of generated content can be found online and in large language model (LLM) training datasets. This raises the question of how such data, generated in uncontrolled environments, affects the pre-training of these models. In this paper, we use an open-source dataset called Fineweb and the generated-content detection model provided by UncovAI to analyze the pre-training behaviour of LLMs on three datasets: one containing synthetic data, one from which the synthetic data has been removed, and one from which part of the human data has been removed. We show that using synthetic data seems to deteriorate the model's capabilities and that the model trained on less, but exclusively human, data performs better.
Barbaro et al. (Wed,) studied this question.
Florian Barbaro
Anna Dyka
Fabio Palumbo