June 3, 2024 | Open Access

A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl

Abstract

Common Crawl is the largest freely available collection of web crawl data and one of the most important sources of pre-training data for large language models (LLMs). It is used so frequently and makes up such large proportions of the overall pre-training data in many cases that it arguably has become a foundational building block for LLM development, and subsequently generative AI products built on top of LLMs. Despite its pivotal role, Common Crawl itself is not widely understood, nor is there much reflection evident among LLM builders about the implications of using Common Crawl's data. This paper discusses what Common Crawl's popularity for LLM development means for fairness, accountability, and transparency in generative AI by highlighting the organization's values and practices, as well as how it views its own role within the AI ecosystem. Our qualitative analysis is based on in-depth interviews with Common Crawl staffers and relevant online documents.

Cite this study

Stefan Baack studied this question.

www.synapsesocial.com/papers/68e665f2b6db6435875f20e4 — DOI: https://doi.org/10.1145/3630106.3659033