This article sets out ways that corpus literacy can be taught in the digital humanities classroom to illuminate for students the practical steps and curatorial decisions that go into constructing a corpus, and the implications of these decisions for the computational text analysis that follows. It proposes a framework that resonates with the principles of minimal computing while also leading students to interrogate the resources and the labor required for constructing textual corpora. It suggests critical readings that can be used in tandem with an exploration of the Google Ngram Viewer and the Google Books project whose data underlies it, as a way into understanding the limitations of Google's digitization project and the importance of reliable metadata and robust OCR (optical character recognition), as well as the historical contingency of projects claiming to widen access to information. It lays out ways to lead students through the practicality of building their own corpus, from undertaking OCR on their own devices to the cleaning and structuring steps that, undertaken collaboratively with others, bring awareness to concerns including file naming conventions, logical directory structures, accurate metadata, and version control, while also fostering the crucial digital humanities (DH) skill of being able to work collaboratively. This kind of corpus literacy is, I argue, not only compatible with a minimal computing approach but one of the starting points from which a broader program of critical AI literacy might begin.
Anouk Lang (ORCID: 0000-0001-9597-1026)