Entity linking is the task of linking textual mentions to entities in a knowledge graph such as Wikidata. For morphologically rich languages like Turkish, the primary bottleneck for research in this area is the lack of a large-scale, publicly available corpus. To address this issue, we introduce TurkLink, a new corpus based on Wikipedia and Wikidata. The resource comprises approximately 590,000 articles and 97 million tokens and contains annotations for both entity linking and named entity recognition. A distinguishing feature of TurkLink is its linguistic enrichment, added via an existing NLP tool, which provides part-of-speech tags, morphological analyses, and complete syntactic dependency parse trees for every sentence. Our analysis of the TurkLink corpus shows that it contains a mix of lesser-known and more common entities and reflects the structural and morphological challenges of the Turkish language. As an initial indication of its utility as a training resource, a preliminary experiment shows that a model trained on TurkLink achieves promising results on the entity disambiguation sub-task: a micro F1-score of 0.9414 on our internal test set and over 0.90 on the Turkish subsets of Mewsli-9 [3] and Mewsli-X [15]. TurkLink is released with standard training, validation, and test splits and aims to serve as a foundational resource for future research on Turkish entity linking. The complete corpus and all associated resources are available under the CC BY-SA 4.0 license at https://huggingface.co/datasets/yakdas/turklink-corpus
Akdaş et al. studied this question.
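The micro F1-score cited for the entity disambiguation sub-task can be illustrated with a small sketch. The abstract does not describe the exact evaluation protocol, so the code below assumes the common mention-level definition: each gold mention has one gold entity ID, the system may abstain (represented here as `None`), and counts are pooled over all mentions before computing precision and recall. The function name `micro_f1` and the entity IDs in the example are hypothetical and used only for illustration.

```python
from typing import Optional, Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[Optional[str]]) -> float:
    """Mention-level micro F1 for entity disambiguation.

    Assumption (not taken from the paper): one gold entity ID per mention,
    and a prediction of None means the system abstained on that mention.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # True positives: predicted entity ID matches the gold entity ID.
    true_positives = sum(
        1 for g, p in zip(gold, pred) if p is not None and p == g
    )
    # Precision is computed over mentions the system actually linked.
    num_predicted = sum(1 for p in pred if p is not None)
    # Recall is computed over all gold mentions.
    num_gold = len(gold)

    if true_positives == 0:
        return 0.0

    precision = true_positives / num_predicted
    recall = true_positives / num_gold
    return 2 * precision * recall / (precision + recall)


# Illustrative placeholder IDs in Wikidata's Q-identifier format.
gold_ids = ["Q43", "Q406", "Q3640", "Q43"]
pred_ids = ["Q43", "Q406", None, "Q79"]
score = micro_f1(gold_ids, pred_ids)
print(f"micro F1 = {score:.4f}")
```

In this example two of three predictions are correct and two of four gold mentions are recovered, so precision is 2/3, recall is 1/2, and the score is 4/7. When a system links every mention without abstaining, precision and recall coincide and micro F1 reduces to accuracy.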