What question did this study set out to answer?

This research aims to create a large-scale corpus tailored for entity linking in Turkish, addressing existing resource gaps.

March 23, 2026 | Open Access

TurkLink: A Morphologically-Aware and Syntactically Enriched Corpus for Entity Linking in Turkish

Key Points

Developed a corpus from Wikipedia and Wikidata content.
Included annotations for entity linking and named entity recognition.
Applied linguistic enrichment using natural language processing tools.
Analyzed the structural challenges of the Turkish language through corpus data.
TurkLink contains about 590,000 articles and 97 million tokens.
Achieved a micro F1-score of 0.9414 on internal tests for entity disambiguation.
Scored over 0.90 on the Turkish subsets of two external datasets, Mewsli-9 and Mewsli-X.

Abstract

Entity linking is the task of linking textual mentions to entities in a knowledge graph such as Wikidata. For morphologically rich languages like Turkish, the primary bottleneck for research in this area is the lack of a large-scale, publicly available corpus. To address this issue, we introduce TurkLink, a new corpus based on Wikipedia and Wikidata. This resource comprises approximately 590,000 articles and 97 million tokens and contains annotations for both entity linking and named entity recognition. A distinguishing feature of TurkLink is its linguistic enrichment, added via an existing NLP tool, which includes part-of-speech tags, morphological analyses, and complete syntactic dependency parse trees for every sentence. Our analysis of the TurkLink corpus reveals that it comprises a mix of both lesser-known and more common entities and reflects the structural and morphological challenges of the Turkish language. As an initial indication of its utility as a training resource, a preliminary experiment shows that a model trained on TurkLink achieves promising results on the entity disambiguation sub-task, including a micro F1-score of 0.9414 on our internal test set and over 0.90 on the Turkish subsets of Mewsli-9 [3] and Mewsli-X [15]. TurkLink is released with standard training, validation, and test splits, aiming to serve as a foundational resource for future research on Turkish entity linking. The complete corpus and all associated resources are accessible under the CC BY-SA 4.0 license at https://huggingface.co/datasets/yakdas/turklink-corpus
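For readers unfamiliar with the reported metric, the following is a minimal sketch of how a micro-averaged F1-score for entity disambiguation is commonly computed. The abstract does not describe the authors' exact scoring protocol, so the function name `micro_f1`, the use of `None` for a mention left unlinked, and the Wikidata-style identifiers in the example are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def micro_f1(gold: Sequence[Optional[str]], pred: Sequence[Optional[str]]) -> float:
    """Micro-averaged F1 pooled over all mentions.

    Each position is one mention. A value is an entity identifier
    (for example a Wikidata QID), or None when no entity is assigned.
    This convention is an assumption, not taken from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of mentions")

    # A true positive is a mention whose predicted entity matches the gold entity.
    true_positives = sum(1 for g, p in zip(gold, pred) if g is not None and g == p)
    predicted = sum(1 for p in pred if p is not None)  # mentions the model linked
    annotated = sum(1 for g in gold if g is not None)  # mentions with a gold entity

    if predicted == 0 or annotated == 0 or true_positives == 0:
        return 0.0

    precision = true_positives / predicted
    recall = true_positives / annotated
    return 2 * precision * recall / (precision + recall)


# Illustrative identifiers only; they are not drawn from TurkLink.
gold = ["Q43", "Q406", "Q3640", "Q43"]
pred = ["Q43", "Q406", None, "Q79"]
print(round(micro_f1(gold, pred), 4))  # 0.5714
```

In this example two of three predictions are correct (precision 2/3) and two of four gold mentions are recovered (recall 1/2), giving an F1 of 4/7. When a model assigns exactly one entity to every gold mention, precision and recall coincide and micro F1 reduces to accuracy.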

Cite This Study

Akdaş et al. studied this question.

synapsesocial.com/papers/69c0ddb8fddb9876e79c12bf https://doi.org/10.1016/j.procs.2026.01.041
