Algerian Arabic (Darija) dominates digital communication in North Africa yet remains severely under-resourced in Natural Language Processing (NLP), which hinders the development of robust applications for social media analysis and e-commerce. This paper addresses that scarcity by presenting a systematic framework for constructing and benchmarking Named Entity Recognition (NER) and Entity Linking (EL) resources tailored to the dialect’s linguistic complexity. We introduce a large-scale, multi-script dataset built with a novel hybrid methodology that combines manual annotation of authentic texts, automated knowledge graph extraction from Wikidata, and rule-based synthetic generation. This approach provides diverse coverage across ten semantic categories while explicitly addressing code-switching and orthographic variation between Arabizi (Latin-script) and Arabic-script text. A transformer-based model (XLM-RoBERTa) fine-tuned on this resource achieves state-of-the-art performance and is significantly more robust than existing baselines. Beyond the dataset, we provide a practical deployment interface and comprehensive evaluation metrics, establishing a foundation for advancing NLP capabilities in North African dialects and for downstream tasks such as content moderation and cultural heritage preservation.
Bouarroudj et al. studied this question.