March 3, 2026 | Open Access

Development of hybrid approach for named entity recognition in Uzbek language text

Key Points

The hybrid algorithm significantly increases named entity recognition accuracy across varied domains.
Its precision and recall exceed those of traditional methods.
Evaluation used a new annotated corpus of more than three thousand sentences spanning multiple text types.
The approach addresses challenges specific to Uzbek and extends recognition methods for low-resource languages.

Abstract

This article presents a hybrid named entity recognition (NER) algorithm for Uzbek that combines rule-based modules (transliteration, dialect normalization, morphological analysis) with modern neural network models. The study is motivated by Uzbek’s agglutinative morphology, dialect diversity, and lack of specialized resources, which hinder the direct application of NER methods developed for English and other high-resource languages. As part of the work, an annotated corpus of more than three thousand Uzbek sentences was compiled, covering legal documents, scientific articles, news materials, and informal social media texts. The corpus is annotated according to the BIOES scheme, taking into account the morphological and lexical features of Uzbek. The rule-based algorithms are integrated into a single post-processing system that complements the neural network models. Experiments show that the hybrid approach significantly improves the precision and recall of named entity recognition across thematic domains. In practical terms, the proposed system can serve as a basis for automatic processing of Uzbek texts in information search and extraction, dialect normalization, annotation of large text collections, and digitalization of document workflows. In theoretical terms, the work extends approaches to named entity recognition for low-resource languages by offering methods that account for morphosyntactic and dialectal features.
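To illustrate the BIOES scheme named in the abstract, the following is a minimal sketch of how entity spans could be turned into per-token tags (B = beginning, I = inside, E = end, S = single-token entity, O = outside). The function name `spans_to_bioes`, the labels `PER` and `LOC`, and the example sentence are assumptions made for illustration. The abstract does not specify the corpus tag set or how suffixed tokens such as `Hirotda` (Herat plus the locative suffix -da) are segmented.

```python
from typing import List, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, label) into BIOES tags.

    Hypothetical helper for illustration; not the authors' annotation tool.
    Spans are assumed not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"invalid span: ({start}, {end}, {label!r})")
        if end - start == 1:
            # A one-token entity gets the S- (single) prefix.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Illustrative sentence: "Alisher Navoiy was born in Herat."
tokens = ["Alisher", "Navoiy", "Hirotda", "tug'ilgan"]
# Assumed annotation: tokens 0-1 form a person name, token 2 is a location.
spans = [(0, 2, "PER"), (2, 3, "LOC")]

for token, tag in zip(tokens, spans_to_bioes(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

In this sketch the whole suffixed token `Hirotda` is tagged as the location. Whether case suffixes belong inside or outside an entity span is an annotation decision that the corpus guidelines, not this example, would settle.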
