What question did this study set out to answer?

The aim is to develop a multilingual chatbot system for the Uzbek telecom sector that effectively recognizes names and retrieves relevant information.

April 10, 2026 | Open Access

From Scratch: Multilingual BiLSTM-CRF NER and Hybrid RAG Systems for an Intelligent Uzbek Telecom Operator Chatbot – Foundations for Reinforcement Learning Agents

Key points

Built a multilingual NER system from scratch using BiLSTM and CRF for name recognition.
Developed a Retrieval-Augmented Generation system with a custom Transformer architecture and scoring methods.
Trained on various datasets, including CoNLL-2003 and WikiANN, then fine-tuned with telecom-specific data.
Achieved F1 scores of 0.786 for English, 0.817 for Russian, and 1.000 for telecom-specific terms.
The RAG system returns the correct answer within its top 5 results for 98.5% of telecom-related queries.
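
The CRF decoding step behind the BiLSTM-CRF name recognizer listed above can be illustrated with a small Viterbi decoder. This is a minimal sketch, not the authors' code: the tag set, the score values, and the names `viterbi_decode`, `EMISSIONS`, `TRANSITIONS`, `START`, and `END` are all assumptions made up for illustration. In a real model the emission scores would come from the BiLSTM and the transition scores would be learned.

```python
from typing import List, Sequence, Tuple

TAGS = ["O", "B-PRICE", "I-PRICE"]  # hypothetical tag set
NEG = -1e4  # large penalty used to forbid invalid transitions


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    start: Sequence[float],
    end: Sequence[float],
) -> Tuple[float, List[int]]:
    """Return the best score and tag path for a linear-chain CRF."""
    n_tags = len(start)
    # score[j]: best score of any path that ends in tag j at the current token
    score = [start[j] + emissions[0][j] for j in range(n_tags)]
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag i for arriving at tag j
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    final = [score[j] + end[j] for j in range(n_tags)]
    last = max(range(n_tags), key=lambda j: final[j])
    # Follow the back-pointers from the last token to the first
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return final[last], path


# Toy scores for three tokens, e.g. "tarif 50000 so'm" (made-up values)
EMISSIONS = [[2.0, 0.5, 0.0], [0.5, 2.0, 1.0], [1.0, 0.2, 1.5]]
TRANSITIONS = [  # TRANSITIONS[i][j]: score of moving from tag i to tag j
    [0.5, 0.0, NEG],   # O cannot be followed by I-PRICE
    [0.0, -1.0, 1.0],
    [0.0, -1.0, 0.5],
]
START = [0.0, 0.0, NEG]  # a sentence cannot start with I-PRICE
END = [0.0, 0.0, 0.0]

best_score, best_path = viterbi_decode(EMISSIONS, TRANSITIONS, START, END)
print([TAGS[t] for t in best_path])  # ['O', 'B-PRICE', 'I-PRICE']
```

The forward algorithm used during training has the same loop structure, but it replaces the `max` over previous tags with a log-sum-exp so that all paths contribute to the normalizer.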

Summary

We built a multilingual system for finding and understanding names (properly called 'Named Entity Recognition', or NER) and a 'Retrieval-Augmented Generation' (RAG) system for a smart chatbot serving Uzbekistan's telecom services. Importantly, we built all the pieces by hand, so to speak. We avoided the easier route of libraries such as Hugging Face Transformers or LangChain because we wanted to see exactly how everything works, adapt it to our needs, and be sure it is ready for real use in a setting where the Uzbek language does not have many existing resources. The name-finder combines three inputs: FastText, which gives each word a numerical representation; features of how each word is spelled (from a 'Character CNN'); and the language each word comes from. It passes all of this through a bidirectional LSTM (a way of processing a sentence in both directions). A custom CRF layer then labels the names within the sentence, using the forward algorithm and Viterbi decoding. We first trained the model on the English CoNLL-2003 and Russian WikiANN collections of annotated text, then fine-tuned it on data we created ourselves to represent an Uzbek telecom operator. It reaches an F1 score of 0.786 for English, 0.817 for Russian, and a perfect 1.000 for correctly identifying telecom-specific terms such as prices, USSD codes, and services. The Retrieval-Augmented Generation system has a custom tokenizer with 16,000 units, a BERT-style Transformer with 42.1 million parameters (which we trained to predict masked words on portions of Wikipedia and BookCorpus), and SimCSE, a contrastive method for refining the model. To retrieve the best information, we combine the scores of TF-IDF and FAISS, and the correct answer appears in the top 5 results 98.5% of the time (MRR@5, NDCG@5, and Recall@5 all equal 0.985) on 1,015 question-answer pairs from the telecom domain.
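
One way to picture the combined TF-IDF and dense retrieval scoring described above is the sketch below. The summary does not state the fusion formula, so the min-max normalization and the weight `alpha` are assumptions. The documents, the 2-d vectors, and the function names are hypothetical, and a brute-force cosine search stands in for the FAISS index so the example needs only the standard library.

```python
import math
from collections import Counter

# Toy knowledge base; texts and vectors are made up for illustration.
DOCS = [
    "check balance with ussd code",
    "monthly tariff price and internet package",
    "activate roaming service abroad",
]
DOC_VECS = [[0.9, 0.1], [0.2, 0.9], [0.6, 0.6]]  # stand-ins for learned embeddings


def build_tfidf(texts):
    """Return per-document TF-IDF dicts and the smoothed IDF table."""
    tokenized = [t.split() for t in texts]
    n_docs = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in df.items()}
    vecs = [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf


def cosine_sparse(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cosine_dense(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(query, query_vec, alpha=0.5, k=5):
    """Rank documents by alpha * sparse + (1 - alpha) * dense (assumed formula)."""
    doc_tfidf, idf = build_tfidf(DOCS)
    # Query words that never occur in the documents are ignored
    q_tfidf = {w: tf * idf[w] for w, tf in Counter(query.split()).items() if w in idf}
    sparse = min_max([cosine_sparse(q_tfidf, d) for d in doc_tfidf])
    dense = min_max([cosine_dense(query_vec, v) for v in DOC_VECS])
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    order = sorted(range(len(DOCS)), key=lambda i: fused[i], reverse=True)
    return [(i, fused[i]) for i in order[:k]]


ranking = hybrid_rank("ussd code for balance", [1.0, 0.0])
print(DOCS[ranking[0][0]])  # check balance with ussd code
```

Normalizing before mixing matters because TF-IDF cosine values and embedding similarities usually live on different scales; without it one signal can dominate regardless of `alpha`.
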
Using a 'quantized' version of Qwen2.5-7B and keeping track of how confident the system is, we make sure it gives a safe response when it is unsure. Right now, both components learn from labeled examples and by comparing similar texts. However, they provide the essential 'understanding' (NER) and 'memory' (RAG) for a future chatbot that will learn from its own interactions using reinforcement learning. We explain how the systems are built, how they are trained, and what the results are, and then discuss how reinforcement learning with human feedback and actor-critic optimization can be used to improve telecom conversations. All the code, the data preparation steps, and the evaluation tools are available to anyone.
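
The safe-response behavior can be sketched as a simple gate in front of the generator. The summary does not say how confidence is measured, so this sketch assumes the top retrieval score is used, with a hypothetical threshold of 0.6. The fallback message, `answer_with_fallback`, and `echo_generator` (a stub standing in for the quantized language model) are likewise illustrative names, not the authors' API.

```python
from typing import Callable, List, Tuple

# Hypothetical safe reply used when the system is not confident enough
FALLBACK_MESSAGE = "I am not sure about that. Let me connect you to an operator."


def answer_with_fallback(
    question: str,
    retrieved: List[Tuple[str, float]],
    generate: Callable[[str, str], str],
    threshold: float = 0.6,
) -> str:
    """Generate an answer only when the best retrieval score clears the threshold.

    `retrieved` holds (passage, score) pairs sorted by descending score.
    """
    if not retrieved or retrieved[0][1] < threshold:
        return FALLBACK_MESSAGE
    best_passage, _ = retrieved[0]
    return generate(question, best_passage)


def echo_generator(question: str, context: str) -> str:
    """Stub standing in for the language model call."""
    return f"Based on our records: {context}"


confident = answer_with_fallback(
    "How do I check my balance?",
    [("Dial the balance USSD code.", 0.92)],
    echo_generator,
)
unsure = answer_with_fallback(
    "What is the weather today?",
    [("Roaming tariffs apply abroad.", 0.21)],
    echo_generator,
)
print(confident)
print(unsure)
```

Keeping the gate outside the generator means the threshold can be tuned on held-out questions without touching the model itself.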

Authors

Dostonbek Abdurakhmonov
