We built a multilingual system for finding and labeling names in text (Named Entity Recognition, or NER) and a Retrieval-Augmented Generation (RAG) system for a smart chatbot serving Uzbekistan's telecom services. Importantly, we built every component by hand. We avoided the easier route of libraries such as Hugging Face Transformers or LangChain because we wanted to see exactly how everything works, adapt it to our needs, and be confident it is ready for real use in a setting where Uzbek is not well supported and has few existing resources.
The NER model combines three inputs for each word: FastText embeddings (numerical representations of words), spelling-level features from a character CNN, and the language the word comes from. These pass through a bidirectional LSTM, which reads each sentence in both directions. A custom CRF layer then assigns the final entity labels across the sentence, using the forward algorithm and Viterbi decoding. We first trained the model on the English CoNLL-2003 and Russian WikiANN datasets, then fine-tuned it on data we created ourselves to represent an Uzbek telecom operator. It reaches an F1 score of 0.786 for English, 0.817 for Russian, and a perfect 1.000 for telecom-specific entities such as prices, USSD codes, and services.
The RAG system has a custom tokenizer with a 16,000-unit vocabulary, a BERT-style Transformer with 42.1 million parameters (pre-trained to predict masked words on portions of Wikipedia and BookCorpus), and SimCSE, a contrastive method for refining the model. To retrieve the best information, we combine scores from TF-IDF and FAISS. On 1,015 telecom-domain question-answer pairs, the correct answer appears in the top 5 results 98.5% of the time (MRR@5, NDCG@5, and Recall@5 all equal 0.985).
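The score mixing in the retrieval step can be illustrated with a minimal sketch. It assumes min-max normalization and an equal weighting (`alpha=0.5`); neither is stated above, and the function names, document IDs, and scores are hypothetical. Plain dictionaries stand in for the TF-IDF and FAISS outputs so the example runs without either library.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sparse: Dict[str, float], dense: Dict[str, float],
         alpha: float = 0.5, k: int = 5) -> List[str]:
    """Weighted sum of normalized scores; alpha is a hypothetical mixing weight."""
    s, d = min_max(sparse), min_max(dense)
    combined = {
        doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))[:k]


def recall_at_k(ranked: List[str], relevant: str, k: int = 5) -> float:
    """1.0 if the relevant document is in the top k, else 0.0."""
    return 1.0 if relevant in ranked[:k] else 0.0


def mrr_at_k(ranked: List[str], relevant: str, k: int = 5) -> float:
    """Reciprocal rank of the relevant document within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == relevant:
            return 1.0 / rank
    return 0.0


# Made-up scores standing in for TF-IDF (sparse) and FAISS (dense) results.
sparse_scores = {"tariff_faq": 2.0, "ussd_codes": 0.5, "roaming": 0.0}
dense_scores = {"tariff_faq": 0.6, "ussd_codes": 0.9, "roaming": 0.1}
ranked = fuse(sparse_scores, dense_scores)
```

If each question has exactly one relevant passage (an assumption here), MRR@5 can equal Recall@5 only when every retrieved answer sits at rank 1, which is one way to read the identical 0.985 values.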
For generation, we use a quantized version of Qwen2.5-7B and track how confident the system is, so that it gives a safe fallback response when it is unsure. At present, both components learn from labeled examples and from contrastive comparison of similar texts. They are, however, the essential 'understanding' (NER) and 'memory' (RAG) for a future chatbot that will learn from its own interactions through reinforcement learning. We describe how the systems are built, how they are trained, and what results they achieve, and then discuss how reinforcement learning from human feedback and actor-critic optimization could improve telecom conversations. All the code, data preparation steps, and evaluation tools are publicly available.
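The confidence-based safe response could work roughly as in the sketch below. This is an illustration under stated assumptions, not the system's actual logic: the `Draft` fields, the rule of taking the weaker of two scores, the 0.6 threshold, and the fallback wording are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical fallback wording; the real system's safe response is not given.
FALLBACK = "I'm not sure about that. Please contact the operator's support line."


@dataclass
class Draft:
    text: str                # answer produced by the language model
    retrieval_score: float   # assumed to lie in [0, 1]
    generation_score: float  # assumed to lie in [0, 1]


def respond(draft: Draft, threshold: float = 0.6) -> str:
    """Return the drafted answer only if the weaker signal clears the threshold."""
    confidence = min(draft.retrieval_score, draft.generation_score)
    return draft.text if confidence >= threshold else FALLBACK


confident = respond(Draft("Dial the USSD code shown in your tariff plan.", 0.9, 0.8))
unsure = respond(Draft("Possibly some other code.", 0.9, 0.3))
```

Taking the minimum of the two scores is a conservative choice: a fluent answer built on weak retrieval still triggers the fallback.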
Dostonbek Abdurakhmonov
Dostonbek Abdurakhmonov (Mon,) studied this question.
www.synapsesocial.com/papers/69d894ad6c1944d70ce0599d — DOI: https://doi.org/10.5281/zenodo.19451409