April 12, 2026 | Open Access

Automatic Categorization of Textual Messages Using Pre-Trained Language Models

Key Points

The study aims to make automated categorization of text requests more efficient using pre-trained language models.
It develops a universal method for categorizing text requests based on a pre-trained Sentence-BERT (SBERT) model.
It investigates the low efficiency of pre-trained models on texts from specialized domains and applies contrastive retraining on domain-specific data.
It systematically compares four approaches: a baseline model without retraining, unsupervised contrastive learning, supervised retraining with CosineSimilarityLoss, and retraining with Multiple Negatives Ranking Loss (MNRL).
Experiments use a dataset of 6,500 Russian-language queries, 1,119 of which are labeled into 16 categories; clustering is assessed with internal and external metrics.
MNRL performed best, improving on the baseline model by 123% in Purity, 233% in NMI, and 658% in ARI.
A mechanism for assessing classification confidence from per-query Silhouette Scores is proposed.
The approach can be adapted to request processing in other domains where labeled data is limited.
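
The idea behind the Multiple Negatives Ranking Loss named in the key points can be illustrated with a minimal sketch. This is not the authors' training code: it uses only the Python standard library, the toy 2-D vectors stand in for SBERT embeddings, and the function name `mnrl_loss` and the `scale` value of 20 are assumptions made for illustration. Each anchor is paired with one positive text, and every other positive in the batch serves as a negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine scores.

    For anchor i the correct "class" is positives[i]; all other
    positives in the batch act as in-batch negatives.
    `scale` is an assumed temperature-like multiplier.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp of the scores.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): two anchors and their positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = mnrl_loss(anchors, aligned)
loss_swapped = mnrl_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another item in the batch, which is the behavior that pulls same-category requests together during retraining. In practice a library implementation such as `MultipleNegativesRankingLoss` from sentence-transformers would normally be used instead of a hand-written loop.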

Abstract

Help desks at many organizations receive hundreds or thousands of user requests every day. Sorting these requests manually takes considerable time and often leads to routing errors, which reduces the speed and quality of customer service. Automating request categorization is therefore a pressing problem for organizations of all kinds, including IT support teams, medical institutions, banks, government agencies, and online stores. This paper proposes a universal method for automatically sorting text requests into categories using a pre-trained Sentence-BERT (SBERT) neural network model. It investigates the low efficiency of pre-trained language models on texts from highly specialized subject areas. To address this issue, the model was contrastively retrained on domain-specific data, which significantly improved the quality of the vector text representations. Four approaches were compared systematically: a baseline model without retraining, unsupervised contrastive learning on unlabeled data, supervised retraining with the CosineSimilarityLoss criterion, and retraining with the Multiple Negatives Ranking Loss (MNRL) criterion. Experiments were conducted on a dataset of 6,500 Russian-language queries, of which 1,119 were labeled into 16 categories. Clustering quality was assessed with both internal metrics (Silhouette Score, Davies-Bouldin Index) and external metrics (Purity, NMI, ARI). The MNRL method gave the best results: relative to the baseline model, Purity increased by 123%, NMI by 233%, and ARI by 658%. A mechanism for assessing classification confidence based on an individual Silhouette Score for each query is proposed; it allows uncertain cases to be redirected for manual processing. The developed approach is universal and can be adapted to automate request processing in any subject area when 10-20% of the data is labeled.
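
The confidence mechanism described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not state the distance measure or the cut-off, so Euclidean distance, the threshold of 0.2, the function names `per_query_silhouette` and `split_by_confidence`, and the toy 2-D points (standing in for query embeddings) are all assumptions.

```python
import math

def per_query_silhouette(points, labels):
    """Silhouette score s(i) = (b - a) / max(a, b) for every point.

    a: mean distance to the other points in the same cluster.
    b: smallest mean distance to the points of any other cluster.
    Points alone in their cluster get a score of 0.
    """
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        distances = {}  # cluster label -> distances from p to its members
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if i != j:
                distances.setdefault(other_label, []).append(math.dist(p, q))
        own = distances.pop(label, [])
        if not own or not distances:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in distances.values())
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return scores

def split_by_confidence(scores, threshold=0.2):
    """Return (confident, uncertain) index lists; threshold is hypothetical."""
    confident = [i for i, s in enumerate(scores) if s >= threshold]
    uncertain = [i for i, s in enumerate(scores) if s < threshold]
    return confident, uncertain

# Toy data (hypothetical): the last point sits between the two clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
labels = [0, 0, 0, 1, 1, 1]

scores = per_query_silhouette(points, labels)
confident, uncertain = split_by_confidence(scores)
print(uncertain)  # indices that would be sent for manual processing
```

Queries whose score falls below the threshold lie near a cluster boundary, so their assigned category is less reliable and they are the natural candidates for manual review. For real workloads, `silhouette_samples` from scikit-learn computes the same per-sample scores far more efficiently.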

Authors

A. N. Isenbaev

I. M. Yannikov

Journals

Intellekt Sist Proizv

Institutions

Izhevsk State Technical University
