March 3, 2026 | Open Access

The importance of morphology-aware subword tokenization for NLP tasks in Slovak language modeling

Key Points

Incorporating morphology-aware tokenization improves semantic understanding and model performance.
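SK_Morph_BLM achieved statistically significant gains of up to +12.49% in semantic similarity tasks.
Two RoBERTa-based models were pretrained on the same Slovak corpus, one using the morphology-aware SKMT tokenizer and one using standard BPE, and then compared on four downstream tasks.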
Findings suggest morphology-aware approaches enhance robustness in language models designed for low-resource languages.

Abstract

To effectively train large language models (LLMs) for morphologically rich and low-resource languages such as Slovak, high-quality tokenization is essential. Traditional approaches like Byte-Pair Encoding (BPE) overlook linguistic structure, often fragmenting root morphemes and causing semantic loss. This study examines whether morphology-aware tokenization can improve model performance across various NLP tasks. We introduce the SlovaK Morphological Tokenizer (SKMT), which incorporates root morpheme information into the tokenization process, and compare it against a standard BPE tokenizer. Both tokenizers were used to preprocess a Slovak corpus for pretraining two RoBERTa-based models (SK_Morph_BLM and SK_BPE_BLM), which were then fine-tuned on token classification, sequence classification, question answering, and semantic textual similarity tasks. Experimental results show that SK_Morph_BLM achieved slightly higher performance overall, with statistically significant gains in semantic similarity (up to +12.49%) and question answering (up to +3.23%). Complementary quantitative and qualitative analyses further revealed that morphology-aware tokenization is most effective for shorter, morphologically regular texts and improves grammatical and semantic consistency. These findings demonstrate that incorporating morphological information into tokenization can enhance model robustness and semantic understanding for morphologically rich languages.
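
As a rough illustration of the contrast drawn in the abstract, the Python sketch below compares a structure-blind split with a split that keeps a known root morpheme intact. It is a toy under stated assumptions: the root lexicon `TOY_ROOTS` and the functions `root_aware_split`, `fixed_width_split`, and `keeps_root` are hypothetical names invented for this example. The fixed-width split is only a stand-in for segmentation that ignores morphology; it is not BPE, and the root-aware split is not the SKMT algorithm.

```python
# Toy illustration only: the root lexicon and both splitting functions are
# hypothetical. They are not the SKMT algorithm or a real BPE implementation.

TOY_ROOTS = {"uč", "prac"}  # assumed root morphemes for the demo words


def root_aware_split(word, roots=TOY_ROOTS):
    """Split a word into prefix, root, and suffix, keeping a known root whole."""
    # Try longer roots first so the most specific match wins.
    for root in sorted(roots, key=len, reverse=True):
        start = word.find(root)
        if start != -1:
            end = start + len(root)
            pieces = [word[:start], root, word[end:]]
            return [p for p in pieces if p]  # drop empty prefix/suffix
    return [word]  # no known root found: leave the word unsplit


def fixed_width_split(word, width=3):
    """Structure-blind stand-in: cut the word every `width` characters."""
    return [word[i:i + width] for i in range(0, len(word), width)]


def keeps_root(pieces, roots=TOY_ROOTS):
    """Return True if at least one piece is exactly a known root."""
    return any(p in roots for p in pieces)


if __name__ == "__main__":
    for word in ["nepracuje", "pracovník", "učiteľ"]:
        print(word, root_aware_split(word), fixed_width_split(word))
```

For the word "nepracuje", the root-aware split yields ne / prac / uje, while the fixed-width split yields nep / rac / uje, so the root "prac" no longer appears as a single token. This is the kind of root fragmentation the abstract attributes to tokenizers that overlook linguistic structure.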
