To effectively train large language models (LLMs) for morphologically rich and low-resource languages such as Slovak, high-quality tokenization is essential. Traditional approaches like Byte-Pair Encoding (BPE) overlook linguistic structure, often fragmenting root morphemes and causing semantic loss. This study examines whether morphology-aware tokenization can improve model performance across various NLP tasks. We introduce the SlovaK Morphological Tokenizer (SKMT), which incorporates root morpheme information into the tokenization process, and compare it against a standard BPE tokenizer. Both tokenizers were used to preprocess a Slovak corpus for pretraining two RoBERTa-based models (SKMorphBLM and SKBPEBLM), which were then fine-tuned on token classification, sequence classification, question answering, and semantic textual similarity tasks. Experimental results show that SKMorphBLM achieved slightly higher performance overall, with statistically significant gains in semantic similarity (up to +12.49%) and question answering (up to +3.23%). Complementary quantitative and qualitative analyses further revealed that morphology-aware tokenization is most effective for shorter, morphologically regular texts and improves grammatical and semantic consistency. These findings demonstrate that incorporating morphological information into tokenization can enhance model robustness and semantic understanding for morphologically rich languages.
Držík et al. studied this question.
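To make the contrast between root-fragmenting and root-preserving segmentation concrete, the following is a minimal illustrative sketch. It is not the SKMT algorithm and not a real BPE implementation: greedy longest-match segmentation stands in for a purely statistical subword tokenizer, and the toy vocabulary, the root lexicon, and both function names are hypothetical, chosen only to show how a root morpheme can be split or kept intact.

```python
# Toy illustration only: the vocabulary, root lexicon, and function names
# are hypothetical and do not come from the SKMT implementation.

TOY_VOCAB = {"ry", "bár", "ár"}   # assumed subword vocabulary
TOY_ROOTS = {"ryb"}               # assumed root lexicon (ryb-, as in "ryba", fish)


def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match segmentation over a fixed subword vocabulary.

    Used here as a simple stand-in for a statistical subword tokenizer;
    characters that match nothing are emitted on their own.
    """
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def root_aware_tokenize(word, roots, vocab):
    """Keep the longest known root as one token, then segment the rest."""
    best = None  # (position, root) of the longest root found in the word
    for root in roots:
        pos = word.find(root)
        if pos != -1 and (best is None or len(root) > len(best[1])):
            best = (pos, root)
    if best is None:
        return greedy_subword_tokenize(word, vocab)
    pos, root = best
    prefix, suffix = word[:pos], word[pos + len(root):]
    return (
        greedy_subword_tokenize(prefix, vocab)
        + [root]
        + greedy_subword_tokenize(suffix, vocab)
    )


if __name__ == "__main__":
    word = "rybár"  # "fisherman": root ryb- plus suffix -ár
    print(greedy_subword_tokenize(word, TOY_VOCAB))
    print(root_aware_tokenize(word, TOY_ROOTS, TOY_VOCAB))
```

With this toy vocabulary, the root-agnostic segmentation cuts through the root (`ry` + `bár`), while the root-aware variant keeps `ryb` as a single token followed by `ár`. This is the kind of difference the study evaluates at scale.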