What question did this study set out to answer?

This study aims to analyze the effects of preprocessing strategies and vocabulary sizes on Myanmar Unigram tokenization.

May 19, 2026 · Open Access

An Empirical Study of Preprocessing and Vocabulary Effects in Myanmar Unigram Tokenization

Key Points

The study evaluated several preprocessing variants: Raw, No-space, Half-mixed, syllable-based segmentation, and ZWSP-based segmentation.
Experiments were run under different vocabulary capacities to analyze token behavior and efficiency.
Metrics included fragmentation, token compactness, and linguistic stability.
Larger vocabularies generally reduced fragmentation and improved token compactness.
ZWSP-based segmentation showed lower fragmentation while preserving stable token boundaries.
Smaller vocabularies occasionally produced linguistically incomplete Myanmar fragments and isolated combining marks, even though decoding integrity remained stable.
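
As a rough illustration of what some of these preprocessing variants could look like, the sketch below derives three inputs from one space-delimited string. The function names and the exact definitions are assumptions made for illustration; the paper's own definitions of Raw, No-space, and ZWSP-based segmentation may differ, and Half-mixed and syllable-based segmentation are not attempted here.

```python
ZWSP = "\u200b"  # Unicode zero-width space

def raw_variant(text: str) -> str:
    """Assumed 'Raw' variant: the text is passed through unchanged."""
    return text

def no_space_variant(text: str) -> str:
    """Assumed 'No-space' variant: all whitespace is removed."""
    return "".join(text.split())

def zwsp_variant(text: str) -> str:
    """Assumed 'ZWSP' variant: whitespace boundaries become zero-width spaces."""
    return ZWSP.join(text.split())

if __name__ == "__main__":
    sample = "\u1019\u103c\u1014\u103a\u1019\u102c \u1005\u102c"  # two space-separated Myanmar units
    for fn in (raw_variant, no_space_variant, zwsp_variant):
        # repr() makes the otherwise invisible ZWSP visible as '\u200b'
        print(fn.__name__, repr(fn(sample)))
```

Each variant would then be fed to the same Unigram trainer, so that differences in the resulting tokens can be attributed to preprocessing rather than to the model.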

Abstract

This paper presents an empirical study of preprocessing strategies and vocabulary sizes for Myanmar Unigram tokenization. The study evaluates multiple preprocessing variants, including Raw, No-space, Half-mixed, syllable-based segmentation, and zero-width space (ZWSP) based segmentation, under different vocabulary capacities. The experiments analyze compression efficiency, fragmentation behavior, fertility, token compactness, and vocabulary utilization. Results show that larger vocabularies generally reduce fragmentation and improve token compactness, while preprocessing strategies strongly influence linguistic stability. In particular, ZWSP-based segmentation showed lower fragmentation while preserving stable token boundaries. The study further shows that smaller vocabularies may occasionally produce linguistically incomplete Myanmar fragments and isolated combining marks, even though decoding integrity remains stable. This work provides one of the first fragmentation-aware empirical analyses of Myanmar Unigram tokenization behavior across multiple preprocessing conditions.
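
Two of the quantities mentioned above can be sketched in a few lines. The definitions here are assumptions, since the abstract does not give formulas: fertility is taken as the average number of tokens per word, and an "isolated combining mark" is taken as a token made up entirely of Unicode mark characters (general category M*) once a leading SentencePiece-style boundary marker is stripped. The function names are hypothetical.

```python
import unicodedata

BOUNDARY = "\u2581"  # word-boundary marker used by SentencePiece-style tokenizers

def fertility(tokens_per_word):
    """Average tokens per word; input is one token list per word (assumed definition)."""
    if not tokens_per_word:
        raise ValueError("need at least one word")
    total = sum(len(tokens) for tokens in tokens_per_word)
    return total / len(tokens_per_word)

def is_isolated_mark(token):
    """True if the token, minus any boundary marker, contains only combining marks."""
    core = token.lstrip(BOUNDARY)
    return bool(core) and all(unicodedata.category(ch).startswith("M") for ch in core)

def isolated_mark_rate(tokens):
    """Fraction of tokens that are isolated combining marks."""
    if not tokens:
        return 0.0
    return sum(is_isolated_mark(t) for t in tokens) / len(tokens)
```

Under these assumptions, a higher fertility indicates more fragmentation, and a non-zero isolated-mark rate flags tokens that cannot stand alone as readable Myanmar text even when the full sequence decodes correctly.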


Cite This Study

Khant Sint Heinn (Sun,) studied this question.

synapsesocial.com/papers/6a0bfde8166b51b53d3793ea https://doi.org/10.5281/zenodo.20257141
