This paper presents an empirical study of preprocessing strategies and vocabulary sizes for Unigram tokenization of Myanmar text. The study evaluates multiple preprocessing variants, including Raw, No-space, Half-mixed, syllable-based segmentation, and ZWSP-based segmentation, at several vocabulary sizes. The experiments measure compression efficiency, fragmentation, fertility, token compactness, and vocabulary utilization. Results show that larger vocabularies generally reduce fragmentation and improve token compactness, while the choice of preprocessing strategy strongly influences linguistic stability. In particular, ZWSP-based segmentation yields lower fragmentation while preserving stable token boundaries. The study further shows that smaller vocabularies occasionally produce linguistically incomplete Myanmar fragments and isolated combining marks, even though decoding integrity remains stable. This work provides one of the first fragmentation-aware empirical analyses of Myanmar Unigram tokenization across multiple preprocessing conditions.
Khant Sint Heinn (Sun) studied this question.
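Two of the quantities named in the abstract, fertility and the occurrence of isolated combining marks, can be illustrated with a minimal Python sketch. The definitions here are assumptions made for illustration and may differ from those used in the study: fertility is taken as the average number of tokens per ZWSP-delimited unit, and a token counts as an isolated combining mark when every character in it has a Unicode general category beginning with `M`. The function names and the character-level stand-in tokenizer are hypothetical; a trained Unigram model would take its place in practice.

```python
import unicodedata

ZWSP = "\u200b"  # zero-width space, used here as the segment delimiter


def split_on_zwsp(text):
    """Split text into non-empty ZWSP-delimited units."""
    return [unit for unit in text.split(ZWSP) if unit]


def fertility(units, tokenize):
    """Average tokens per unit (assumed definition, see lead-in)."""
    if not units:
        raise ValueError("need at least one unit")
    total_tokens = sum(len(tokenize(unit)) for unit in units)
    return total_tokens / len(units)


def is_isolated_mark(token):
    """True if the token consists only of Unicode combining marks (Mn, Mc, Me)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("M") for ch in token
    )


def isolated_mark_rate(tokens):
    """Share of tokens that are made up solely of combining marks."""
    if not tokens:
        return 0.0
    return sum(is_isolated_mark(t) for t in tokens) / len(tokens)


# Hypothetical stand-in for a very small vocabulary: one token per code point.
char_tokenize = list

# Two ZWSP-separated units, written as escapes to avoid encoding ambiguity.
sample = "\u1019\u103c\u1014\u103a\u1019\u102c" + ZWSP + "\u1005\u102c"

units = split_on_zwsp(sample)
tokens = [tok for unit in units for tok in char_tokenize(unit)]

print("units:", len(units))
print("fertility:", fertility(units, char_tokenize))
print("isolated mark rate:", isolated_mark_rate(tokens))
```

The character-level tokenizer is the extreme case of a small vocabulary: every medial, vowel sign, and asat becomes its own token, which is the kind of fragment the abstract describes. Passing a different `tokenize` callable allows the same two measures to be compared across vocabulary sizes and preprocessing variants.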