March 3, 2026 | Open Access

Balinese Language Classification on Social Media using Multinomial Naive Bayes Method with TF-IDF

Key Points

After Chi-square feature selection and SMOTE oversampling, the classification model achieved an accuracy of 91.78%, outperforming the baseline (64.93%).
Accuracy reached 96.53% on the training data but dropped to 61.45% on the test data.
The dataset comprises 1,314 annotated social media posts and comments, each assigned to one of six politeness levels.
Combining Multinomial Naive Bayes with TF-IDF features, Chi-square feature selection, and SMOTE oversampling improved classification over the baseline.

Abstract

Balinese is a local language that is widely used and spoken by Balinese people, including on social media. However, the nuances of its politeness levels are often lost in informal digital communication, and there is a significant lack of computational models to classify them automatically, especially for low-resource languages like Balinese. The primary objective of this study is to evaluate the performance of the Multinomial Naive Bayes method combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, Chi-square feature selection, and the Synthetic Minority Oversampling Technique (SMOTE) in classifying Balinese language levels. The dataset consists of 1,314 annotated social media posts and comments, primarily sourced from Instagram. A Balinese language expert annotated the text into six levels that represent varying degrees of politeness and formality: alus singgih (polite, used for respecting others), alus sor (polite, used for self-humbling), alus mider (polite, used for both respecting others and self-humbling), alus madia (an intermediate level of politeness), basa andap (casual, commonly used in everyday life), and basa kasar (impolite, often used during arguments or toward animals). In the experiments, the model achieved an accuracy of 96.53% on the training data and 61.45% on the test data. Additionally, hyperparameter tuning showed that the Multinomial Naive Bayes model with 2,720 selected features and SMOTE oversampling achieved an accuracy of 91.78%, substantially outperforming the baseline model without feature selection and oversampling, which obtained only 64.93% accuracy.
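As a rough illustration of the core of this approach, the sketch below weights tokens with TF-IDF and classifies with a Laplace-smoothed Multinomial Naive Bayes rule, using only the Python standard library. It is not the authors' implementation: the smoothed IDF formula, the smoothing value `alpha=1.0`, the function names, and the placeholder tokens (`tok_a`, `tok_x`, and so on) are all assumptions made for illustration, and the Chi-square feature selection and SMOTE steps are left out. Only the two label names come from the abstract.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + k)) + 1.0 for t, k in df.items()}


def tfidf(doc, idf):
    """Raw term count times IDF; terms unseen during training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def train_mnb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class term likelihoods on TF-IDF weights."""
    idf = fit_idf(docs)
    vocab = sorted(idf)
    class_counts = Counter(labels)
    weight = defaultdict(lambda: defaultdict(float))  # class -> term -> summed TF-IDF
    for doc, y in zip(docs, labels):
        for t, w in tfidf(doc, idf).items():
            weight[y][t] += w
    log_prior = {c: math.log(k / len(docs)) for c, k in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(weight[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((weight[c][t] + alpha) / total) for t in vocab}
    return idf, log_prior, log_like


def predict(doc, model):
    """Return the class with the highest log prior plus weighted log likelihood."""
    idf, log_prior, log_like = model
    x = tfidf(doc, idf)
    scores = {
        c: log_prior[c] + sum(w * log_like[c][t] for t, w in x.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus: placeholder tokens, not real Balinese words.
train_docs = [
    ["tok_a", "tok_b"],
    ["tok_a", "tok_c"],
    ["tok_x", "tok_y"],
    ["tok_x", "tok_z"],
]
train_labels = ["alus singgih", "alus singgih", "basa andap", "basa andap"]

model = train_mnb(train_docs, train_labels)
print(predict(["tok_a", "tok_z", "tok_a"], model))
```

In a fuller pipeline, feature selection would restrict the vocabulary before training, and oversampling would be applied to the training split only, so that synthetic samples do not leak into the test data.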

Authors

Putu Widyantara Artanta Wibawa

Cokorda Rai Adi Pramartha

I Gusti Ngurah Anom Cahyadi Putra

Journals

SHILAP Revista de lepidopterología

Institutions

Udayana University
