What question did this study set out to answer?

The aim is to create a robust dataset and model for detecting cyber violence in Saudi dialects.

January 14, 2026 | Open Access

SD-CVD Corpus: Towards Robust Detection of Fine-Grained Cyber-Violence Across Saudi Dialects in Online Platforms

Key Points

Developed a balanced corpus for cyber violence detection with 88,687 tweets.
Employed data augmentation to address class imbalance.
Evaluated performance using machine learning and deep learning methods.
Transformers outperformed other models, with AraBERTv02-Twitter achieving 0.882 weighted F1-score.
Among non-transformer baselines, SVM was the most competitive, with a weighted F1-score of 0.853.
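The scores above are weighted F1-scores. As a minimal sketch of how that metric is usually defined (per-class F1 averaged with weights proportional to each class's share of the true labels), assuming the paper follows this common convention, the function below computes it in plain Python. The function name and the toy labels are illustrative only and do not come from the SD-CVD corpus.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of true labels
    return score


# Toy example with made-up labels.
toy_true = ["hate", "hate", "bully", "neutral"]
toy_pred = ["hate", "bully", "bully", "neutral"]
toy_score = weighted_f1(toy_true, toy_pred)
```

Because SD-CVD is described as near-uniformly balanced, the weighted and unweighted (macro) averages would be expected to lie close together on it, although the page reports only the weighted figure.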

Abstract

This paper introduces the Saudi Dialects Cyber Violence Detection (SD-CVD) corpus, a large-scale, class-balanced Saudi-dialect corpus for fine-grained cyber violence detection on online platforms. The dataset contains 88,687 Saudi Arabic tweets annotated using a three-level hierarchical scheme that assigns each tweet to one of 11 mutually exclusive classes, covering benign sentiment (positive, neutral, negative), cyberbullying, and seven hate-speech subtypes (incitement to violence, gender, national, social class, tribal, religious, and regional discrimination). To mitigate the class imbalance common in Arabic cyber violence datasets, data augmentation was applied to achieve a near-uniform class distribution. Annotation quality was ensured through multi-stage review, yielding excellent inter-annotator agreement (Fleiss’ κ > 0.89). We evaluate three modeling paradigms: traditional machine learning with TF–IDF and n-gram features (SVM, logistic regression, random forest), deep learning models trained on fixed sentence embeddings (LSTM, RNN, MLP, CNN), and fine-tuned transformer models (AraBERTv02-Twitter, CAMeLBERT-MSA). Experimental results show that transformers perform best, with AraBERTv02-Twitter achieving the highest weighted F1-score (0.882) followed by CAMeLBERT-MSA (0.869). Among non-transformer baselines, SVM is most competitive (0.853), while CNN performs worst (0.561). Overall, SD-CVD provides a high-quality benchmark and strong baselines to support future research on robust and interpretable Arabic cyber-violence detection.
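The abstract reports inter-annotator agreement as Fleiss' κ. As a rough sketch of what that statistic measures, the function below implements the standard Fleiss' kappa formula for a table of rating counts. The input layout (one row per item, one column per category, every item rated by the same number of annotators) and the toy numbers are assumptions for illustration, not the authors' annotation data or tooling.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same number of annotators.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 3 annotators, 2 categories, full agreement on both items.
perfect = fleiss_kappa([[3, 0], [0, 3]])
```

Values near 1 indicate agreement well beyond chance, which is why a κ above 0.89 is described as excellent.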

Cite this study

Alsayed et al. studied this question.

www.synapsesocial.com/papers/6966f31513bf7a6f02c00a65 — DOI: https://doi.org/10.3390/info17010076

Authors

Abrar Alsayed

Salma Elhag

Sahar Badri

Journals

Information

Institutions

King Abdulaziz University
