April 10, 2026 · Open Access

LLM-Assisted Weak Supervision for Low-Resource Kazakh Sequence Labeling: Synthetic Annotation and CRF-Refined NER/POS Models

Key Points

The central aim is to enhance Kazakh sequence labeling despite limited annotated resources and high data sparsity.
Developed a weak supervision framework utilizing a large language model for synthetic annotation generation.
Filtered synthetic annotations using confidence-based criteria and combined them with a smaller manually verified subset.
Trained Transformer-based sequence taggers integrated with Conditional Random Field (CRF) decoding.
Constructed a reproducible pipeline integrating corpus construction, weak-label generation, quality filtering, word-to-subword alignment, and CRF-refined structured prediction.
Achieved strong performance in part-of-speech (POS) tagging and named entity recognition (NER) for Kazakh.
The training design enabled efficient convergence, with diminishing returns beyond moderate epoch budgets.
Residual errors were concentrated in rare tokens, morphologically complex long words, longer sentences, and the ORG entity class.
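
The filtering step listed above can be pictured with a minimal sketch. The page does not state the actual confidence criterion, so the per-token confidence scores, the "every token must clear the threshold" rule, the 0.9 cutoff, and all names below (`WeakSentence`, `keep_sentence`, `build_training_set`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WeakSentence:
    tokens: List[str]
    labels: List[str]
    confidences: List[float]  # hypothetical per-token scores in [0, 1]


def keep_sentence(sent: WeakSentence, min_token_conf: float = 0.9) -> bool:
    """Keep a weakly labeled sentence only if every token clears the threshold.

    The threshold and the min-aggregation are illustrative assumptions,
    not the criterion used in the paper.
    """
    if not (len(sent.tokens) == len(sent.labels) == len(sent.confidences)):
        return False  # malformed annotation: lengths disagree
    return bool(sent.tokens) and min(sent.confidences) >= min_token_conf


def build_training_set(
    weak: List[WeakSentence],
    gold: List[WeakSentence],
    min_token_conf: float = 0.9,
) -> List[WeakSentence]:
    """Combine the manually verified subset with the filtered synthetic data."""
    filtered = [s for s in weak if keep_sentence(s, min_token_conf)]
    return gold + filtered


weak_a = WeakSentence(["Алматы", "қаласы"], ["B-LOC", "O"], [0.98, 0.95])
weak_b = WeakSentence(["Абай", "келді"], ["B-PER", "O"], [0.97, 0.40])
gold = [WeakSentence(["Астана"], ["B-LOC"], [1.0])]
train = build_training_set([weak_a, weak_b], gold)
```

In this toy run `weak_b` is dropped because one of its token scores falls below the cutoff. A sentence-level rule like this trades coverage for cleaner labels; a token-level mask would be an equally plausible alternative.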

Abstract

Kazakh sequence labeling is constrained by limited annotated resources, while its agglutinative morphology and productive suffixation increase data sparsity and exacerbate label inconsistency in part-of-speech (POS) tagging and named entity recognition (NER). This paper proposes an LLM-assisted weak supervision framework in which a large language model generates synthetic token-level annotations that are subsequently filtered using confidence-based criteria and combined with a smaller manually verified subset to train Transformer-based sequence taggers with Conditional Random Field (CRF) decoding. The pipeline unifies corpus construction, weak-label generation, quality filtering, word-to-subword alignment, and CRF-refined structured prediction into a reproducible workflow. Experimental results show that contextual encoders and structured decoding provide strong performance for Kazakh POS and NER, while the proposed training design enables efficient convergence with diminishing returns beyond moderate epoch budgets. Error-slice analysis indicates that residual errors are concentrated in rare tokens, morphologically complex long words, longer sentences, and the ORG entity class. Overall, the findings support the use of LLM-assisted weak supervision as a scalable strategy for low-resource Kazakh sequence labeling when synthetic labels are controlled through filtering and refined by structured decoding.
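
The word-to-subword alignment step mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `toy_split` is a hypothetical stand-in for a real subword tokenizer, the "label the first subword, skip the rest" convention is one common choice among several, and the `-100` skip marker is borrowed from the default ignore index of common cross-entropy implementations (a CRF layer would typically use a mask instead).

```python
from typing import Callable, List, Tuple

IGNORE = -100  # assumed "skip this position" marker for continuation pieces


def toy_split(word: str, size: int = 4) -> List[str]:
    """Hypothetical subword splitter: fixed-size chunks, '##' marks continuations."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return chunks[:1] + ["##" + c for c in chunks[1:]]


def align_labels(
    words: List[str],
    label_ids: List[int],
    split_word: Callable[[str], List[str]] = toy_split,
) -> Tuple[List[str], List[int]]:
    """Expand word-level labels to subword level.

    The first piece of each word carries the word's label; every
    continuation piece is marked IGNORE so it does not contribute
    to training or evaluation.
    """
    if len(words) != len(label_ids):
        raise ValueError("words and label_ids must have the same length")
    subwords: List[str] = []
    aligned: List[int] = []
    for word, label in zip(words, label_ids):
        pieces = split_word(word)
        subwords.extend(pieces)
        aligned.extend([label] + [IGNORE] * (len(pieces) - 1))
    return subwords, aligned


# A long suffixed word followed by a shorter one; label ids are arbitrary.
subwords, aligned = align_labels(["мектептерімізде", "оқимыз"], [1, 2])
```

Long, heavily suffixed Kazakh words split into many pieces, which is one reason alignment needs to be explicit: without it, a single word-level tag would be counted several times or attached to the wrong position.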

Authors

Aigerim Aitim

Journals

Applied Sciences

Institutions

International Information Technologies University
