January 22, 2026 · Open Access

Beyond BLEU: GPT–5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation

Abstract

This paper investigates the use of large language models (LLMs) as evaluators in multidimensional machine translation (MT) assessment, focusing on the English–Indonesian language pair. Building on established evaluation frameworks, we adopt an MQM-aligned rubric that assesses translation quality along morphosyntactic, semantic, and pragmatic dimensions. Three LLM-based translation systems, Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B), are evaluated using both expert human judgments and an LLM-based evaluator (GPT–5), allowing for a detailed comparison of alignment, bias, and consistency between human and AI-based assessments. In addition, a classroom calibration study examines how rubric-guided evaluation supports alignment among novice evaluators. The results indicate that GPT–5 agrees strongly with human evaluators on relative quality ranking, while systematic differences in absolute scoring highlight calibration challenges. Overall, this study provides insights into the role of LLMs as reference-free evaluators for MT and illustrates how multidimensional rubrics can support both research-oriented evaluation and pedagogical applications in a mid-resource language setting.

Cite This Study

Shalawati et al. (2026) studied this question.

synapsesocial.com/papers/6a0e96dda7f61df77cc8669c https://doi.org/10.3390/digital6010008
