This paper investigates the use of large language models (LLMs) as evaluators in multidimensional machine translation (MT) assessment, focusing on the English–Indonesian language pair. Building on established evaluation frameworks, we adopt an MQM-aligned rubric that assesses translation quality along morphosyntactic, semantic, and pragmatic dimensions. Three LLM-based translation systems (Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B)) are evaluated using both expert human judgments and an LLM-based evaluator (GPT–5), allowing for a detailed comparison of alignment, bias, and consistency between human and AI-based assessments. In addition, a classroom calibration study is conducted to examine how rubric-guided evaluation supports alignment among novice evaluators. The results indicate that GPT–5 exhibits strong agreement with human evaluators in terms of relative quality ranking, while systematic differences in absolute scoring highlight calibration challenges. Overall, this study provides insights into the role of LLMs as reference-free evaluators for MT and illustrates how multidimensional rubrics can support both research-oriented evaluation and pedagogical applications in a mid-resource language setting.
Shalawati et al. studied this question.
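The abstract separates two things: agreement on relative quality ranking and systematic differences in absolute scoring. The sketch below shows one way to measure each, using Spearman rank correlation for ranking agreement and a mean signed offset for calibration. This is an illustrative assumption, not necessarily the statistics the authors used. The function names and the score lists are hypothetical, and the numbers are invented for demonstration rather than taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_offset(human, llm):
    """Mean signed difference (LLM minus human); positive means the LLM scores higher."""
    return mean(l - h for h, l in zip(human, llm))


# Hypothetical per-segment scores on a shared rubric scale (NOT data from the paper).
human_scores = [2.0, 3.0, 4.0, 3.5, 1.5]
llm_scores = [3.0, 4.0, 5.0, 4.5, 2.5]

rho = spearman(human_scores, llm_scores)
offset = mean_offset(human_scores, llm_scores)
print(f"rank agreement (Spearman): {rho:.2f}")
print(f"mean absolute-score offset: {offset:+.2f}")
```

In this made-up example the two evaluators order the segments identically, so the rank correlation is at its maximum, yet the LLM scores sit one point higher throughout. An evaluator can rank reliably and still need calibration before its absolute scores are comparable to human ones.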