September 5, 2025, Open Access

GATmath and GATLc: Comprehensive benchmarks for evaluating Arabic large language models

Key Points

Arabic large language models score low on the new benchmarks: the best-performing model reaches only 66.9% accuracy.
GATmath (7k questions) and GATLc (9k questions) are large-scale, multitask benchmarks for evaluating Arabic LLMs, derived from the General Aptitude Test (GAT).
The benchmarks cover reasoning, semantic analysis, language comprehension, and mathematical problem-solving.
Results across seven prominent LLMs indicate substantial room for improvement in Arabic language model development.

Abstract

The evolution of Large Language Models (LLMs) has significantly advanced artificial intelligence, driving innovation across many applications. Their continued development relies on a deep understanding of their capabilities and limitations, which is achieved primarily through rigorous evaluation on diverse datasets. However, assessing state-of-the-art models in Arabic remains a formidable challenge because comprehensive benchmarks are scarce. The absence of robust evaluation tools hinders the progress and refinement of Arabic LLMs and limits their potential applications and effectiveness in real-world scenarios. In response, we introduce GATmath (7k questions) and GATLc (9k questions), two large-scale, multitask Arabic benchmarks for reasoning and language understanding. Derived from the General Aptitude Test (GAT) examination, each dataset covers multiple categories, demanding skills in reasoning, semantic analysis, language comprehension, and mathematical problem-solving. To the best of our knowledge, our datasets are the first comprehensive, large-scale reasoning datasets specifically tailored to the Arabic language. We conducted a comprehensive evaluation and analysis of seven prominent LLMs on our datasets. Even the highest-performing model attained only 66.9% and 64.3% accuracy, underscoring the considerable challenge posed by our datasets. This outcome illustrates the intricate nature of the tasks within our datasets and highlights the substantial room for improvement in Arabic language model development.

Authors

Saleh R. Al-Ballaa

Nora Al-Twairesh

AbdulMalik S. Al-Salman

Journals

PLoS ONE

Institutions

King Saud University
