September 10, 2025 | Open Access

Diagnosing Bias and Instability in LLM Evaluation: A Scalable Pairwise Meta-Evaluator

Key Points

The proposed framework diagnoses positional bias and ranking instability in LLM-based evaluation of large language models.
Across more than 3600 pairwise comparisons, the top-performing model (gemma:7b-instruct) achieved a 66.5% win rate.
Kendall’s Tau analysis showed that local model rankings varied substantially across prompts, suggesting that semantic context influences evaluator judgment.
The findings highlight the importance of structured and resilient evaluation methods for large language models.

Abstract

The evaluation of large language models (LLMs) increasingly relies on other LLMs acting as automated judges. While this approach offers scalability and efficiency, it raises serious concerns about evaluator reliability, positional bias, and ranking stability. This paper presents a scalable framework for diagnosing positional bias and instability in LLM-based evaluation using controlled pairwise comparisons judged by multiple independent language models. The system supports mirrored comparisons with reversed response order, prompt injection, and surface-level perturbations (e.g., paraphrasing, lexical noise), enabling fine-grained analysis of evaluator consistency and verdict robustness. More than 3600 pairwise comparisons were conducted across five instruction-tuned open-weight models using ten open-ended prompts. The top-performing model (gemma:7b-instruct) achieved a 66.5% win rate. Agreement among judges was uniformly high (100% consistency), yet 48.4% of verdicts reversed under mirrored response order, indicating strong positional bias. Kendall’s Tau analysis further showed that local model rankings varied substantially across prompts, suggesting that semantic context influences evaluator judgment. All evaluation traces were stored in a graph database (Neo4j), enabling structured querying and longitudinal analysis. The proposed framework provides a diagnostic lens for benchmarking models, a blueprint for fairer and more interpretable LLM-based evaluation, and a reproducible path for diagnosing evaluator bias and ranking instability in open-ended language tasks. These findings underscore the need for structure-aware, perturbation-resilient evaluation pipelines when benchmarking LLMs. Future work will apply this methodology to educational assessment tasks, using rubric-based scoring and graph-based traceability to evaluate student responses in technical domains.
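
The two diagnostics named in the abstract, the verdict reversal rate under mirrored response order and Kendall’s Tau between per-prompt rankings, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the verdict schema, and the toy data (including the placeholder model names) are assumptions made for illustration, and the sketch uses the tau-a variant, which assumes rankings without ties.

```python
from itertools import combinations


def reversal_rate(verdicts):
    """Fraction of comparisons whose winner changes when response order is mirrored.

    `verdicts` maps a comparison id to a pair
    (winner_in_original_order, winner_in_mirrored_order).
    This schema is hypothetical, chosen only for the example.
    """
    if not verdicts:
        raise ValueError("no verdicts supplied")
    flipped = sum(1 for original, mirrored in verdicts.values() if original != mirrored)
    return flipped / len(verdicts)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same models (no ties assumed).

    Each ranking is a list of model names ordered from best to worst.
    """
    if set(rank_a) != set(rank_b) or len(rank_a) != len(rank_b):
        raise ValueError("rankings must contain the same models")
    if len(rank_a) < 2:
        raise ValueError("need at least two models to compare rankings")
    pos_a = {model: i for i, model in enumerate(rank_a)}
    pos_b = {model: i for i, model in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data only; these are not results from the paper.
toy_verdicts = {
    "prompt1:model_a-vs-model_b": ("model_a", "model_a"),
    "prompt1:model_a-vs-model_c": ("model_a", "model_c"),
    "prompt2:model_a-vs-model_b": ("model_b", "model_b"),
    "prompt2:model_b-vs-model_c": ("model_b", "model_c"),
}
toy_rate = reversal_rate(toy_verdicts)

ranking_prompt1 = ["model_a", "model_b", "model_c"]
ranking_prompt2 = ["model_a", "model_c", "model_b"]
toy_tau = kendall_tau(ranking_prompt1, ranking_prompt2)
```

In this toy setup, a reversal rate near 0.5 would mean the judge's verdict depends on presentation order about as often as not, and a tau well below 1 between two prompts would mean the model ordering is not stable across prompts.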

Authors

Cătălin Anghel

Andreea Alexandra Anghel

Emilia Pecheanu

Institutions

"Dunarea de Jos" University of Galati
