What type of study is this?

This is a quantitative study.

October 2, 2025 · Open Access

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

Key Points

Multilingual benchmarking reveals noticeable cross-lingual performance differences among chatbots and exposes the limits of current LLM-based evaluation.
Using the MEDAL framework, the authors find that state-of-the-art LLM judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
The methodology uses multilingual user-chatbot dialogues generated by several LLMs and conditioned on varied seed contexts, which supports a meta-evaluation of LLM judges.
The findings point to the need for stronger evaluation tools for dialogue quality and for richer, more diverse benchmarks.

Abstract

Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
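The meta-evaluation step described in the abstract, comparing an LLM judge's verdicts with human annotations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `issue_recall`, the set-of-issues label format, and the toy labels are all assumptions made for illustration. It computes, for each issue type, the fraction of human-flagged dialogues in which the judge flagged the same issue.

```python
from collections import defaultdict


def issue_recall(human_labels, judge_labels):
    """Per-issue recall of a judge against human annotations.

    Both arguments are lists of sets, one set of issue names per dialogue.
    (Hypothetical label format, chosen only for this sketch.)
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    flagged = defaultdict(int)   # dialogues where humans flagged the issue
    detected = defaultdict(int)  # of those, dialogues where the judge agreed
    for human, judge in zip(human_labels, judge_labels):
        for issue in human:
            flagged[issue] += 1
            if issue in judge:
                detected[issue] += 1
    return {issue: detected[issue] / flagged[issue] for issue in flagged}


# Toy, made-up labels for four dialogues (not data from the paper).
human = [{"empathy"}, {"relevance", "commonsense"}, set(), {"empathy"}]
judge = [set(), {"relevance"}, set(), {"empathy"}]
recall = issue_recall(human, judge)
```

A low recall on a given issue would correspond to the kind of failure the abstract reports, where a judge misses problems that human annotators identified. Recall is only one possible agreement measure; the paper may report different metrics.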

Cite this study

Mendonça et al. studied this question.

www.synapsesocial.com/papers/68de5d9c83cbc991d0a202d9 — DOI: https://doi.org/10.48550/arxiv.2505.22777

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation · 2024 · 1 citation
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators · 2024 · 11 citations
Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments · 2025
METAL: Towards Multilingual Meta-Evaluation · 2024
Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs

Authors

John Mendonça

Alon Lavie

Isabel Trancoso
