Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
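As an illustration of the meta-evaluation step, the sketch below scores a judge by how often it flags the issues that human annotators flagged, broken down by quality dimension. It is a minimal sketch under stated assumptions, not the procedure used by MEDAL: the abstract does not specify the agreement metric, and the function name `detection_recall`, the tuple layout, and the toy records are all hypothetical.

```python
from collections import defaultdict


def detection_recall(samples):
    """Per-dimension recall of a judge against human issue labels.

    `samples` is an iterable of (dimension, human_flag, judge_flag) tuples,
    where each flag is True when that annotator marked an issue.
    This layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, human_flag, judge_flag in samples:
        if human_flag:  # only human-flagged issues count toward recall
            totals[dimension] += 1
            if judge_flag:
                hits[dimension] += 1
    return {d: hits[d] / totals[d] for d in totals}


# Toy records invented for this example; they are not results from the paper.
toy = [
    ("empathy", True, False),
    ("empathy", True, True),
    ("commonsense", True, False),
    ("relevance", True, True),
    ("relevance", False, False),
]
recall = detection_recall(toy)
```

A low recall on a dimension would indicate that the judge misses issues humans consider present, which is the kind of failure the abstract describes for empathy, commonsense, and relevance.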
Mendonça et al. studied this question.
www.synapsesocial.com/papers/68de5d9c83cbc991d0a202d9 — DOI: https://doi.org/10.48550/arxiv.2505.22777
John Mendonça
Alon Lavie
Isabel Trancoso