July 4, 2024 | Open Access

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across a range of Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been integrated into evaluation frameworks and, together with human evaluation, form the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and assess aspects such as Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.

Cite this study

Mendonça et al. (2024) studied this question.

www.synapsesocial.com/papers/68e616ccb6db6435875a9979 — DOI: https://doi.org/10.48550/arxiv.2407.03841

Authors

John Mendonça

Alon Lavie

Isabel Trancoso
