Recently, many methods have been proposed for the automatic evaluation of dialogue systems, and they show high correlation with human evaluations. Although these methods tend to align well with the average scores of multiple evaluators, the scores may not reflect individual preferences. In this study, we propose an automatic evaluation method that incorporates the evaluation tendencies of specific individuals, such as system designers or specific users, in order to realize evaluations that align with individual preferences rather than average-based evaluations. We first focus on the differences in the aspects that each evaluator emphasizes in dialogue evaluation and compute weights for each sub-metric accordingly. Then, based on the obtained weights, we estimate an overall score for each dialogue system using the scores for each sub-metric produced by automatic evaluation. Through experiments involving multiple evaluators, we confirmed that our method can produce system evaluations that reflect individual evaluation tendencies. In this process, we utilized a Large Language Model (LLM) for the automatic evaluation and applied multiple regression analysis to determine the metric weights. The results show that, compared to evaluation by the LLM alone, incorporating individual regression-based weights reduces the mean squared error of the overall score, bringing it closer to each evaluator’s actual scores.
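The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes ordinary least squares as the multiple regression, three unnamed sub-metrics, and synthetic scores generated on the spot; the function names, the hypothetical evaluator weights, and the use of an unweighted mean as the LLM-only baseline are all assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def fit_evaluator_weights(sub_scores, overall_scores):
    """Fit overall ~ intercept + sub_scores @ weights by least squares."""
    design = np.column_stack([np.ones(len(sub_scores)), sub_scores])
    coef, *_ = np.linalg.lstsq(design, overall_scores, rcond=None)
    return coef[0], coef[1:]


def predict_overall(sub_scores, intercept, weights):
    """Combine sub-metric scores into an overall score for each dialogue."""
    return intercept + sub_scores @ weights


def mse(predicted, actual):
    """Mean squared error between two score vectors."""
    return float(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))


# Synthetic stand-in data (assumption): 40 dialogues, 3 sub-metrics on a 1-5 scale.
rng = np.random.default_rng(0)
n_dialogues, n_metrics = 40, 3
llm_sub_scores = rng.uniform(1, 5, size=(n_dialogues, n_metrics))

# Hypothetical evaluator who mostly cares about the first sub-metric.
true_weights = np.array([0.7, 0.2, 0.1])
evaluator_overall = llm_sub_scores @ true_weights + rng.normal(0, 0.1, n_dialogues)

# Stand-in for an LLM-only overall score (assumption): unweighted mean of sub-metrics.
llm_overall = llm_sub_scores.mean(axis=1)

intercept, weights = fit_evaluator_weights(llm_sub_scores, evaluator_overall)
weighted_overall = predict_overall(llm_sub_scores, intercept, weights)

print("fitted weights:", np.round(weights, 2))
print("MSE, LLM-only baseline:", round(mse(llm_overall, evaluator_overall), 4))
print("MSE, evaluator-specific weights:", round(mse(weighted_overall, evaluator_overall), 4))
```

Both errors here are measured on the same dialogues used to fit the weights, so the comparison is optimistic; a faithful replication would presumably score held-out dialogues for each evaluator.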
Keisuke Kameyama
Kazunori Komatani
Osaka Research Institute of Industrial Science and Technology
Transactions of the Japanese Society for Artificial Intelligence
Kameyama et al. (Sat,) studied this question.
synapsesocial.com/papers/69a67eb2f353c071a6f0a09c — DOI: https://doi.org/10.1527/tjsai.41-2_ids26-c