What does this research mean for the field?

An automatic evaluation method for dialogue systems that incorporates individual evaluator tendencies produces scores that better reflect personal preferences than average-based evaluations.

What question did this study set out to answer?

The aim is to create an evaluation method for dialogue systems that reflects individual evaluators' preferences instead of relying on average scores.

March 3, 2026

Open Access

Proposal of an Automatic Evaluation Method for Dialogue System Reflecting Individual Tendencies

Key Points

The aim is to create an evaluation method for dialogue systems that reflects individual evaluators' preferences instead of relying on average scores.
Utilized a Large Language Model (LLM) for automatic evaluation of dialogue systems
Identified and weighted sub-metrics based on individual evaluator tendencies
Applied multiple regression analysis to determine these metric weights
Conducted experiments with multiple evaluators to validate the method
Demonstrated that the new method reduces mean squared error compared to using the LLM alone
Showed evaluations aligned more closely with individual evaluators' actual scores
Highlighted the importance of factoring in individual preferences for more accurate assessments

Abstract

Recently, many methods have been proposed for automatic evaluation of dialogue systems, which show high correlation with human evaluations. Although these methods tend to align well with the average scores of multiple evaluators, the scores may not reflect individual preferences. In this study, we propose an automatic evaluation method that incorporates the evaluation tendencies of specific individuals, such as system designers or specific users, in order to realize evaluations that align with individual preferences, rather than average-based evaluations. We first focus on the differences in the aspects that each evaluator emphasizes in dialogue evaluation and compute weights for each sub-metric accordingly. Then, based on the obtained weights, we estimate an overall score for each dialogue system using the scores for each sub-metric produced by automatic evaluation. Through experiments involving multiple evaluators, we confirmed that our method can produce system evaluations that reflect individual evaluation tendencies. In this process, we utilized a Large Language Model (LLM) for the automatic evaluation and applied multiple regression analysis to determine the metric weights. The results show that, compared to evaluation by the LLM alone, incorporating individual regression-based weights leads to a reduction in the mean squared error of the overall score, making it closer to each evaluator’s actual scores.
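The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes that the weights are obtained by ordinary least squares, regressing one evaluator's overall scores on per-dialogue sub-metric scores, and that the "LLM alone" baseline is a plain average of the sub-metric scores. The abstract does not specify either detail, so both are assumptions, as are all function names, the number of sub-metrics, and the synthetic data; none of the numbers come from the paper.

```python
import numpy as np


def fit_evaluator_weights(sub_scores, overall_scores):
    """Fit an intercept and one weight per sub-metric by ordinary least squares."""
    sub_scores = np.asarray(sub_scores, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(sub_scores)), sub_scores])
    coef, *_ = np.linalg.lstsq(design, np.asarray(overall_scores, dtype=float), rcond=None)
    return coef  # [intercept, w_1, ..., w_K]


def predict_overall(coef, sub_scores):
    """Combine sub-metric scores into an overall score using fitted weights."""
    return coef[0] + np.asarray(sub_scores, dtype=float) @ coef[1:]


def mse(predicted, actual):
    """Mean squared error between two score vectors."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.mean(diff ** 2))


# Synthetic stand-in data (hypothetical): 20 dialogues, 3 sub-metrics on a 1-5 scale.
rng = np.random.default_rng(0)
llm_sub_scores = rng.uniform(1.0, 5.0, size=(20, 3))

# Hypothetical evaluator who weights the first sub-metric far more than the others.
true_coef = np.array([0.5, 0.8, 0.1, 0.1])
evaluator_overall = true_coef[0] + llm_sub_scores @ true_coef[1:]

# Fit on the first 15 dialogues, evaluate on the remaining 5.
coef = fit_evaluator_weights(llm_sub_scores[:15], evaluator_overall[:15])
personalized = predict_overall(coef, llm_sub_scores[15:])

# Assumed baseline: unweighted mean of the sub-metric scores.
llm_alone = llm_sub_scores[15:].mean(axis=1)

mse_personalized = mse(personalized, evaluator_overall[15:])
mse_llm_alone = mse(llm_alone, evaluator_overall[15:])
print("personalized MSE:", mse_personalized)
print("LLM-alone MSE:  ", mse_llm_alone)
```

Because the synthetic evaluator is exactly linear and noise-free, the regression recovers the weights almost perfectly here. Real evaluator scores are noisy, so the gap between the two errors would depend on the data and on how many rated dialogues are available per evaluator.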

Authors

Keisuke Kameyama

Kazunori Komatani

Osaka Research Institute of Industrial Science and Technology

Journals

Transactions of the Japanese Society for Artificial Intelligence
