What question did this study set out to answer?

The research aims to assess the effectiveness of a Generative AI chatbot in evaluating large-scale educational data against human ratings.

March 14, 2026 | Open Access

Evaluating generative AI chatbots for large-scale assessment data: comparing LLM-as-a-judge and human ratings

Key Points

The research aims to assess the effectiveness of a Generative AI chatbot in evaluating large-scale educational data against human ratings.
Developed a customized generative AI chatbot using a retrieval-augmented generation (RAG) framework.
Compared LLM-as-a-judge evaluations with human expert ratings on correctness, completeness, and communication.
Evaluated chatbot responses using a three-dimensional framework and computed interrater reliability using quadratic weighted kappa.
LLM-as-a-judge demonstrated reliability comparable to human ratings across evaluation dimensions.
No significant differences between inter-human and human-to-LLM agreement, except in communication quality.
LLM-based evaluation offers a scalable and cost-effective alternative to human assessments.

Abstract

This study focuses on developing and evaluating a customized Generative AI chatbot designed to enhance access to large-scale educational data. The chatbot aims to assist researchers and policymakers in exploring complex datasets, such as NAEP, through natural language queries. The chatbot was built using a Retrieval-Augmented Generation (RAG) framework that integrates multiple specialized agents to retrieve, interpret, and synthesize educational data. One agent was selected as a case study for performance evaluation. The study compared an automated Large Language Model (LLM)-based evaluation (“LLM-as-a-judge”) with human expert ratings to examine validity and consistency across three criteria: correctness, completeness, and communication quality. A total of 141 expert-generated questions reflecting typical user queries were used, each accompanied by a reference answer and source documentation. Human raters scored the chatbot's responses on these three dimensions. For the LLM-based evaluation, the model was given the rubric, the human-written reference answers, and the retrieved RAG contents, and it generated automated quality assessments. Interrater reliability among human raters, and between human raters and the LLM-as-a-judge, was computed with quadratic weighted kappa (QWK). Findings showed that the LLM-as-a-judge approach achieved agreement levels comparable to those of human raters and demonstrated reliability across all evaluation dimensions. Interrater reliability analyses revealed no significant differences between inter-human and human-to-LLM agreement, except in the communication dimension, where human-to-LLM consistency was higher. These results indicate that the LLM-as-a-judge method can serve as a viable and consistent alternative to human evaluation for customized RAG-based chatbot assessment.
Integrating LLM-based evaluation into the assessment of Generative AI chatbots provides a scalable, reliable, and cost-effective complement to traditional human review. With human oversight for calibration and validation, this approach enables more efficient and consistent evaluation practices, advancing the use of AI tools that facilitate broader access to large-scale educational data.
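
The abstract names quadratic weighted kappa (QWK) as the agreement statistic but does not show how it is computed. The following is a minimal sketch of the standard QWK definition, not the authors' code. The function name, the 1 to 4 score range, and the example ratings are all hypothetical; the abstract does not state the rubric's scale or publish any ratings.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, min_score, max_score):
    """Standard quadratic weighted kappa for two raters on an integer scale.

    ratings_a, ratings_b: equal-length sequences of integer scores.
    min_score, max_score: bounds of the rating scale (assumed, not from the paper).
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two categories")

    k = max_score - min_score + 1  # number of score categories
    n = len(ratings_a)

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters gave one identical score to every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical scores for one dimension (e.g. correctness) on a 1-4 scale.
    human = [4, 3, 4, 2, 1, 3, 4, 2]
    llm_judge = [4, 3, 3, 2, 1, 3, 4, 1]
    print(quadratic_weighted_kappa(human, llm_judge, 1, 4))
```

In a comparison like the one described, the same function would be applied once per dimension to each human-human pair and each human-LLM pair, so the two sets of agreement values can be compared.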

Authors

Ting Zhang

Luke Patterson

Blue Webb

Journals

Large-scale Assessments in Education
