May 16, 2026 · Open Access

Evaluating creative work with artificial intelligence: Evidence from constrained innovation tasks

Key Points

This study asks whether a large language model can reliably evaluate human creativity in the constrained tasks typical of creative industries.
The analysis uses expert-generated creative outputs from a controlled experiment.
A large language model (ChatGPT) served as an evaluator, and its judgments were compared with expert assessments.
Results were analyzed for internal consistency, evaluative variability, and task-specific dimensions.
AI evaluations showed internal consistency comparable to that of expert judges, and replacing one human judge with the AI did not reduce inter-rater reliability.
AI scores were less variable and systematically higher than those of human judges.
AI evaluations were structured along four creative dimensions: fluency, flexibility, originality, and elaboration.

Abstract

We study whether a large language model can reliably evaluate human creativity in constrained, innovation-like tasks. Using expert-generated creative outputs from a validated experiment with workers in cultural and creative industries, we embed ChatGPT as an evaluator and benchmark its assessments against expert human judgments obtained through the Consensual Assessment Technique. Study 1 supports AI reliability by showing that AI-based creativity evaluations exhibit internal consistency comparable to that of expert judges across repeated and independent runs, even under conservative scenarios. Replacing a human judge with an AI evaluator does not reduce inter-rater reliability across drawing, mathematical, and verbal tasks. Beyond reliability, AI evaluations display three additional features that are difficult to achieve with human-only panels: lower evaluative variability, systematically higher scores consistent with a potentially more inclusive evaluative stance, and task-independence of evaluative standards. Study 2 further supports task-independence by showing that AI evaluations are structured along fluency, flexibility, originality, and elaboration, with dimension weights that adapt to task-specific constraints.

• We test AI evaluation of human creativity on outputs from a controlled experiment.
• We study constrained, innovation-like creative tasks.
• Replacing one human judge with AI preserves panel reliability.
• AI scores are less dispersed, higher on average, and task-independent.
• AI evaluation is structured by fluency, flexibility, originality, and elaboration.
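To make the "replace one human judge with AI" comparison concrete, here is a minimal sketch of how panel reliability could be compared before and after the swap. The abstract does not say which reliability statistic the authors use; Cronbach's alpha is assumed here only because it is a common choice for Consensual Assessment Technique panels. The function name `cronbach_alpha` and all scores below are hypothetical and invented for illustration; they are not data from the study.

```python
from statistics import mean, variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a panel of judges.

    ratings: list of judges, each a list of scores for the same items,
    in the same item order. Assumed statistic, not necessarily the
    one used in the study.
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("need at least two judges")
    n_items = len(ratings[0])
    if n_items < 2 or any(len(r) != n_items for r in ratings):
        raise ValueError("every judge must score the same >= 2 items")

    # Sample variance of each judge's scores across items.
    judge_vars = [variance(r) for r in ratings]
    # Sample variance of the per-item totals summed over judges.
    totals = [sum(r[i] for r in ratings) for i in range(n_items)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("item totals have zero variance")

    return (k / (k - 1)) * (1 - sum(judge_vars) / total_var)


# Invented toy scores for five creative outputs (illustration only).
human_1 = [1, 2, 3, 4, 5]
human_2 = [2, 2, 3, 5, 5]
human_3 = [1, 3, 3, 4, 4]
ai_judge = [3, 3, 4, 5, 5]  # hypothetical AI scores

alpha_humans = cronbach_alpha([human_1, human_2, human_3])
alpha_mixed = cronbach_alpha([human_1, human_2, ai_judge])

if __name__ == "__main__":
    print(f"human-only panel alpha: {alpha_humans:.3f}")
    print(f"panel with AI judge alpha: {alpha_mixed:.3f}")
    print(f"mean human score: {mean(human_1 + human_2 + human_3):.2f}")
    print(f"mean AI score: {mean(ai_judge):.2f}")
```

In this toy setup, the swap would count as preserving reliability if `alpha_mixed` is not meaningfully lower than `alpha_humans`. A real analysis would also need repeated AI runs and uncertainty estimates, which this sketch leaves out.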
