Manual plan evaluation faces reliability and scalability challenges. This research benchmarks human evaluations against a large language model (LLM) using a multi-agent approach and/or retrieval-augmented generation (RAG) to automate complex content analysis tasks. We find that LLMs generally perform comparably to humans, with most errors arising from overimplication and limited domain knowledge. The multi-agent approach substantially enhances the LLM’s performance, reducing common machine errors by over 50 percent. Integrating such LLM tools with human oversight will likely become the new norm for content analysis, and this study demonstrates how to leverage the efficiency and precision of artificial intelligence (AI) alongside humans’ contextual understanding and domain expertise.
Xinyu Fu
Chaosu Li
Journal of Planning Education and Research
University of Hong Kong
Texas A&M University
Hong Kong University of Science and Technology
Fu et al. studied this question.
www.synapsesocial.com/papers/68d44b3f31b076d99fa55165 — DOI: https://doi.org/10.1177/0739456x251372082