Abstract
Data extraction in systematic reviews, maps, and meta-analyses is time-consuming and prone to human error and subjective judgment. Large Language Models offer the potential for saving time, yet their performance has been evaluated in only a limited range of platforms, disciplines, and review types. We assessed the performance of the Elicit platform across diverse data extraction tasks using journal articles from seven systematic reviews in life and environmental sciences. Human-extracted data served as the gold standard. For each review, we used eight articles for prompt development and another eight for testing. Initial prompts were iteratively refined until they exceeded 87% accuracy or for up to five rounds. We then tested extraction accuracy, reproducibility across user accounts, and the effect of Elicit's high-accuracy mode. Of 90 prompts considered, 70 exceeded 87% accuracy when compared with the gold standard, but accuracy tended to be lower when the prompts were tested on a new set of articles. Repeating data extractions with different Elicit user accounts resulted in 90% agreement on extracted values, though supporting quotes and reasoning matched in only 46% and 30% of cases, respectively. In high-accuracy mode, value matches dropped to 77%, with just 10% quote matches and 0% reasoning matches. Extraction accuracy did not differ by data type. Elicit also helped identify eight (<1%) errors in the gold standard data. Our results show that Elicit can complement, but not replace, human data extractors. Elicit may be best used for sanity checks and to evaluate the clarity of data extraction protocols. Prompts must be fine-tuned and independently validated.
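To make the evaluation logic concrete, the following is a minimal sketch of how a prompt's accuracy against a gold standard, and the agreement between two repeated runs, might be scored. It is not the authors' code. The function names, the article identifiers, the example values, and the use of exact matching after whitespace and case normalization are all assumptions made for illustration; only the 87% threshold and the eight-article set size come from the abstract.

```python
# Hypothetical sketch: scoring extracted values against a gold standard.
# Names, data, and the exact-match rule are illustrative assumptions.

ACCURACY_THRESHOLD = 0.87  # threshold stated in the abstract


def normalize(value):
    """Trim whitespace and lowercase so trivial formatting differences match."""
    return None if value is None else str(value).strip().lower()


def match_rate(candidate, reference):
    """Fraction of reference items whose candidate value matches after normalization."""
    if not reference:
        raise ValueError("reference set is empty")
    matches = sum(
        1
        for key, ref_value in reference.items()
        if normalize(candidate.get(key)) == normalize(ref_value)
    )
    return matches / len(reference)


def prompt_passes(extracted, gold, threshold=ACCURACY_THRESHOLD):
    """True if accuracy against the gold standard exceeds the threshold."""
    return match_rate(extracted, gold) > threshold


# Made-up values for eight articles (one prompt, one extracted field each).
gold = {
    "a1": "24", "a2": "field", "a3": "Australia", "a4": "12",
    "a5": "lab", "a6": "Brazil", "a7": "30", "a8": "field",
}

# First run: one formatting difference (a3) and one real mismatch (a7).
extracted = dict(gold, a3=" australia ", a7="31")

# Second run from a different (hypothetical) account, differing on a5.
repeat_run = dict(extracted, a5="field")

accuracy = match_rate(extracted, gold)          # accuracy vs. gold standard
agreement = match_rate(repeat_run, extracted)   # agreement between two runs

print(f"accuracy={accuracy:.3f}, passes={prompt_passes(extracted, gold)}")
print(f"agreement between runs={agreement:.3f}")
```

With only eight articles per set, a single mismatch moves accuracy by 12.5 percentage points, so under this scoring rule a prompt can miss at most one article and still exceed 87%.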
Lagisz et al. studied this question.