Large language models (LLMs) such as ChatGPT have demonstrated potential for interpretation in various scientific disciplines; however, their application in forensic toxicology remains unexamined. We investigated how LLMs perform, compared with experts, at interpreting drug concentrations in body fluids when used as a triage tool. In this preliminary study, the results from 10 anonymised forensic toxicology cases from published sources were submitted as prompts to Microsoft 365 Copilot and ChatGPT version 3.5. AI-generated outputs were assessed against the published expert interpretations for accuracy in drug identification, risk categorisation (fatal, life-threatening, severe, etc.), caveats to the interpretation (e.g. post-mortem redistribution), and expression of confidence (suggests, strongly suggests, etc.). LLMs correctly identified 93% of 15 substances across the cases, but in 70% of cases used markedly overconfident language. The number of caveats given with each interpretation also differed: experts gave 0 to 5 caveats per case (mean = 2.2), compared with 0 to 5 for Copilot (mean = 1.3) and 1 to 7 for ChatGPT (mean = 3.9). While the results indicate that LLMs may assist in early triage under supervision, their use in evidential contexts is not currently supportable due to errors in drug identification, inappropriate use of language, a lack of nuance in the interpretation of drug concentrations, and an inconsistent approach to caveats.
Riga et al. studied this question.