Abstract
It is likely that individuals are turning to Large Language Models (LLMs) to seek health advice, much like searching for diagnoses on Google. We evaluate clinical accuracy of GPT-3·5 and GPT-4 for suggesting initial diagnosis, examination steps and treatment of 110 medical cases across diverse clinical disciplines. Moreover, two model configurations of the Llama 2 open source LLMs are assessed in a sub-study. For benchmarking the diagnostic task, we conduct a naïve Google search for comparison. Overall, GPT-4 performed best with superior performances over GPT-3·5 considering diagnosis and examination and superior performance over Google for diagnosis. Except for treatment, better performance on frequent vs rare diseases is evident for all three approaches. The sub-study indicates slightly lower performances for Llama models. In conclusion, the commercial LLMs show growing potential for medical question answering in two successive major releases. However, some weaknesses underscore the need for robust and regulated AI models in health care. Open source LLMs can be a viable option to address specific needs regarding data privacy and transparency of training.
Sandmann et al. studied this question.
www.synapsesocial.com/papers/68e75683b6db6435876ce2ed — DOI: https://doi.org/10.1038/s41467-024-46411-8
Sarah Sandmann
Sarah Riepenhausen
Lucas Plagwitz
Nature Communications
University of Münster