Large language models (LLMs) are artificial intelligence systems that process and synthesize data to generate outputs that resemble human language. Although LLMs are increasingly used in research, clinical care, and education, few studies have examined biases in their training datasets. When LLMs are used for physician training and clinical diagnosis, such biases could limit the quality of the content produced. Previous studies have shown that LLMs perform well on neurology board-style questions. However, whether these results are reliable and unbiased enough to justify using LLMs in clinical settings remains unclear. This study uses LLMs to analyze clinical neurology scenarios and evaluates how patient race and ethnicity affect their responses.
This mixed-methods study examined the diagnostic performance of two LLM platforms, ChatGPT and Meta AI (Llama 3), using five neurology clinical vignettes. Each platform's responses were evaluated for diagnostic accuracy, detail, and potential bias. A quadruple-board-certified neurologist (ME) and a board-certified internist (BB) reviewed the cases to ensure clarity and evaluated the responses generated by the AI platforms.
The diagnostic impressions from the two platforms were similar, with 39 of 40 diagnoses (97.5%) identical. However, in one case involving a patient with aphasia, gaze deviation, and suspected alcohol use, the LLM accurately diagnosed acute stroke for all patients except the Hispanic patient, for whom it diagnosed Wernicke’s encephalopathy instead. ChatGPT-4 provided more comprehensive responses, including both the diagnosis and additional management information, whereas Meta AI offered concise, less detailed answers. Although the primary diagnoses were consistent across racial and ethnic groups, the range of differential diagnoses and management options differed noticeably between groups. This raises concerns about potential bias in the training datasets.
ChatGPT-4 and Meta AI have the potential to address neurology clinical scenarios accurately; however, it is important to limit the impact of bias in their training datasets, because such bias may widen health disparities.
Manu et al. (Wed,) studied this question.