What question did this study set out to answer?

This study examines the influence of patient race and ethnicity on the diagnostic accuracy of large language models (LLMs) in clinical neurology scenarios.

May 7, 2026 | Open Access

Evaluating AI performance in Clinical Neurology: The impact of Patient Race/Ethnicity on Large Language Model Accuracy

Key Points

The authors tested two LLM platforms, ChatGPT and Meta AI (Llama 3), on five neurology clinical vignettes.
Responses were assessed for diagnostic accuracy, detail, and potential bias by a board-certified neurologist and a board-certified internist.
Primary diagnoses were compared across racial and ethnic groups.
The two platforms gave identical diagnoses in 39 of 40 cases (97.5% agreement).
In one case, every patient was diagnosed with acute stroke except the Hispanic patient, who was diagnosed with Wernicke’s encephalopathy, suggesting potential bias in the responses.
ChatGPT-4 provided more extensive management information than Meta AI, whose responses were concise.

Abstract

Large Language Models (LLMs) are artificial intelligence systems that process and synthesize data to generate outputs that resemble human language. While LLMs are increasingly used in research, clinical care, and education, few studies have examined biases in their training datasets. When LLMs are used for physician training and clinical diagnosis, such biases could limit the quality of the content produced. Previous studies have shown that LLMs perform well on neurology board-style questions. However, whether these results are reliable and unbiased enough to justify using LLMs in clinical settings remains unclear. This study uses LLMs to analyze clinical neurology scenarios and evaluates how patient race and ethnicity affect their responses. This mixed-methods study examined the diagnostic performance of two LLM platforms, ChatGPT and Meta AI (Llama 3), using five neurology clinical vignettes. Each platform's responses were evaluated for diagnostic accuracy, detail, and potential bias. A quadruple-board-certified neurologist (ME) and a board-certified internist (BB) reviewed the cases to ensure clarity and evaluated the responses generated by the AI platforms. The diagnostic impressions from both platforms were similar, with 39 out of 40 diagnoses (97.5%) identical. However, in one case involving a patient with aphasia and gaze deviation who had suspected alcohol use, the LLM accurately diagnosed acute stroke for all patients except the Hispanic patient, for whom it diagnosed Wernicke’s encephalopathy instead. ChatGPT-4 provided more comprehensive responses, including both the diagnosis and additional management information, while Meta AI offered concise and less detailed answers. Although the primary diagnoses were consistent across different racial and ethnic groups, there were noticeable differences in the range of differential diagnoses and management options. This raises concerns about potential bias in the training datasets.
ChatGPT-4 and Meta AI possess the potential to address neurology clinical scenarios accurately; however, it is important to limit the negative impact of bias in the training datasets, which may contribute to increasing health disparities.
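The 97.5% figure reported above is the share of diagnoses on which the two platforms agreed (39 of 40). A minimal sketch of that arithmetic follows; the function name `percent_agreement` is a hypothetical helper introduced here for illustration and is not taken from the study.

```python
def percent_agreement(matching: int, total: int) -> float:
    """Return the percentage of paired diagnoses that were identical.

    `matching` is the number of cases where both platforms gave the same
    diagnosis; `total` is the number of cases compared.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of cases")
    if not 0 <= matching <= total:
        raise ValueError("matching must be between 0 and total")
    return 100.0 * matching / total


# Counts taken from the abstract: 39 identical diagnoses out of 40.
rate = percent_agreement(39, 40)
print(f"{rate:.1f}%")  # 97.5%
```

Note that this measures agreement between the two platforms, which is a different quantity from each platform's accuracy against a reference diagnosis.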


Cite This Study

Manu et al. studied this question.

synapsesocial.com/papers/69fc2ba98b49bacb8b34794d https://doi.org/10.1016/j.neuros.2026.100037
