This study evaluated the diagnostic proficiency of three well-established multimodal large language models (LLMs), namely Gemini, Claude, and Copilot, in interpreting cervical cytopathology (Pap smears). This diagnostic accuracy study used 30 representative cases from the “Cytopathology of the Uterine Cervix—Digital Atlas,” whose reference diagnoses served as the gold standard. Each model was tested with a standardized, zero-shot prompt and asked to distinguish between normal cells, benign modifications (infections), and cervical cell abnormalities. Performance varied considerably across the models. While Gemini and Copilot showed high proficiency (90%) in identifying normal physiological morphology, performance declined sharply for infectious pathogens and dysplastic/neoplastic lesions. Notably, all models misclassified several invasive carcinomas as benign or low-grade lesions. Claude exhibited a high rate of “diagnostic escalation,” frequently misidentifying normal cells as abnormal. Current LLMs are inadequate for the definitive diagnosis of cervical dysplasia and malignancy because of substantial rates of overdiagnosis and failure to detect invasive carcinomas. They should be viewed as emerging educational aids that require rigorous human oversight, not as autonomous diagnostic tools.
Psilopatis et al. studied this question.
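The comparison described in the abstract (model output versus atlas reference diagnosis, broken down by diagnostic category) can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The category names, the function names `per_category_accuracy` and `escalation_rate`, and the toy labels are all assumptions made for illustration, and the toy labels are invented rather than taken from the study.

```python
from collections import defaultdict

# Hypothetical labels mirroring the three groups named in the abstract.
NORMAL, BENIGN, ABNORMAL = "normal", "benign", "abnormal"


def per_category_accuracy(reference, predicted):
    """Fraction of cases in each reference category that the model labelled correctly."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ref, pred in zip(reference, predicted):
        totals[ref] += 1
        if ref == pred:
            correct[ref] += 1
    return {category: correct[category] / totals[category] for category in totals}


def escalation_rate(reference, predicted):
    """Fraction of reference-normal cases that the model labelled as abnormal."""
    normal_case_predictions = [p for r, p in zip(reference, predicted) if r == NORMAL]
    if not normal_case_predictions:
        return 0.0
    escalated = sum(p == ABNORMAL for p in normal_case_predictions)
    return escalated / len(normal_case_predictions)


# Invented toy data for illustration only; these are NOT results from the study.
reference = [NORMAL, NORMAL, BENIGN, ABNORMAL, ABNORMAL]
predicted = [NORMAL, ABNORMAL, BENIGN, BENIGN, ABNORMAL]

accuracy = per_category_accuracy(reference, predicted)
escalation = escalation_rate(reference, predicted)
```

Reporting accuracy per reference category, rather than as one pooled figure, keeps a strong result on normal smears from masking missed carcinomas. The separate escalation measure captures the overdiagnosis pattern the abstract attributes to Claude.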