What question did this study set out to answer?

This research assesses how well large language models interpret cervical cytopathology using standard cases.

May 8, 2026 | Open Access

Diagnostic Proficiency of Large Language Models in the Interpretation of Cervical Cytopathology

Key Points

The study evaluated three large language models: Gemini, Claude, and Copilot.
Thirty representative cases from a digital atlas served as the gold standard.
A standardized zero-shot prompt tested each model's ability to distinguish normal cells, benign modifications (infections), and cervical cell abnormalities.
Gemini and Copilot achieved 90% accuracy in identifying normal cells but struggled with infectious and neoplastic lesions.
All models misclassified several invasive carcinomas as benign or low-grade lesions.
Claude showed a high rate of misidentifying normal cells as abnormal, indicating diagnostic escalation.

Abstract

This diagnostic accuracy study evaluated how well three well-established multimodal large language models (LLMs), Gemini, Claude, and Copilot, interpret cervical cytopathology (Pap smears). Thirty representative cases from the “Cytopathology of the Uterine Cervix—Digital Atlas” served as the gold standard. Each model was tested with a standardized zero-shot prompt on its ability to distinguish among normal cells, benign modifications (infections), and cervical cell abnormalities. Performance varied considerably across models. Gemini and Copilot were highly proficient (90%) at identifying normal physiological morphology, but performance declined sharply for infectious pathogens and for dysplastic and neoplastic lesions. Notably, all models misclassified several invasive carcinomas as benign or low-grade lesions. Claude exhibited a high rate of “diagnostic escalation,” frequently misidentifying normal cells as abnormal. Because of significant rates of overdiagnosis and failure to detect invasive carcinomas, current LLMs are inadequate for the definitive diagnosis of cervical dysplasia and malignancy. They should be viewed as emerging educational aids that require rigorous human oversight, not as autonomous diagnostic tools.
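As an illustration of how results of this kind might be tabulated, the following minimal Python sketch computes per-category accuracy and separates errors into over-calls (the "diagnostic escalation" pattern) and under-calls (for example, a carcinoma labelled benign). It is not the authors' analysis code. The category labels, the severity ranking, the function names, and the toy case list are all assumptions made for this example, and the toy values are not study data.

```python
from collections import Counter

# Ordinal severity of the three categories named in the abstract.
# The numeric ranks are an assumption made for this sketch.
SEVERITY = {"normal": 0, "benign": 1, "abnormal": 2}


def per_category_accuracy(pairs):
    """Return {gold category: fraction of its cases labelled correctly}."""
    total, correct = Counter(), Counter()
    for gold, predicted in pairs:
        total[gold] += 1
        if predicted == gold:
            correct[gold] += 1
    return {category: correct[category] / total[category] for category in total}


def error_direction(pairs):
    """Count over-calls (escalation) and under-calls among the errors."""
    counts = Counter()
    for gold, predicted in pairs:
        if SEVERITY[predicted] > SEVERITY[gold]:
            counts["escalated"] += 1
        elif SEVERITY[predicted] < SEVERITY[gold]:
            counts["under_called"] += 1
    return counts


# Made-up (gold, predicted) pairs for illustration only; not study data.
toy_pairs = [
    ("normal", "normal"),
    ("normal", "abnormal"),
    ("benign", "normal"),
    ("abnormal", "abnormal"),
    ("abnormal", "benign"),
]

print(per_category_accuracy(toy_pairs))
print(error_direction(toy_pairs))
```

Reporting the direction of each error alongside accuracy matters here because the two failure modes carry different clinical risks: escalation leads to overdiagnosis, while under-calling can miss an invasive carcinoma.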
