What question did this study set out to answer?

To assess the efficacy of convolutional neural networks for classifying Pap smear images and understand model generalization.

March 15, 2026 | Open Access

Assessing the efficacy of convolutional neural networks for Pap smear classification: a real world analysis

Key Points

To assess the efficacy of convolutional neural networks for classifying Pap smear images and understand model generalization.
Evaluated three CNN architectures (VGG16, ResNet50, InceptionV3) on four curated Pap smear datasets.
Used stratified 5-fold cross-validation to identify the best performing model per dataset.
Conducted external evaluations using a non-curated, real-world dataset.
All architectures performed robustly on curated benchmarks, with mean Macro-F1 scores between 73.58% and 99.28%.
Significant performance drop on the Real-World dataset (Macro-F1: 33.25–55.91%), highlighting domain shift.
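The model trained on a combined heterogeneous dataset achieved the highest inter-domain performance, suggesting that data diversity improves robustness; high-grade lesions were the class most sensitive to real-world variability.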
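
Macro-F1, the metric reported in the key points above, is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The following is a minimal sketch of that calculation in plain Python, not the authors' evaluation code. The class names and the toy label lists are hypothetical and chosen only to illustrate the arithmetic.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Classes are taken from the union of labels seen in y_true and y_pred.
    A class with no true positives, false positives, or false negatives
    contributes an F1 of 0.0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (not data from the study).
y_true = ["normal"] * 6 + ["low_grade"] * 2 + ["high_grade"] * 2
y_pred = ["normal"] * 6 + ["low_grade", "normal"] + ["normal", "normal"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro-F1 = {macro_f1(y_true, y_pred):.4f}")
```

In this toy case the classifier misses every hypothetical high-grade sample, so Macro-F1 falls well below accuracy. That is why a class-averaged metric is informative when clinically important classes are rare.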

Abstract

Background: Undetected cervical lesions can progress to cancer, a leading cause of mortality among women worldwide. While automated analysis of Papanicolaou (Pap) smear images using convolutional neural networks (CNNs) has demonstrated significant potential for screening, most existing studies rely on single curated datasets. This aspect limits the understanding of model generalization to the noise and variability inherent in real-world clinical cytology.

Methods: We evaluated three CNN architectures (VGG16, ResNet50, and InceptionV3) across four curated Pap smear datasets using stratified 5-fold cross-validation. For each dataset, the model achieving the highest mean Macro-F1 score was selected for further analysis. To assess robustness against domain shift, we performed an external evaluation using a non-curated, Real-World dataset comprising routine clinical images.

Results: All architectures achieved robust performance on the curated benchmarks, with mean Macro-F1 scores ranging from 73.58% to 99.28%. However, performance dropped significantly when models were evaluated on the Real-World dataset (Macro-F1: 33.25–55.91%), highlighting the severity of the domain gap. Notably, the model trained on a combined heterogeneous dataset achieved the highest inter-domain performance, suggesting that data diversity improves robustness. Class-wise analysis revealed that high-grade lesions were most sensitive to real-world variability.

Conclusions: Although CNNs achieve state-of-the-art results on curated benchmarks, their direct applicability to routine cytology workflows is hindered by domain shift. Our findings emphasize that evaluating models across heterogeneous, multi-source datasets is a prerequisite for reliable clinical deployment.
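
The stratified 5-fold cross-validation mentioned in the abstract can be illustrated with a short sketch. It is written in plain Python under stated assumptions and is not the authors' pipeline. The function name `stratified_kfold`, the seed, and the toy labels are hypothetical. The round-robin assignment is one simple way to keep class proportions similar across folds, and library implementations may differ in detail.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, val_indices) for k folds.

    Samples of each class are shuffled and dealt round-robin into the
    folds, so each fold's class proportions approximate the full set.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, val


# Hypothetical toy labels: 20 "normal" and 10 "abnormal" samples.
toy_labels = ["normal"] * 20 + ["abnormal"] * 10
splits = list(stratified_kfold(toy_labels, k=5))
for fold_no, (train, val) in enumerate(splits, start=1):
    n_abnormal = sum(toy_labels[i] == "abnormal" for i in val)
    print(f"fold {fold_no}: {len(train)} train, {len(val)} val, "
          f"{n_abnormal} abnormal in val")
```

In a setup like the one described, a model would be trained on each `train` split and scored on the matching `val` split, and the per-fold Macro-F1 values would be averaged to obtain the mean used for model selection.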

Authors

Sidnir Carlos Baia Ferreira

Romário Silva

Carlos André de Mattos Teixeira

Journals

PeerJ Computer Science

Institutions

National Institute for Space Research

Universidade Federal do Pará

Instituto Federal de Educação, Ciência e Tecnologia do Pará
