What question did this study set out to answer?

The aim is to develop an explainable deep learning model for accurately classifying oral cancer in histopathology images while providing prognostic insights.

April 4, 2026 · Open Access

Explainable vision transformer framework for multi-class classification and prognostic interpretation of oral cancer in histopathology images

Key points

Introduced an explainable vision transformer framework (MMX-ViT) for image classification.
Fused convolutional feature extraction with transformer-based global attention using an Adaptive Cross-Fusion Module (ACFM).
Trained and evaluated on a publicly available oral cancer histopathology dataset with four diagnostic categories.
Compared performance with eight state-of-the-art architectures.
Achieved a classification performance of 98.45% and an AUC of 0.99, outperforming all baseline methods.
Identified biologically relevant features such as dysplastic nuclei and keratin pearls through explainability analysis.
Returned an Explainability Consistency Index (XCI) value of 94%, indicating reliable interpretability.

Abstract

Oral cancer, particularly oral squamous cell carcinoma (OSCC), remains a major global health concern because of its high prevalence, late diagnosis, and the limited prognostic precision of conventional histopathological evaluation. Although deep learning has shown promising results in automated cancer classification, most current models, especially CNN-based architectures, generally lack interpretability, generalization capability, and prognostic insight, which limits their clinical applicability. To address these shortcomings, this work introduces an explainable vision transformer framework (MMX-ViT) for multi-class classification and prognostic interpretation of oral cancer in histopathology images. The proposed model fuses convolutional feature extraction with transformer-based global attention through an Adaptive Cross-Fusion Module (ACFM), enabling efficient multi-scale learning of cellular and tissue-level features. MMX-ViT was trained and evaluated on a publicly available oral cancer histopathology dataset, extended in this study into four diagnostic categories, and compared with eight state-of-the-art architectures. It reached a classification performance of 98.45% with an AUC of 0.99, surpassing all baseline methods. Explainability analysis based on Grad-CAM++, SHAP, and Transformer Attention Rollout showed that the model attends to biologically relevant regions, such as dysplastic nuclei, keratin pearls, and stromal invasion zones, with an Explainability Consistency Index (XCI) of 94%. The proposed model represents a major step towards reliable and interpretable AI-based diagnosis of oral cancer.
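
The abstract names Transformer Attention Rollout as one of the explainability techniques but does not describe how it is computed. As a rough illustration only, the generic form of attention rollout can be sketched as follows: average the attention heads in each layer, add the identity matrix to account for residual connections, re-normalise each row, and multiply the resulting matrices across layers. The function names and the toy input are hypothetical and are not taken from the MMX-ViT implementation, which may differ.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def attention_rollout(layers: List[List[Matrix]]) -> Matrix:
    """Generic attention rollout (illustrative, not the paper's code).

    layers[l][h] is the (tokens x tokens) attention matrix of head h in
    layer l, with layer 0 closest to the input. Each row is assumed to
    sum to 1, as it would after a softmax.
    """
    n = len(layers[0][0])
    # Start from the identity: every token attends only to itself.
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layers:
        # Average the attention maps over all heads in this layer.
        avg = [
            [sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
            for i in range(n)
        ]
        # Add the identity to model the residual connection.
        aug = [
            [avg[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        # Re-normalise so that each row sums to 1 again.
        aug = [[v / sum(row) for v in row] for row in aug]
        # Propagate attention from this layer down to the input tokens.
        rollout = matmul(aug, rollout)
    return rollout


# Hypothetical toy input: 2 layers, 1 head each, 2 tokens, uniform attention.
toy_layers = [[[[0.5, 0.5], [0.5, 0.5]]], [[[0.5, 0.5], [0.5, 0.5]]]]
toy_rollout = attention_rollout(toy_layers)
```

In a vision transformer, the row of the rollout matrix that belongs to the classification token is typically reshaped onto the patch grid to produce a heatmap over the image; how MMX-ViT maps these scores onto histopathology patches is not stated in the abstract.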
