Oral cancer, particularly Oral Squamous Cell Carcinoma (OSCC), remains a major global health concern due to its high prevalence, late diagnosis, and the limited prognostic precision of conventional histopathological evaluation. Although deep learning has shown promising results in automated cancer classification, most current models, especially CNN-based architectures, lack interpretability, generalization capability, and prognostic insight, which limits their clinical applicability. To address these shortcomings, this work introduces an Explainable Vision Transformer framework (MMX-ViT) for multi-class classification and prognostic interpretation of oral cancer in histopathology images. The proposed model fuses convolutional feature extraction with transformer-based global attention through an Adaptive Cross-Fusion Module (ACFM), enabling efficient multi-scale learning of cellular and tissue-level features. MMX-ViT was trained and evaluated on a publicly available oral cancer histopathology dataset, extended in this study into four diagnostic categories, and compared with eight state-of-the-art architectures. It achieved a classification performance of 98.45% with an AUC of 0.99, surpassing all baseline methods. Explainability analysis based on Grad-CAM++, SHAP, and Transformer Attention Rollout showed that the model attends to biologically relevant regions, such as dysplastic nuclei, keratin pearls, and stromal invasion zones, with an Explainability Consistency Index (XCI) of 94%. The proposed model represents a significant step toward reliable and interpretable AI-based diagnosis of oral cancer.
Mahanty et al. studied this question.
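
Of the explanation techniques named above, Transformer Attention Rollout has a compact standard formulation that may be worth illustrating. The sketch below is not the authors' implementation, which the abstract does not describe; it follows the commonly used recipe (average the heads, mix in the identity to account for residual connections, renormalize, and multiply across layers). The function names `attention_rollout` and `cls_patch_relevance`, the assumption that a class token sits at index 0, and the random attention maps used as stand-in input are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance map.

    attentions: sequence of arrays, one per layer, each shaped
    (heads, tokens, tokens) with rows summing to 1 (post-softmax).
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)             # average over heads
        a = 0.5 * a + 0.5 * identity            # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = a @ rollout                   # propagate through this layer
    return rollout


def cls_patch_relevance(attentions):
    """Relevance of each patch token to the class token (assumed at index 0)."""
    rollout = attention_rollout(attentions)
    scores = rollout[0, 1:]                     # class-token row, patches only
    return scores / scores.sum()


# Stand-in input: random attention for 4 layers, 3 heads, 1 + 16 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 17, 17))
weights = np.exp(logits)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the last axis

relevance = cls_patch_relevance(list(weights))
heatmap = relevance.reshape(4, 4)               # 16 patches as a 4x4 grid
```

In practice the per-layer attention maps would come from the trained model, and the resulting grid would be upsampled to the image size before being overlaid on the histopathology slide.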