Accurate and interpretable cancer classification in histopathological images remains a significant challenge due to the complex structural variations in tissue samples. In this paper, we propose MDeiT, a lightweight and interpretable sequential hybrid model that effectively integrates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to enhance both classification accuracy and efficiency. Unlike traditional ensemble-based hybrid models, our framework adopts a streamlined design, leveraging MobileNetV2 and DeiT Tiny as backbone architectures, with an adaptation layer facilitating the transition from CNN-extracted local features to Transformer processing. To improve interpretability, we incorporate Gradient-weighted Class Activation Mapping (Grad-CAM) for visual explanations of model predictions. Furthermore, we introduce expert-driven qualitative validation, where pathologists annotate ground truth to systematically assess the alignment between model-generated saliency maps and clinically relevant diagnostic regions, establishing a high-quality benchmark for interpretability evaluation. Extensive experiments on skin and lung cancer datasets demonstrate that MDeiT consistently outperforms state-of-the-art models across multiple metrics while maintaining computational efficiency. The results demonstrate its effectiveness in capturing both fine-grained tissue details and broader contextual patterns, making it a robust and scalable solution for real-world histopathological image analysis.
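The Grad-CAM step mentioned above can be illustrated with a minimal NumPy sketch. It assumes the activations of one chosen layer and the gradients of the target class score with respect to them are already available as arrays of shape (K, H, W). The abstract does not state which layer MDeiT uses or how these arrays are obtained, so the function name `grad_cam` and the toy inputs are hypothetical and not the authors' implementation.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from feature maps and their gradients, both (K, H, W)."""
    # Channel weights: global average of the gradients over the spatial dims.
    weights = gradients.mean(axis=(1, 2))                    # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # (H, W)
    # Scale to [0, 1] for display; leave an all-zero map unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy input (hypothetical): two 2x2 feature maps with opposite-signed gradients.
activations = np.zeros((2, 2, 2))
activations[0, 0, 0] = 1.0   # channel 0 fires at the top-left
activations[1, 1, 1] = 1.0   # channel 1 fires at the bottom-right
gradients = np.stack([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In this toy case only the top-left location is highlighted, because the channel that fires at the bottom-right has a negative weight and is removed by the ReLU. In a real pipeline the two arrays would typically come from forward and backward hooks in the deep-learning framework, and the heatmap would be upsampled to the input image size before being compared with pathologist annotations.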
Getamesay Haile Dagnaw
Yanming Zhu
Yuhong Wang
Biomedical Signal Processing and Control
Griffith University
Soochow University
First Affiliated Hospital of Soochow University
Dagnaw et al. studied this question.
www.synapsesocial.com/papers/6a0d4e9df03e14405aa99dac — DOI: https://doi.org/10.1016/j.bspc.2026.110620