What does this research mean for the field?

Integrating semantic clinical text with dermoscopic images using a text-guided cross-attentive multimodal learning architecture (TG-CAVNet) significantly improves the diagnostic accuracy and explainability of automated skin lesion detection. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

To improve automated skin lesion detection by integrating clinical text with dermoscopic images using a multimodal approach.

April 10, 2026 (Open Access)

Text guided cross attentive multimodal learning with visual feature modulation for automated skin lesion detection

Key Points

Aimed to improve automated skin lesion detection by combining clinical text with dermoscopic images in a multimodal model.
Developed a Text-Guided Cross-Attentive Visual Feature Network (TG-CAVNet).
Utilized Bio-ClinicalBERT for encoding clinical text and EfficientNet-B4 for visual feature extraction.
Employed text-guided feature modulation and cross-attention for merging data types.
Trained the model end-to-end with hybrid loss functions.
Achieved 90.75% accuracy on a multimodal dermoscopic dataset of 6194 aligned image-text samples.
Reached a macro Jaccard score of 0.82, outperforming state-of-the-art multimodal baselines.
Ablation studies confirmed the importance of each model component for effectiveness.
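
The training objective and the evaluation metric named in the key points can be illustrated with a minimal sketch. It assumes the hybrid loss is a weighted sum of focal loss and cross-entropy, and that the macro Jaccard score is the unweighted mean of per-class intersection-over-union on predicted labels. The weighting `alpha`, the focusing parameter `gamma`, and all function names are hypothetical; the source does not state how the two losses are combined.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, so easy samples count less."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def hybrid_loss(probs, target, alpha=0.5, gamma=2.0):
    # alpha and gamma are assumed values, not taken from the paper.
    return (alpha * focal_loss(probs, target, gamma)
            + (1.0 - alpha) * cross_entropy(probs, target))

def macro_jaccard(y_true, y_pred):
    """Mean over classes of TP / (TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        scores.append(tp / (tp + fp + fn))
    return sum(scores) / len(scores)

# Toy example: three classes, one misclassified sample.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
toy_score = macro_jaccard(toy_true, toy_pred)
toy_loss = hybrid_loss([0.7, 0.2, 0.1], target=0)
```

Because the macro average weights every class equally, a rare lesion class that is often misclassified lowers the score as much as a common one would.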

Abstract

Automated skin lesion detection is essential for early dermatological diagnosis. Most deep learning algorithms rely on dermoscopic images alone and ignore the clinical context that dermatologists use. This limitation reduces robustness and interpretability, particularly in visually ambiguous cases. This study presents an explainable multimodal cross-attention framework that merges clinical text with dermoscopic images to increase diagnostic accuracy and semantic grounding in automated skin lesion detection. The proposed architecture, the Text-Guided Cross-Attentive Visual Feature Network (TG-CAVNet), uses Bio-ClinicalBERT for clinical text encoding and EfficientNet-B4 for visual feature extraction. The framework applies text-guided channel-wise feature modulation, text-queried cross-attention for semantic-spatial alignment, and adaptive multi-stream fusion to merge complementary representations. The model was trained end-to-end using hybrid focal and cross-entropy losses. On a multimodal dermoscopic dataset of 6194 aligned image-text samples, TG-CAVNet outperforms state-of-the-art multimodal baselines with 90.75% accuracy and a macro Jaccard score of 0.82. Ablation studies confirmed the separate and synergistic contributions of the components, and attention visualizations improved interpretability. Text-guided cross-attentive multimodal learning improves the performance and explainability of automated skin lesion identification. As a robust and clinically interpretable decision-support framework, TG-CAVNet demonstrates the necessity of integrating semantic clinical context with visual analysis in dermatological AI systems.
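
The two text-conditioned operations described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the modulation is assumed to be a FiLM-style per-channel scale and shift predicted from the text embedding, and the cross-attention is assumed to be single-head scaled dot-product attention with the text as query and the spatial positions as keys and values. The weight matrices `W_gamma`, `W_beta`, and `W_q`, the toy dimensions, and all function names are hypothetical. The adaptive multi-stream fusion step is omitted because the abstract gives no detail on it.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def modulate(features, text_vec, W_gamma, W_beta):
    """Scale and shift each channel with values predicted from the text.

    features: list of N spatial positions, each a list of C channel values.
    """
    gamma = matvec(W_gamma, text_vec)
    beta = matvec(W_beta, text_vec)
    return [[g * f + b for f, g, b in zip(pos, gamma, beta)]
            for pos in features]

def text_queried_attention(features, text_vec, W_q):
    """Text embedding forms the query; spatial positions are keys and values."""
    q = matvec(W_q, text_vec)
    d = len(q)
    scores = [sum(qi * fi for qi, fi in zip(q, pos)) / math.sqrt(d)
              for pos in features]
    weights = softmax(scores)
    attended = [sum(w * pos[c] for w, pos in zip(weights, features))
                for c in range(d)]
    return attended, weights

# Toy inputs: 3 spatial positions, 2 channels, 2-dimensional text embedding.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_vec = [1.0, 0.5]
W_gamma = [[2.0, 0.0], [0.0, 2.0]]
W_beta = [[0.5, 0.0], [0.0, 0.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]

modulated = modulate(features, text_vec, W_gamma, W_beta)
attended, weights = text_queried_attention(modulated, text_vec, W_q)
```

The attention weights form a distribution over spatial positions, which is the kind of quantity an attention visualization would overlay on the dermoscopic image.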

Authors

P. Suresh

P. Keerthika

A. R. Nitesh Kumar

Journals

Scientific Reports

Institutions

Vellore Institute of Technology University
