What does this research mean for the field?

The hybrid DWS-Swin Transformer achieves 99.15% accuracy in underwater acoustic target recognition on the ShipsEar dataset, an 8.59% improvement over earlier depth-wise separable (DWS) CNN models.

What question did this study set out to answer?

The study aims to enhance underwater acoustic target recognition using a novel hybrid model that combines DWS CNN and Swin Transformer techniques.

March 16, 2026 | Open Access

MSW-MSA Net: A Hybrid Depthwise Separable CNN and Multi-Stage Swin Transformer for Underwater Acoustic Target Recognition

Key Points

Developed a hybrid model integrating a depth-wise separable CNN and the MSW-MSA Swin Transformer.
Implemented multi-stage shifted window attention for effective feature extraction.
Used Grad-CAM for interpreting model focus on key spectral regions.
Achieved 99.15% accuracy on the ShipsEar dataset, improving by 8.59% over previous DWS models.
Reduced computational cost by 66.3% compared to standard Vision Transformer models.
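
The efficiency of the depth-wise separable (DWS) convolution named in the key points can be illustrated with a parameter count. The sketch below is a minimal illustration and is not taken from the paper: the function names and the layer sizes (64 input channels, 128 output channels, 3x3 kernel) are hypothetical, and bias terms are ignored. It compares a standard convolution with its depth-wise separable factorization (one k x k filter per input channel, followed by a 1x1 pointwise convolution).

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


def reduction_ratio(c_in, c_out, k):
    """Fraction of the standard parameter count that the separable form needs."""
    return depthwise_separable_params(c_in, c_out, k) / standard_conv_params(c_in, c_out, k)


# Hypothetical layer sizes, chosen only for illustration.
print(standard_conv_params(64, 128, 3))        # 73728
print(depthwise_separable_params(64, 128, 3))  # 8768
print(round(reduction_ratio(64, 128, 3), 4))   # 0.1189
```

The ratio simplifies to 1/c_out + 1/k^2, so for 3x3 kernels the separable form needs roughly one ninth of the weights once the output channel count is large. This is a generic property of the factorization; it is separate from the 66.3% cost reduction the study reports against the Vision Transformer.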

Abstract

• A hybrid DWS-Swin Transformer for efficient underwater acoustic target recognition
• Introduces multi-stage shifted window attention to model cross-window dependencies
• Achieves 99.15% accuracy, improving 8.59% over DWS using the Vision Transformer model
• Reduces computational cost by 66.3% compared to the standard Vision Transformer model
• Grad-CAM shows interpretable focus on key spectral regions, confirming classification validity

Underwater acoustic target recognition is more complex than conventional recognition because of the intricate underwater environment. Recently, Vision Transformers (ViTs) have achieved state-of-the-art performance in these tasks, but they face limitations that hinder widespread adoption. Vision Transformers rely on global self-attention, which results in quadratic computational complexity and limits efficiency with high-resolution inputs. Their dependence on large-scale datasets also restricts their use in areas with limited annotated data. To address these challenges, a novel approach combines a depth-wise separable convolutional neural network with the MSW-MSA Swin Transformer model. The method extracts a concise spatial feature representation using a depth-wise separable convolutional neural network and increases the number of data samples. The MSW-MSA Swin Transformer uses hierarchical window-based attention with a two-stage shifted window-based attention mechanism. The proposed method reduces computational overhead while maintaining the ability to capture both local details and global dependencies. The proposed method achieved a recognition accuracy of 99.15% on the ShipsEar dataset and resulted in a 66.3% reduction in computational cost compared to ViT.
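
The contrast between global self-attention and window-based attention can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' MSW-MSA implementation: the cost estimates follow the approximations commonly quoted for the original Swin Transformer design, the helper names are hypothetical, and the 56 x 56 token grid with 96 channels and 7 x 7 windows is an assumed configuration. The second half shows how a cyclic shift before window partitioning places tokens from neighbouring windows into the same window, which is the mechanism that lets shifted-window attention model cross-window dependencies.

```python
def global_attention_flops(h, w, c):
    """Approximate multiply-adds for global self-attention on an h x w token grid."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * tokens**2 * c   # quadratic in token count


def window_attention_flops(h, w, c, m):
    """Approximate multiply-adds when attention is limited to m x m windows."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c   # linear in token count


def cyclic_shift(grid, shift):
    """Roll a 2D list up and left by `shift` positions."""
    rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rows]


def partition_windows(grid, m):
    """Split a 2D list into non-overlapping m x m windows (row-major order)."""
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, m):
        for left in range(0, width, m):
            windows.append([grid[r][col]
                            for r in range(top, top + m)
                            for col in range(left, left + m)])
    return windows


# Assumed configuration: 56 x 56 tokens, 96 channels, 7 x 7 windows.
print(global_attention_flops(56, 56, 96) > window_attention_flops(56, 56, 96, 7))  # True

# Token ids 0..15 on a 4 x 4 grid, partitioned into 2 x 2 windows.
grid = [[r * 4 + col for col in range(4)] for r in range(4)]
print(partition_windows(grid, 2)[0])                   # [0, 1, 4, 5]
print(partition_windows(cyclic_shift(grid, 1), 2)[0])  # [5, 6, 9, 10]
```

After the shift, the first window holds tokens 5, 6, 9 and 10, which belonged to four different windows in the unshifted partition. Alternating the two partitions across layers lets information cross window borders while each attention call stays local. The cost functions are analytic estimates only and do not reproduce the 66.3% reduction reported for the proposed method.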
