• A hybrid DWS-Swin Transformer for efficient underwater acoustic target recognition
• Introduces multi-stage shifted window attention to model cross-window dependencies
• Achieves 99.15% accuracy, improving 8.59% over DWS using the Vision Transformer model
• Reduces computational cost by 66.3% compared to the standard Vision Transformer model
• Grad-CAM shows interpretable focus on key spectral regions, confirming classification validity

Underwater acoustic target recognition is more complex than conventional recognition because of the intricate underwater environment. Recently, Vision Transformers (ViTs) have achieved state-of-the-art performance in these tasks, but they face limitations that hinder widespread adoption. Vision Transformers rely on global self-attention, which results in quadratic computational complexity and limits efficiency with high-resolution inputs. Their dependence on large-scale datasets also restricts their use in areas with limited annotated data. To address these challenges, a novel approach combines a depth-wise separable convolutional neural network with the MSW-MSA Swin Transformer model. The method extracts a concise spatial feature representation using a depth-wise separable convolutional neural network and increases the number of data samples. The MSW-MSA Swin Transformer uses hierarchical window-based attention with a two-stage shifted window-based attention mechanism. The proposed method reduces computational overhead while maintaining the ability to capture both local details and global dependencies. The proposed method achieved a recognition accuracy of 99.15% on the ShipsEar dataset and resulted in a 66.3% reduction in computational cost compared to ViT.
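The two efficiency arguments in the abstract (window-based attention instead of global self-attention, and depth-wise separable instead of standard convolution) can be illustrated with a rough cost count. The sketch below is not the authors' implementation and does not reproduce the reported 66.3% figure. It uses the attention complexity estimates from the original Swin Transformer paper and the usual parameter count for depth-wise separable convolution. The function names, the grid size, the channel widths, and the window size are all hypothetical values chosen for illustration.

```python
def global_attention_cost(h, w, c):
    """Approximate cost of one global self-attention layer.

    h, w: token grid height and width; c: channel dimension.
    The second term grows with (h*w)**2, i.e. quadratically in token count.
    """
    n = h * w
    return 4 * n * c * c + 2 * n * n * c


def window_attention_cost(h, w, c, m):
    """Approximate cost of one window-based attention layer.

    Attention is computed inside non-overlapping m x m windows, so the
    second term grows linearly in the token count for a fixed m.
    """
    n = h * w
    return 4 * n * c * c + 2 * m * m * n * c


def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_conv_params(k, c_in, c_out):
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    return k * k * c_in + c_in * c_out


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the paper.
    h = w = 56
    c = 96
    m = 7
    g = global_attention_cost(h, w, c)
    win = window_attention_cost(h, w, c, m)
    print(f"global attention cost: {g}")
    print(f"window attention cost: {win}")
    print(f"attention cost ratio:  {win / g:.3f}")

    std = standard_conv_params(3, 32, 64)
    dws = depthwise_separable_conv_params(3, 32, 64)
    print(f"standard conv weights:            {std}")
    print(f"depth-wise separable conv weights: {dws}")
```

Window attention alone cannot relate tokens in different windows, which is why Swin-style models alternate regular and shifted window partitions; the cost estimate per layer is the same in both cases.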
Ghate et al. (Sun,) studied this question.