Transformer-based architectures have demonstrated significant promise in medical image segmentation due to their strong ability to model long-range contextual relationships. However, standard Vision Transformer (ViT) modules used in hybrid networks such as TransUNet are limited in representing both fine-grained and coarse features effectively. To overcome this limitation, this paper introduces Swin-UNet, a hybrid framework that combines the hierarchical Swin Transformer encoder with a U-Net-inspired decoder. The encoder utilizes shifted-window self-attention for efficient local-global feature learning, while the decoder integrates residual convolutional paths and multi-scale patch embeddings for improved reconstruction and scale robustness. Evaluated on the Synapse multi-organ CT dataset, the model achieves competitive Dice scores and lower Hausdorff distances compared to U-Net and TransUNet, highlighting its potential as a robust and generalizable approach for medical image segmentation. These results suggest that the Swin-UNet effectively balances computational efficiency with segmentation accuracy, offering a strong foundation for future medical imaging applications.
Building similarity graph...
Analyzing shared references across papers
Loading...
Xiaodong Li
Lanzhou University
Building similarity graph...
Analyzing shared references across papers
Loading...
Xiaodong Li (Mon,) studied this question.
www.synapsesocial.com/papers/69df2c88e4eeef8a2a6b1bba — DOI: https://doi.org/10.1051/itmconf/20268401003/pdf