Transformer-based architectures have shown significant promise in medical image segmentation because they model long-range contextual relationships well. However, the standard Vision Transformer (ViT) modules used in hybrid networks such as TransUNet struggle to represent fine-grained and coarse features at the same time. To overcome this limitation, this paper introduces Swin-UNet, a hybrid framework that combines a hierarchical Swin Transformer encoder with a U-Net-inspired decoder. The encoder uses shifted-window self-attention for efficient local-global feature learning, while the decoder integrates residual convolutional paths and multi-scale patch embeddings for improved reconstruction and scale robustness. Evaluated on the Synapse multi-organ CT dataset, the model achieves competitive Dice scores and lower Hausdorff distances than U-Net and TransUNet, highlighting its potential as a robust and generalizable approach to medical image segmentation. These results suggest that Swin-UNet balances computational efficiency with segmentation accuracy, offering a strong foundation for future medical imaging applications.
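The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes binary 2D masks stored as nested lists and computes the plain (maximum) Hausdorff distance in pixel units. The function names and the toy masks are hypothetical and are not taken from the paper, whose exact evaluation protocol (for example, a percentile variant of the Hausdorff distance or per-organ averaging) is not stated here.

```python
import math


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (nested lists of 0/1)."""
    inter = sum(p and t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def foreground(mask):
    """Return the (row, col) coordinates of all foreground pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the foreground pixel sets."""
    a, b = foreground(pred), foreground(target)
    if not a or not b:
        # Assumption: the distance is treated as infinite if either mask is empty.
        return math.inf
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy masks: the prediction misses the bottom row of the target region.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
target = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0]]

print(dice_score(pred, target))          # 0.8
print(hausdorff_distance(pred, target))  # 1.0
```

A higher Dice score means better region overlap, while a lower Hausdorff distance means the predicted boundary stays closer to the reference boundary. This is why the abstract reports the two together.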
Xiaodong Li studied this question.
www.synapsesocial.com/papers/69df2c88e4eeef8a2a6b1bba — DOI: https://doi.org/10.1051/itmconf/20268401003/pdf
Xiaodong Li
Lanzhou University