Deep encoder-decoder models dominate endoscopic segmentation, but they typically rely on full fine-tuning or training from scratch, which is computationally expensive and data-intensive. This paper challenges that reliance by demonstrating that an extremely efficient model, a frozen foundation-model encoder paired with a shallow decoder, can achieve state-of-the-art performance. Our core contribution is a systematic, layer-wise analysis that identifies the single most effective feature source within the encoder’s hierarchy, questioning the common practice of using the final, most abstract layer. We find a distinct performance peak at an intermediate layer (Layer 12), which offers the best trade-off between high-level semantic understanding and the high-resolution spatial fidelity crucial for segmentation. Despite training only 650k parameters, our method surpasses existing benchmarks on the challenging multi-center PolypGen dataset, reaching a Dice score of 0.972. This work provides an evidence-based methodology for efficient feature extraction, significantly lowering the computational and data barriers to developing high-performance clinical AI tools.
Taha et al. (Mon,) studied this question.
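The Dice score and the layer-wise selection described in the abstract can be illustrated with a minimal sketch. It assumes flattened binary masks and picks the encoder layer whose predictions score highest against a reference mask. The function names `dice_score` and `best_layer`, the smoothing term `eps`, and the toy masks are hypothetical and chosen for illustration only; they are not the authors' code or data.

```python
from typing import Dict, Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], eps: float = 1e-7) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # For 0/1 masks, the element-wise product counts overlapping foreground pixels.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps (an assumed smoothing term) keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + eps) / (total + eps)


def best_layer(preds_by_layer: Dict[int, Sequence[int]], target: Sequence[int]) -> int:
    """Return the layer index whose predicted mask has the highest Dice score."""
    scores = {layer: dice_score(pred, target) for layer, pred in preds_by_layer.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy masks, invented purely to exercise the functions.
    reference = [0, 1, 1, 1, 0, 0, 1, 0]
    toy_predictions = {
        6: [0, 1, 0, 0, 0, 0, 1, 1],
        12: [0, 1, 1, 1, 0, 0, 0, 0],
        24: [1, 1, 1, 0, 0, 0, 0, 0],
    }
    for layer, mask in toy_predictions.items():
        print(layer, round(dice_score(mask, reference), 3))
    print("selected layer:", best_layer(toy_predictions, reference))
```

In a real sweep, each entry of `preds_by_layer` would presumably come from a shallow decoder trained on frozen features from that layer, with scores averaged over a validation set rather than computed on a single mask.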