What question did this study set out to answer?

March 30, 2026

Prompt is All You Need: Prompting Foundation Models for Large-scale Self-supervised Semantic Segmentation

Key Points

This research aims to improve large-scale unsupervised semantic segmentation by utilizing foundation models.
Develops a cascade framework, PLUSS_alpha, that combines CLIPS, Grounding DINO, and SAM for zero-shot segmentation.
Introduces PLUSS_beta, which adds a semantic tuner for better category discrimination and a box tuner for improved localization.
Both tuners draw on knowledge already within the foundation models, so they need no external supervision.
PLUSS_beta outperforms the previous state-of-the-art method by 39.6%, 27.3%, and 22.6% in mIoU for 50, 300, and 919 categories, respectively.
The approach shows robust category-shape representation across different object sizes and dataset scales.
It also generalizes well to open-vocabulary tasks.
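
The cascade idea in the key points (a classifier proposes category names, a grounded detector turns each name into boxes, and a promptable segmenter turns each box into a mask) can be sketched roughly as below. This is only an illustrative outline, not the authors' code: `cascade_segment`, `Segment`, the three callables, and `top_k` are hypothetical names, and the real models are passed in as plain functions rather than called through any actual library API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical box convention for this sketch: (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


@dataclass
class Segment:
    label: str  # category name proposed by the classifier stage
    box: Box    # box proposed by the detector stage for that label
    mask: Any   # whatever the segmenter stage returns for that box


def cascade_segment(
    image: Any,
    class_names: Sequence[str],
    classify: Callable[[Any, Sequence[str]], List[str]],
    detect: Callable[[Any, str], List[Box]],
    segment: Callable[[Any, Box], Any],
    top_k: int = 1,
) -> List[Segment]:
    """Chain three stages: text labels -> boxes -> masks.

    `classify` is assumed to return labels ranked best-first,
    `detect` to return boxes for one label, and `segment` to
    return one mask per box. All three are stand-ins.
    """
    labels = classify(image, class_names)[:top_k]  # semantic prompts
    results: List[Segment] = []
    for label in labels:
        for box in detect(image, label):           # spatial prompts
            results.append(Segment(label, box, segment(image, box)))
    return results
```

Passing the stages as callables keeps the sketch independent of any particular model package; swapping in real CLIP-style, Grounding DINO, or SAM wrappers would only require adapting each wrapper to these assumed signatures.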

Abstract

This paper addresses the important and challenging task of large-scale unsupervised semantic segmentation (LUSS). We present the first attempt to unleash the power of foundation models (FMs) for this dense prediction task, and our main objective is to present simple, effective, and efficient solutions for LUSS, namely Prompting foundation models for LUSS (PLUSS). First, we propose a cascade framework, PLUSS_alpha, that combines CLIPS, Grounding DINO, and SAM in a zero-shot manner. This cascade architecture automatically generates semantic and spatial prompts for SAM, establishing a strong baseline that significantly outperforms previous state-of-the-art methods. Building on this foundation, we propose PLUSS_beta, which addresses the critical bottleneck of prompt quality through two novel tuner modules: a semantic tuner that enhances fine-grained category discrimination via visual prompt tuning, and a box tuner that improves object localization through cross-modal feature fusion. Both tuners are optimized using the knowledge already present within the foundation models themselves, deriving self-supervised signals from internal model consistency. This approach requires no external supervision and no updates to the foundation models' parameters. Extensive experiments on the ImageNet-S benchmarks demonstrate that PLUSS_beta surpasses the previous best method by 39.6%, 27.3%, and 22.6% in mIoU for 50, 300, and 919 categories, respectively. Our approach exhibits robust category-shape representation across varying object sizes and dataset scales, while maintaining strong generalization capabilities for open-vocabulary tasks. The proposed framework provides a solid baseline for adapting foundation models to downstream vision tasks. Code is available at https://github.com/Miss-Jo/PLUSSbeta.
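
The results above are reported in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch of the standard per-class definition on flattened label maps. The function name `mean_iou` and the choice to average only over classes that appear in the prediction or the ground truth are assumptions of this sketch; the benchmark's official evaluation protocol may differ in details such as how absent classes or ignored pixels are handled.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes present in `pred` or `target`.

    `pred` and `target` are flattened per-pixel class indices
    of equal length, each in the range [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    inter: List[int] = [0] * num_classes  # pixels where both say class c
    union: List[int] = [0] * num_classes  # pixels where either says class c
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A disagreeing pixel belongs to the union of both classes.
            union[p] += 1
            union[t] += 1

    # Skip classes that never occur, so they do not distort the mean.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.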

Cite this study

Su et al. studied this question.

synapsesocial.com/papers/69ca134b883daed6ee095359 — DOI: https://doi.org/10.1109/tpami.2026.3673339

Authors

Jiaojiao Su

Central South University

Qin Luo

Chinese Academy of Medical Sciences & Peking Union Medical College

Shuzhou Sun

National University of Defense Technology

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

University of Oulu

Central South University

National University of Defense Technology
