Object detection has advanced significantly through the use of extensive labelled datasets. However, existing learning-based approaches remain constrained in industrial environments, primarily because of the limited diversity of training datasets, the poor generalisation of closed-set detectors to unseen asset categories, and the inherent spatial and geometric complexity of mechanical, electrical, and plumbing (MEP) assets. To address these challenges, we propose a new approach that leverages pre-trained vision language models and closed-set object detectors to detect unseen MEP assets using unlabelled data. Experimental results show that Grounding DINO with a Swin B transformer backbone performs best in open-vocabulary MEP asset detection, achieving a mean intersection over union (mIoU) of 0.6586 for valve detection and 0.4883 for pump detection. In addition, the combination of Grounding DINO (Swin B) and YOLOv8 outperforms the other configurations in MEP asset detection. For valve detection, it attains a mean average precision at IoU = 0.5 (mAP50) of 0.928 and a mean average precision over IoU thresholds from 0.5 to 0.95 (mAP50:95) of 0.889; for pump detection, the corresponding values are 0.778 and 0.662. The quantitative and qualitative results of our approach were evaluated against a fine-tuned Grounding DINO and fully supervised closed-set object detectors.
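The mIoU figures above rest on the intersection over union between a predicted box and a ground-truth box. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and a one-to-one pairing of predictions with ground-truth boxes; the abstract does not state the matching protocol, and the function names `iou` and `mean_iou` are illustrative.

```python
from typing import Sequence

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero total area are scored as 0.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average IoU over already-paired prediction/ground-truth boxes.

    Assumption: preds[i] is the prediction matched to gts[i].
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must be paired one-to-one")
    if not preds:
        return 0.0
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Under the same convention, mAP50 counts a detection as correct when its IoU with a ground-truth box is at least 0.5, while mAP50:95 averages the result over IoU thresholds from 0.5 to 0.95.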
Masoud Kamali
Behnam Atazadeh
Abbas Rajabifard
Sensors
The University of Melbourne
CRC for Spatial Information
Kamali et al. studied this question.
www.synapsesocial.com/papers/69df2c62e4eeef8a2a6b17a0 — DOI: https://doi.org/10.3390/s26082379