What question did this study set out to answer?

To improve detection of mechanical, electrical, and plumbing assets by leveraging unlabelled data and pre-trained models.

April 15, 2026 (Open Access)

A Vision Language-Based Framework for Detecting Industrial Mechanical, Electrical, and Plumbing Assets Using Unlabelled Data

Key Points

Utilized vision language models for asset detection
Employed close-set object detectors
Conducted experiments with Grounding DINO using Swin B transformer and YOLOv8
Evaluated performance using metrics like mean intersection over union and mean average precision
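Grounding DINO (Swin B) achieved mIoU of 0.6586 for valve detection and 0.4883 for pump detection in open-vocabulary detection
The combination of Grounding DINO (Swin B) and YOLOv8 outperformed other configurations
This combination attained mAP50 of 0.928 and mAP50:95 of 0.889 for valve detection, and mAP50 of 0.778 and mAP50:95 of 0.662 for pump detection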
Overall, the framework demonstrated strong performance in detecting unseen MEP assets
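
For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of how intersection over union and its mean can be computed for axis-aligned bounding boxes. The box format `(x_min, y_min, x_max, y_max)` and the assumption that each predicted box has already been paired with one reference box are illustrative choices; the source does not state how the study matched predictions to reference annotations.

```python
from typing import Sequence

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted: Sequence[Box], reference: Sequence[Box]) -> float:
    """Average IoU over already-matched (predicted, reference) pairs.

    The pairing itself is assumed to be given; it is not part of this sketch.
    """
    if len(predicted) != len(reference):
        raise ValueError("expected one predicted box per reference box")
    if not predicted:
        return 0.0
    total = sum(iou(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

An mIoU of 0.6586 therefore means that, on average, a predicted valve box and its reference box share roughly two thirds of their combined area.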

Abstract

There have been significant advancements in object detection using extensive labelled datasets. However, existing learning-based approaches remain constrained in industrial environments, primarily due to the limited diversity of training datasets; the lack of generalisation of close-set detectors to unseen asset categories; and the inherent spatial and geometric complexity of mechanical, electrical, and plumbing (MEP) assets. To address these challenges, we propose a new approach that leverages pre-trained vision language models and close-set object detectors to detect unseen MEP assets using unlabelled data. Experimental results reveal the superior performance of Grounding DINO with a Swin B transformer in open-vocabulary MEP asset detection, achieving a mean intersection over union (mIoU) of 0.6586 for valve detection and 0.4883 for pump detection. In addition, the combination of Grounding DINO (Swin B) and YOLOv8 outperforms other configurations in MEP asset detection, attaining the highest performance for both valve and pump detection. For valves, the mean average precision at IoU = 0.5 (mAP50) is 0.928 and the mean average precision over IoU thresholds from 0.5 to 0.95 (mAP50:95) is 0.889; for pumps, the corresponding values are 0.778 and 0.662. The quantitative and qualitative results of our approach were evaluated against fine-tuned Grounding DINO and fully supervised close-set object detectors.
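
To make the difference between mAP50 and mAP50:95 concrete, here is a minimal sketch of average precision for a single class at one IoU threshold, and of its average over the thresholds 0.50, 0.55, ..., 0.95. It is a simplification under stated assumptions: all boxes are treated as coming from one image, matching is greedy by descending confidence, and the area under the precision-recall curve uses all-point interpolation. The function names are hypothetical, and the study's own evaluation tooling may differ in these details.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: Sequence[Tuple[float, Box]], truths: Sequence[Box], iou_thr: float
) -> float:
    """AP for one class; preds are (confidence, box) pairs from a single image."""
    if not truths:
        return 0.0
    matched = [False] * len(truths)
    hits: List[bool] = []
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        # Find the best still-unmatched ground-truth box for this prediction.
        best_iou, best_idx = 0.0, -1
        for i, truth in enumerate(truths):
            if not matched[i]:
                value = iou(box, truth)
                if value > best_iou:
                    best_iou, best_idx = value, i
        is_hit = best_idx >= 0 and best_iou >= iou_thr
        if is_hit:
            matched[best_idx] = True
        hits.append(is_hit)
    # Running precision and recall after each ranked prediction.
    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += int(hit)
        precisions.append(true_pos / rank)
        recalls.append(true_pos / len(truths))
    # Make precision non-increasing (the precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def map_50_95(preds: Sequence[Tuple[float, Box]], truths: Sequence[Box]) -> float:
    """Mean of AP over IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(average_precision(preds, truths, t) for t in thresholds) / len(thresholds)


# Illustrative data only, not taken from the study.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (20, 20, 29, 29)), (0.3, (50, 50, 60, 60))]
ap50 = average_precision(preds, truths, 0.5)
ap50_95 = map_50_95(preds, truths)
```

In this toy data the second prediction overlaps its target with IoU 0.81, so it counts as correct at thresholds up to 0.80 but not at 0.85 and above. That is why a stricter metric such as mAP50:95 is lower than mAP50 for the same detector, as in the valve (0.889 vs 0.928) and pump (0.662 vs 0.778) results.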

Authors

Masoud Kamali

Behnam Atazadeh

Abbas Rajabifard

Journals

Sensors

Institutions

The University of Melbourne

CRC for Spatial Information
