Power-grid field operations demand real-time visual monitoring to verify the use of personal protective equipment and tools in scenes with a large depth of field. Conventional real-time detectors are efficient but closed-vocabulary, so they struggle with rare or unseen objects. Large multimodal models (LMMs) offer prompt-guided open-vocabulary understanding, yet are too heavy for edge deployment. To address these challenges, we propose an LMM-guided distillation framework that transfers prompt-grounded semantics from a large teacher to a lightweight YOLO-style student. The teacher, queried with an expanded prompt set, produces pseudo labels and region–text embeddings. The student is trained with a standard detection objective and three semantic transfers: first, feature distillation aligns student features to teacher region embeddings via a linear projector; second, prompt-aware logit distillation matches student logits to the teacher's temperature-smoothed prompt distribution; and third, vision–language contrastive alignment ties projected student regions to the correct prompt embedding. Experiments on two benchmark datasets indicate consistent gains on both common and rare categories while retaining real-time throughput on edge hardware, demonstrating a practical cloud-to-edge pipeline for safety monitoring.
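The three semantic transfers named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the abstract gives no exact loss forms, so the mean-squared-error feature loss, the KL-divergence logit loss, the InfoNCE-style contrastive loss, the weighted-sum combination, and every function name, temperature, and weight below are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def project(feature, weight):
    """Linear projector: weight is a (d_teacher x d_student) matrix as nested lists."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weight]


def feature_distill_loss(student_feat, teacher_emb, weight):
    """Assumed form: MSE between the projected student feature and the teacher region embedding."""
    projected = project(student_feat, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_emb)) / len(teacher_emb)


def logit_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed form: KL(teacher || student) over temperature-smoothed prompt distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cosine(a, b, eps=1e-12):
    """Cosine similarity; eps guards against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def contrastive_loss(region_emb, prompt_embs, target_index, tau=0.07):
    """Assumed form: InfoNCE-style cross-entropy that pulls a projected student
    region toward the correct prompt embedding and away from the others."""
    sims = [cosine(region_emb, prompt) for prompt in prompt_embs]
    probs = softmax(sims, tau)
    return -math.log(probs[target_index])


def total_loss(det_loss, feat_loss, logit_loss, contrast_loss, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combination: detection objective plus a weighted sum of the three transfers."""
    w_feat, w_logit, w_con = weights
    return det_loss + w_feat * feat_loss + w_logit * logit_loss + w_con * contrast_loss


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]      # toy projector
    prompts = [[1.0, 0.0], [0.0, 1.0]]       # toy prompt embeddings
    feat = feature_distill_loss([0.9, 0.1], [1.0, 0.0], identity)
    logit = logit_distill_loss([2.0, 0.5], [3.0, 0.0])
    con = contrastive_loss(project([0.9, 0.1], identity), prompts, target_index=0)
    print(total_loss(1.0, feat, logit, con))
```

In a real training loop these terms would be computed batch-wise on tensors inside a deep-learning framework, with the projector learned jointly with the student; the scalar version here is only meant to make the role of each term concrete.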
Bingyang Li
Xiangyang Zhang
Lin Li
Journal of Cloud Computing: Advances, Systems and Applications
Li et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69ada962bc08abd80d5bca84 — DOI: https://doi.org/10.1186/s13677-026-00872-y