Anomaly detection in dynamic real-world environments remains a significant challenge for robotic systems, largely because traditional vision and rule-based methods struggle to interpret complex semantic contexts without task-specific training. Recent advances in Visual Language Models (VLMs) offer new opportunities for robots to perform zero-shot, context-aware perception; however, their practical deployment on mobile robotic platforms is still underexplored. In this study, a quadruped robot autonomously patrolled both indoor and outdoor environments, following a predefined trajectory and schedule, to perform real-time, zero-shot anomaly detection. The robot was equipped with an onboard RGB camera and integrated with a VLM to interpret visual data without any fine-tuning. Anomalies were identified directly through prompt-based semantic reasoning, enabling detection of misplaced objects, structural defects, and environmental hazards. The overall framework was implemented under ROS2 to ensure seamless communication, control, and real-time decision-making. Two inference configurations were evaluated: a lightweight VLM running locally on the robot for on-device processing, and a more powerful cloud-based VLM used for remote inference. Experimental results show that both configurations effectively identified diverse anomalies, demonstrating the feasibility of combining quadruped platforms with VLM-based zero-shot perception for continuous monitoring in dynamic environments. Furthermore, a comprehensive benchmarking study was conducted across multiple state-of-the-art VLMs (Gemini-2.5-Pro, GPT-5, Claude Sonnet 4.5, Gemma3-27B, Qwen3-VL-30B, and LLaVA-13B) together with human evaluators. This comparison enabled systematic assessment of model behavior and alignment with human judgment, providing deeper insights into the strengths and limitations of modern VLMs for embodied perception tasks.
Aydogmus et al. (Thu,) studied this question.