March 3, 2026 Open Access

Zero-Shot Visual Anomaly Detection on a Quadruped Robot Using State of the Art Visual Language Models

Key Points

Anomaly detection was successfully implemented using visual language models in a dynamic environment, enhancing robotic perception.
Experimental evaluations confirmed both local and cloud-based models effectively identified various anomalies, achieving real-time results.
Real-time analysis was carried out through the integration of a quadruped robot with an onboard RGB camera and visual language models.
Findings highlight the potential for improving robotic systems with zero-shot perception, while examining limitations of current models.

Abstract

Anomaly detection in dynamic real-world environments remains a significant challenge for robotic systems, largely because traditional vision and rule-based methods struggle to interpret complex semantic contexts without task-specific training. Recent advances in Visual Language Models (VLMs) offer new opportunities for robots to perform zero-shot, context-aware perception; however, their practical deployment on mobile robotic platforms is still underexplored. In this study, a quadruped robot autonomously patrolled both indoor and outdoor environments, following a predefined trajectory and schedule, to perform real-time, zero-shot anomaly detection. The robot was equipped with an onboard RGB camera and integrated with a VLM to interpret visual data without any fine-tuning. Anomalies were identified directly through prompt-based semantic reasoning, enabling detection of misplaced objects, structural defects, and environmental hazards. The overall framework was implemented under ROS2 to ensure seamless communication, control, and real-time decision making. Two inference configurations were evaluated: a lightweight VLM running locally on the robot for on-device processing, and a more powerful cloud-based VLM used for remote inference. Experimental results show that both configurations effectively identified diverse anomalies, demonstrating the feasibility of combining quadruped platforms with VLM-based zero-shot perception for continuous monitoring in dynamic environments. Furthermore, a comprehensive benchmarking study was conducted across multiple state-of-the-art VLMs (Gemini-2.5-Pro, GPT-5, Claude Sonnet 4.5, Gemma3-27B, Qwen3-VL-30B, and LLaVA-13B) together with human evaluators. This comparison enabled systematic assessment of model behavior and alignment with human judgment, providing deeper insights into the strengths and limitations of modern VLMs for embodied perception tasks.
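
As a rough illustration of the prompt-based detection step described in the abstract, the sketch below shows how one camera frame might be passed to an interchangeable VLM backend (local or cloud) and how the reply might be parsed. It is a minimal sketch under stated assumptions, not the paper's implementation: the prompt wording, the JSON reply format, and the names `PROMPT`, `Detection`, `Backend`, `parse_reply`, `detect`, and `fake_local_backend` are all hypothetical, and the ROS2 integration is omitted.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt; the abstract does not give the actual wording used.
PROMPT = (
    "You are inspecting a frame from a patrol robot's camera. Report anything "
    "unusual, such as misplaced objects, structural defects, or environmental "
    'hazards. Reply with JSON only: {"anomaly": true|false, "description": "..."}'
)


@dataclass
class Detection:
    anomaly: bool
    description: str


# Assumed backend interface: (prompt, image bytes) -> raw text reply.
# A local on-robot model and a cloud model would both be wrapped this way.
Backend = Callable[[str, bytes], str]


def parse_reply(reply: str) -> Detection:
    """Extract the JSON object from a reply that may contain extra text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        # No JSON found: report no anomaly and note the parsing failure.
        return Detection(False, "unparseable reply")
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return Detection(False, "unparseable reply")
    # Only a literal JSON `true` counts as an anomaly.
    return Detection(data.get("anomaly") is True, str(data.get("description", "")))


def detect(frame: bytes, backend: Backend) -> Detection:
    """Run zero-shot anomaly detection on one frame with the given backend."""
    return parse_reply(backend(PROMPT, frame))


def fake_local_backend(prompt: str, image: bytes) -> str:
    """Stand-in for a real VLM call, used only to exercise the sketch."""
    return 'Sure. {"anomaly": true, "description": "chair blocking corridor"}'


if __name__ == "__main__":
    print(detect(b"", fake_local_backend))
```

Because both configurations share one call signature in this sketch, switching between on-device and remote inference would amount to passing a different `backend` function, with the prompt and parsing logic left unchanged.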
