Vision-language models (VLMs) offer transformative potential for robotics, but their deployment is constrained by performance limitations. In safety-critical manipulation, a model must recognize its own limitations to prevent catastrophic failures. We conduct a systematic study of VLMs for robotic failure detection, evaluating six architectures on real-world trajectories. We propose a decision-making process in which a VLM evaluates whether it can successfully complete a task and, if it cannot, pauses its operation and hands the task over to human operators. Our results show that well-calibrated VLMs can be trustworthy partners that know exactly when to ask for help.
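The abstract does not specify how the hand-over decision is made, so the following is only a minimal sketch of one plausible reading: the model reports a self-estimated probability of success, and the robot proceeds only when that estimate clears a threshold. The names `Assessment`, `decide`, and `Action`, as well as the threshold value of 0.8, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    HAND_OVER = "hand_over"


@dataclass
class Assessment:
    task: str
    # Hypothetical field: the model's self-estimated chance of completing the task.
    success_probability: float


def decide(assessment: Assessment, threshold: float = 0.8) -> Action:
    """Proceed only if the estimated success probability reaches the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 <= assessment.success_probability <= 1.0:
        raise ValueError("success_probability must lie in [0, 1]")
    if assessment.success_probability >= threshold:
        return Action.PROCEED
    # Below the threshold: pause and defer to a human operator.
    return Action.HAND_OVER
```

A rule of this kind is only as reliable as the probability it consumes, which may be why the abstract stresses calibration: if the model's confidence does not track its actual success rate, no threshold separates the cases that are safe to attempt from those that should be handed over.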
Chowdhury et al. studied this question.
www.synapsesocial.com/papers/69e1cfe05cdc762e9d858eec — DOI: https://doi.org/10.1109/mpuls.2026.3659245
Md Sameer Iqbal Chowdhury
Tsz-Chiu Au