The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise, quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leaving their prospects uncertain. To assess models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark that evaluates MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark covers a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.
Li et al. studied this question.
www.synapsesocial.com/papers/68f4b10d3d9d770bbc696d56 — DOI: https://doi.org/10.48550/arxiv.2503.23765
Yun Li
Yiming Zhang
Tao Lin