ABSTRACT Open‐vocabulary 3D querying based on 3D Gaussian splatting (3DGS) shows great promise for giving AI systems accurate 3D query capabilities. Existing methods typically rely on pre‐captured multi‐view images to enable natural language interaction with 3D scenes. In practice, when an embodied AI agent encounters an unexplored scene, it is difficult to obtain observations from different viewpoints beforehand. This challenge highlights the importance of natural language‐driven 3D scene querying from a single current viewpoint. This paper proposes single view language Gaussian splatting (SVLGaussian) for a novel task: open‐vocabulary 3D querying from a single input view. By leveraging multi‐round inference of multimodal large language models, SVLGaussian efficiently generates pixel‐level semantic probabilities and rapidly embeds them into a 3D Gaussian field, enabling real‐time language‐guided semantic querying. To verify our model, we annotated three datasets: Lerf_ovs and 3D‐OVS, which are tailored for open‐vocabulary 3D querying, and RE10K, which is adapted for single‐view 3D reconstruction. Both quantitative and qualitative results show that our method effectively supports open‐vocabulary 3D querying from a single view.
Wang et al. (Sat,) studied this question.