August 14, 2024 · Open Access

LLMI3D: Empowering Large Language Models with 3D Perception from a Single 2D Image

Abstract

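Recent advances in autonomous driving, augmented reality, robotics, and embodied intelligence have driven demand for 3D perception algorithms. However, current 3D perception methods, particularly small models, struggle with logical reasoning, question answering, and open-scenario categories. Generative multimodal large language models (MLLMs), on the other hand, excel in general capability but underperform on 3D tasks because of weak spatial and local object perception, poor text-based output of geometric values, and an inability to handle variations in camera focal length. To address these challenges, we propose three components: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling variations in camera focal length. We apply parameter-efficient fine-tuning to a pre-trained MLLM to develop LLMI3D, a powerful 3D perception MLLM. We also construct the IG3D dataset, which provides fine-grained descriptions and question-answer annotations. Extensive experiments demonstrate that LLMI3D achieves state-of-the-art performance, significantly outperforming existing methods.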

Authors

Fan Yang

Sicheng Zhao

Yanhao Zhang
