This paper addresses a limitation of vision-based user equipment (UE) localization: it performs well while the UE is within the camera field of view (FoV) but degrades once the UE moves outside it. To overcome this, we propose a multimodal UE localization approach that jointly exploits 5G radio measurements and visual information from red-green-blue (RGB) frames. Radio, image, and ground-truth data are temporally aligned using their time sample information. Compact visual embeddings are extracted with a pre-trained ResNet50 backbone and fused with selected radio features to form a unified fingerprint, which a multilayer perceptron maps to the 3D UE position. Experiments on the ICASSP 2026 CONVERGE Task 2 dataset show that the proposed multimodal approach achieves low localization error and consistently outperforms single-modality baselines.
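The pipeline described above (temporal alignment, feature fusion, MLP regression) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The nearest-timestamp matching rule, the six radio features, the 2048-dimensional embedding size (the pooled output size of a standard ResNet50), the layer widths, and all function names are assumptions chosen for illustration, and the random arrays stand in for real radio measurements and image embeddings.

```python
import numpy as np

def align_nearest(ref_t, src_t):
    """For each reference timestamp, return the index of the nearest
    source timestamp. Assumes src_t is sorted and has >= 2 entries."""
    ref_t = np.asarray(ref_t, dtype=float)
    src_t = np.asarray(src_t, dtype=float)
    right = np.clip(np.searchsorted(src_t, ref_t), 1, len(src_t) - 1)
    left = right - 1
    pick_left = (ref_t - src_t[left]) <= (src_t[right] - ref_t)
    return np.where(pick_left, left, right)

def build_fingerprints(radio_t, radio_feats, image_t, image_embs):
    """Concatenate each radio sample with its time-matched image embedding."""
    idx = align_nearest(radio_t, image_t)
    return np.concatenate([radio_feats, image_embs[idx]], axis=1)

def init_mlp(sizes, rng):
    """He-initialised weights for a fully connected network (untrained)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, params):
    """ReLU hidden layers followed by a linear 3D position output."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

# Hypothetical toy data: 4 radio samples, 3 camera frames.
rng = np.random.default_rng(0)
N_RADIO = 6      # assumed number of selected radio features
EMB_DIM = 2048   # pooled ResNet50 output size; the paper's may differ
radio_t = np.array([0.00, 0.10, 0.20, 0.30])
image_t = np.array([0.02, 0.21, 0.33])
radio_feats = rng.normal(size=(4, N_RADIO))
image_embs = rng.normal(size=(3, EMB_DIM))  # stand-in for CNN embeddings

fingerprints = build_fingerprints(radio_t, radio_feats, image_t, image_embs)
params = init_mlp([N_RADIO + EMB_DIM, 128, 64, 3], rng)
positions = mlp_forward(fingerprints, params)  # shape (4, 3): x, y, z
```

In practice the MLP weights would be trained against the time-aligned ground-truth positions, and the embeddings would come from the ResNet50 backbone rather than a random generator. Concatenation is only one possible fusion rule and is assumed here because the abstract speaks of a single unified fingerprint.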
Saba et al. (Tue,) studied this question.