Unmanned aerial vehicles (UAVs) are increasingly deployed to assist humans in diverse tasks, where understanding human intentions is critical to effective collaboration. Referring expression comprehension (REC) links language to visual targets, allowing UAVs to recognize the targets a human intends and thereby supporting subsequent actions. However, existing REC research is almost exclusively confined to ground-based scenarios, leaving aerial scenarios largely unexplored. In this paper, we formally define UAV-based REC as a new research problem and highlight its unique challenges, including abundant background interference, small target size, and complex referring relations. To enable systematic study, we introduce SkyFind, a large-scale dataset with one million high-quality target-expression pairs. In addition, we propose AerialREC, a baseline framework that reduces background interference in UAV imagery by searching for a potential target region before localization. We establish benchmark results on SkyFind using ten representative REC methods and validate the effectiveness of the AerialREC framework.
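The search-then-localize idea can be illustrated with a minimal sketch. It is not the AerialREC implementation, whose details the abstract does not give. It assumes a hypothetical first stage has already proposed a candidate region and a hypothetical second stage has predicted a box inside the crop of that region; the code only maps that box back to full-image coordinates and scores it with intersection-over-union (IoU), a common REC metric. Whether SkyFind uses IoU is not stated here. The names `crop_to_image` and `iou` are illustrative.

```python
from typing import Tuple

# A box is (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def crop_to_image(box_in_crop: Box, region: Box) -> Box:
    """Shift a box predicted inside a cropped region back to full-image coordinates.

    Assumes the crop was taken at `region` without rescaling (an assumption
    made for this sketch only).
    """
    rx1, ry1, _, _ = region
    x1, y1, x2, y2 = box_in_crop
    return (x1 + rx1, y1 + ry1, x2 + rx1, y2 + ry1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical values: a candidate region in a large aerial frame,
# and a small target box predicted within the crop of that region.
region = (100.0, 200.0, 500.0, 600.0)
predicted_in_crop = (10.0, 20.0, 30.0, 40.0)
predicted = crop_to_image(predicted_in_crop, region)
```

Restricting the second stage to the candidate region means a small target occupies a larger share of the input, which is one plausible reading of how a region search could reduce background interference.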
Kai Wang
Guanbo Wu
Xueyang Fu
IEEE Transactions on Pattern Analysis and Machine Intelligence
University of Science and Technology of China
Wang et al. studied this question.
www.synapsesocial.com/papers/69d892d16c1944d70ce040a4 — DOI: https://doi.org/10.1109/tpami.2026.3681112