Human-Object Interaction (HOI) detection is a challenging task in computer vision, particularly in complex scenes involving multiple humans and interactions. In this paper, we propose the Hierarchical Tuple-based Contextual Correlations Learning (HTCCL) model, which aims to enhance HOI detection by systematically capturing multi-level contextual relationships. Our approach decomposes an interaction into three hierarchical levels: entity, action, and event. We introduce a heterogeneous graph network with a multi-branch Transformer architecture, in which human and object entities are treated as distinct nodes, facilitating fine-grained relational reasoning. Furthermore, we leverage the Contrastive Language-Image Pre-training (CLIP) model to embed interaction cues into queries, which are subsequently refined through local and global contextual aggregation modules. The proposed model integrates contextual information across these levels, improving its ability to detect complex interactions in diverse scenes. Extensive evaluations on standard benchmarks demonstrate that HTCCL achieves state-of-the-art performance in HOI detection, particularly in scenarios with high relational complexity.
Xin Hu
Ke Qin
Tao He
Tsinghua Science & Technology
Hu et al. (Wed,) studied this question.
www.synapsesocial.com/papers/69df2b85e4eeef8a2a6b0752 — DOI: https://doi.org/10.26599/tst.2025.9010025