October 1, 2019

Learning to Assemble Neural Module Tree Networks for Visual Grounding

Abstract

Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of a subject-predicate-object triplet. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion. In particular, we develop a novel modular network called the Neural Module Tree network (NMTree) that regularizes visual grounding along the dependency parsing tree of the sentence: each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated bottom-up where needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which account for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms state-of-the-art methods on several benchmarks. Qualitative results show explainable grounding score calculation in great detail.
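The two mechanisms named in the abstract, bottom-up score accumulation along a tree and discrete module choice via straight-through Gumbel-Softmax, can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the two candidate modules ("own scores only" and "own plus children"), and all numeric values are hypothetical, and only the forward pass is shown in plain Python.

```python
import math
import random


def gumbel_softmax_st(logits, tau=1.0, rng=random):
    """Sample a module choice; return (hard one-hot, soft probabilities).

    In an autodiff framework the straight-through trick uses `hard` in the
    forward pass and the gradient of `soft` in the backward pass, e.g.
    hard + soft - stop_gradient(soft). Only the forward pass is shown here.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u clipped away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft


class Node:
    """Hypothetical dependency-tree node (names are assumptions)."""

    def __init__(self, word, region_scores, module_logits, children=()):
        self.word = word
        self.region_scores = region_scores  # one score per image region
        self.module_logits = module_logits  # [own-only, own-plus-children]
        self.children = list(children)


def ground(node, rng=random):
    """Accumulate per-region grounding scores from the leaves to the root."""
    child_scores = [ground(child, rng) for child in node.children]
    own = node.region_scores
    summed = [s + sum(cs[i] for cs in child_scores) for i, s in enumerate(own)]
    hard, _ = gumbel_softmax_st(node.module_logits, rng=rng)
    candidates = [own, summed]
    # The one-hot sample selects exactly one candidate module output.
    return [
        sum(h * cand[i] for h, cand in zip(hard, candidates))
        for i in range(len(own))
    ]


# Toy tree for "dog on sofa" over two regions; logits are extreme so the
# sampled choice is effectively deterministic in this example.
leaf = Node("sofa", [0.5, 0.0], [50.0, -50.0])
root = Node("dog", [0.1, 0.2], [-50.0, 50.0], children=[leaf])
scores = ground(root, random.Random(0))
```

With the toy values above, the leaf keeps its own scores and the root adds them to its own, so the first region ends up with the highest accumulated score. A real system would replace the fixed scores and logits with learned functions of visual and linguistic features.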

Cite this study

Liu et al. (2019) studied this question.

www.synapsesocial.com/papers/6a09644016dfdfe7ed340cc4 — DOI: https://doi.org/10.1109/iccv.2019.00477

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

Neural Baby Talk · 2018 · 448 citations
Effective Approaches to Attention-based Neural Machine Translation · 2015 · 8,563 citations
Statistical Theory of Extreme Values and Some Practical Applications · 1955 · 655 citations
GloVe: Global Vectors for Word Representation · 2014 · 33,579 citations
A Technique for the Measurement of Attitudes · 1932 · 8,161 citations

Authors

Daqing Liu

Hanwang Zhang

Zheng-Jun Zha

Institutions

Nanyang Technological University

University of Science and Technology of China
