April 25, 2022

WebFormer: The Web-page Transformer for Structure Information Extraction

Abstract

Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page, including the product title, description, brand and price. It is an important research topic that has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to the variety of web layout patterns. Limited work has focused on modeling the web layout for extracting the text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their neighboring tokens through graph attention. Second, we construct rich attention patterns between HTML tokens and text tokens, which leverage the web layout for effective attention weight computation. We conduct an extensive set of experiments on SWDE and Common Crawl benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.
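
To make the idea of HTML tokens and text tokens concrete, the following is a minimal sketch, not the paper's implementation. It assumes one HTML token per DOM element and one text token per whitespace-separated word, and it builds a boolean mask marking which text tokens fall inside each DOM node's subtree, which is one plausible way a layout-aware attention pattern could be restricted. The names `DomSerializer` and `html_to_text_mask` and the sample page are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they become nodes but are not pushed
# on the open-element stack. (Illustrative subset, an assumption of this sketch.)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DomSerializer(HTMLParser):
    """Collects DOM nodes (HTML tokens) and words (text tokens) from a page.

    Assumes reasonably well-formed HTML: every non-void start tag is closed.
    """

    def __init__(self):
        super().__init__()
        self.nodes = []        # list of (tag, parent index or None)
        self.text_tokens = []  # list of (word, index of enclosing node)
        self._stack = []       # indices of currently open nodes

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1] if self._stack else None
        self.nodes.append((tag, parent))
        if tag not in VOID_TAGS:
            self._stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return  # ignore text outside any element
        for word in data.split():
            self.text_tokens.append((word, self._stack[-1]))


def html_to_text_mask(nodes, text_tokens):
    """mask[i][j] is True when text token j lies inside DOM node i's subtree."""
    mask = [[False] * len(text_tokens) for _ in nodes]
    for j, (_, node) in enumerate(text_tokens):
        # Walk from the enclosing node up to the root, marking each ancestor.
        while node is not None:
            mask[node][j] = True
            node = nodes[node][1]
    return mask


# Hypothetical shopping-page fragment.
page = "<div><h1>Acme Kettle</h1><span>$25</span></div>"
parser = DomSerializer()
parser.feed(page)
parser.close()
mask = html_to_text_mask(parser.nodes, parser.text_tokens)
```

In this sketch the `div` node covers all three words, while `h1` covers only the title words and `span` only the price. A real model would compute attention weights over learned embeddings; the mask only shows how DOM structure could decide which token pairs are allowed to interact.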

Cite this study

Wang et al. (2022) studied this question.

www.synapsesocial.com/papers/6a0812f71e0fcf4a43e8a48a — DOI: https://doi.org/10.1145/3485447.3512032

Authors

Qifan Wang

Yi Fang

Anirudh Ravula

Institutions

Google (United States)

University of Science and Technology of China

Sun Yat-sen University
