January 1, 2020

Open Access

TinyBERT: Distilling BERT for Natural Language Understanding

Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT.
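The abstract does not spell out the loss terms, so the following is only a minimal sketch of generic teacher-student knowledge distillation, not the paper's exact Transformer distillation objective. It assumes two illustrative terms: a soft cross-entropy between temperature-softened teacher and student predictions, and a mean squared error between intermediate representations that already share the same dimension. The function names, the `temperature` parameter, and the `alpha` weighting are all hypothetical choices made for this example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def softmax(logits, temperature=1.0):
    """Softmax derived from log_softmax; outputs sum to 1."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's softened prediction against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))


def mse(student_vec, teacher_vec):
    """Mean squared error between two equal-length vectors."""
    if len(student_vec) != len(teacher_vec):
        raise ValueError("vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=1.0, alpha=0.5):
    """Weighted sum of a prediction term and a representation term.

    Assumption: the hidden vectors already have matching sizes. A smaller
    student would normally need a learned projection first, omitted here.
    """
    prediction_term = soft_cross_entropy(student_logits, teacher_logits, temperature)
    representation_term = mse(student_hidden, teacher_hidden)
    return alpha * prediction_term + (1.0 - alpha) * representation_term


# Usage: a student that matches the teacher scores a lower loss than one that does not.
teacher_logits, teacher_hidden = [2.0, 0.5], [0.1, 0.2]
matched = distillation_loss([2.0, 0.5], teacher_logits, [0.1, 0.2], teacher_hidden)
mismatched = distillation_loss([0.5, 2.0], teacher_logits, [0.9, -0.3], teacher_hidden)
```

Under the two-stage framework described above, a loss of this general shape would be minimized once on general-domain text and again on task-specific data; how the stages differ in data and teacher model is not stated in the abstract and is left out of the sketch.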

Authors

Xiaoqi Jiao

Yichun Yin

Lifeng Shang

Institutions

Huazhong University of Science and Technology

Wuhan National Laboratory for Optoelectronics

Huawei Technologies (Sweden)
