Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT.
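As a rough illustration of the teacher-to-student transfer described in the abstract, the sketch below computes a generic soft-label knowledge distillation loss on output logits. It is not the paper's Transformer distillation objective, which the abstract does not spell out. The function names, the example logits, and the temperature value are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's predictions against the teacher's soft targets.

    Hypothetical helper: a generic KD loss, not taken from the paper.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


if __name__ == "__main__":
    # Made-up logits for a 3-class task (illustrative values only).
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 1.0, 0.0]
    print(soft_cross_entropy(student, teacher, temperature=2.0))
```

Minimizing this quantity with respect to the student's parameters pulls the student's output distribution toward the teacher's; a higher temperature softens both distributions so that low-probability classes contribute more to the loss.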
Xiaoqi Jiao
Yichun Yin
Lifeng Shang
Huazhong University of Science and Technology
Wuhan National Laboratory for Optoelectronics
Huawei Technologies (Sweden)
Jiao et al. studied this question.
www.synapsesocial.com/papers/69d8ab72a5ecc596b5d1829a — DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.372