Huawei & Tsinghua U Method Boosts Task-Agnostic BERT Distillation Efficiency by Reusing Teacher Model Parameters

Synced
Published in SyncedReview
May 4, 2021 · 4 min read

Powerful large-scale pretrained language models such as Google's BERT have been a game-changer in natural language processing (NLP) and beyond. These impressive achievements, however, come with huge computational and memory demands, which makes it difficult to deploy such models on resource-constrained devices.

Previous studies have proposed task-agnostic BERT distillation to tackle this issue — an approach that aims to obtain a general small BERT model that can be fine-tuned directly, just like its teacher model (such as BERT-Base). But even task-agnostic BERT distillation is computationally expensive, due to the large-scale corpora involved and the need to perform a forward pass through the teacher model and a forward-backward pass through the student model for every training batch.
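To make the cost structure concrete, the sketch below shows one task-agnostic distillation step in PyTorch. It assumes HuggingFace-style teacher and student encoders that expose `last_hidden_state` and share the same hidden width (e.g., via a projection inside the student), and uses a simple TinyBERT-style hidden-state MSE loss; the function and model names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, batch, optimizer):
    """One task-agnostic distillation step (illustrative sketch).

    Assumes `teacher(**batch)` and `student(**batch)` return objects with
    a `last_hidden_state` tensor of the same shape, and matches them with
    a hidden-state MSE loss in the spirit of TinyBERT-style distillation.
    """
    teacher.eval()
    with torch.no_grad():
        # Teacher cost: a forward pass only.
        t_hidden = teacher(**batch).last_hidden_state

    # Student cost: a forward pass plus the backward pass below.
    s_hidden = student(**batch).last_hidden_state

    loss = F.mse_loss(s_hidden, t_hidden)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Run over a large pretraining corpus, this teacher forward plus student forward-backward loop is what dominates the cost of task-agnostic distillation.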

In the paper Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation, a research team from Huawei Noah's Ark Lab and Tsinghua University proposes Extract Then Distill (ETD), a generic and flexible strategy that reuses teacher model parameters to initialize the student for efficient and effective task-agnostic distillation, and that can be applied to student models of any size.
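The snippet below is a minimal sketch of the parameter-reuse idea: initializing a shallower student from selected teacher layers before distillation begins. It assumes HuggingFace-style BERT models whose encoder layers live in `model.encoder.layer` and share the same hidden width; it only illustrates depth-wise reuse, and is not the paper's exact ETD extraction strategy (which also handles students of arbitrary width).

```python
import torch

def extract_teacher_into_student(teacher, student, layers_to_copy=None):
    """Initialize a shallower student encoder from teacher parameters.

    Illustrative sketch of parameter reuse, assuming both models follow
    the HuggingFace BERT layout and have the same hidden size.
    """
    t_layers = teacher.encoder.layer
    s_layers = student.encoder.layer

    if layers_to_copy is None:
        # Evenly spaced teacher layers, e.g. layers 0, 3, 6, 9 of a
        # 12-layer teacher for a 4-layer student.
        stride = len(t_layers) // len(s_layers)
        layers_to_copy = [i * stride for i in range(len(s_layers))]

    # Reuse the teacher's embedding table directly.
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())

    # Copy each selected teacher layer's weights into the student.
    for s_idx, t_idx in enumerate(layers_to_copy):
        s_layers[s_idx].load_state_dict(t_layers[t_idx].state_dict())
```

Starting the student from extracted teacher parameters rather than a random initialization is what lets the distillation itself converge faster, which is the source of the efficiency gains the paper reports.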
