# Papers Explained 05: Tiny BERT

Knowledge Distillation (KD) aims to transfer the knowledge of a large teacher network T to a small student network S. Let f^T and f^S represent the behavior functions of the teacher and student networks, respectively.

In the context of Transformer distillation, the output of the MHA layer or the FFN layer, or some intermediate representations (such as the attention matrix A), can be used as the behavior function. Formally, KD can be modeled as minimizing the following objective function:

L_KD = Σ_{x ∈ X} L( f^S(x), f^T(x) )

where L(·) is a loss function that evaluates the difference between teacher and student networks, x is the text input, and X denotes the training dataset.

Assuming that the student model has M Transformer layers and the teacher model has N Transformer layers, we start by choosing M out of the N layers of the teacher model for the Transformer-layer distillation. Then a function n = g(m) is defined as the mapping between indices, so that the m-th student layer learns from the g(m)-th teacher layer.
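For instance, the uniform mapping strategy used in the paper (g(m) = 3m for TinyBERT4, g(m) = 2m for TinyBERT6) can be sketched as:

```python
def g(m: int, N: int = 12, M: int = 4) -> int:
    """Uniform layer mapping: the m-th student layer (1-indexed)
    distills from the g(m) = m * N / M teacher layer."""
    assert 1 <= m <= M
    return m * N // M

# TinyBERT4 (M=4) learns from teacher layers 3, 6, 9, 12 of BERT-Base (N=12)
print([g(m) for m in range(1, 5)])  # -> [3, 6, 9, 12]
```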

Thus, the student can acquire knowledge from the teacher by minimizing the following objective:

L_model = Σ_{x ∈ X} Σ_{m=0}^{M+1} λ_m L_layer( f_m^S(x), f_{g(m)}^T(x) )

where *L*layer refers to the loss function of a given model layer (index 0 denotes the embedding layer and index M + 1 the prediction layer), and *λm* is a hyperparameter that represents the importance of the m-th layer's distillation.

**Transformer Layer Distillation**

The attention-based distillation encourages the student to fit the teacher's (unnormalized) attention matrices:

L_attn = (1/h) Σ_{i=1}^{h} MSE( A_i^S, A_i^T )

where h is the number of attention heads and *Ai* refers to the attention matrix corresponding to the i-th head.
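A minimal NumPy sketch of this loss (function name and shapes are my own choices; matrices are assumed to have shape (h, seq_len, seq_len)):

```python
import numpy as np

def attn_distill_loss(A_S: np.ndarray, A_T: np.ndarray) -> float:
    """L_attn = (1/h) * sum_i MSE(A_S[i], A_T[i]) over h attention heads."""
    h = A_S.shape[0]
    return float(sum(np.mean((A_S[i] - A_T[i]) ** 2) for i in range(h)) / h)

# Identical attention matrices give zero loss
A = np.random.rand(12, 8, 8)
print(attn_distill_loss(A, A))  # -> 0.0
```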

The hidden-states-based distillation matches the outputs of the Transformer layers:

L_hidn = MSE( H^S W_h, H^T )

where the matrices *HS* and *HT* refer to the hidden states of the student and teacher networks, respectively. The matrix *Wh* is a learnable linear transformation which transforms the hidden states of the student network into the same space as the teacher network's states.
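The hidden-state loss can be sketched the same way, with the learnable projection W_h modelled here as a fixed NumPy matrix (in actual training it would be a learned parameter):

```python
import numpy as np

def hidn_distill_loss(H_S: np.ndarray, H_T: np.ndarray, W_h: np.ndarray) -> float:
    """L_hidn = MSE(H_S @ W_h, H_T).
    H_S: (seq_len, d_S) student hidden states,
    H_T: (seq_len, d_T) teacher hidden states,
    W_h: (d_S, d_T) projection into the teacher's space."""
    return float(np.mean((H_S @ W_h - H_T) ** 2))

# TinyBERT4 projects d_S = 312 student states into the teacher's d_T = 768 space
rng = np.random.default_rng(0)
H_S, H_T = rng.normal(size=(8, 312)), rng.normal(size=(8, 768))
W_h = rng.normal(size=(312, 768))
loss = hidn_distill_loss(H_S, H_T, W_h)
```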

**Embedding Layer Distillation**

L_embd = MSE( E^S W_e, E^T )

where the matrices *ES* and *ET* refer to the embeddings of the student and teacher networks, respectively. The matrix *We* is a linear transformation playing a similar role as *Wh*.

**Prediction Layer Distillation**

L_pred = CE( z^T / t, z^S / t )

where *zS* and *zT* are the logits vectors predicted by the student and teacher, respectively, CE means the cross-entropy loss, and t is the temperature value. In the experiments, it was found that t = 1 performs well.
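A sketch of this soft cross-entropy in NumPy (helper names are mine; with t = 1 it reduces to the ordinary softmax cross-entropy against the teacher's soft labels):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pred_distill_loss(z_S: np.ndarray, z_T: np.ndarray, t: float = 1.0) -> float:
    """L_pred = CE(z_T / t, z_S / t): cross entropy between the teacher's
    temperature-softened probabilities and the student's log-probabilities."""
    p_T = softmax(z_T / t)
    log_p_S = np.log(softmax(z_S / t))
    return float(-(p_T * log_p_S).sum(axis=-1).mean())

z = np.array([[2.0, 0.5, -1.0]])
loss = pred_distill_loss(z, z)  # minimal when student matches teacher
```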

**Unified Distillation Loss**

Using the above distillation objectives, we can unify the distillation loss of the corresponding layers between the teacher and the student network:

- L_layer = L_embd, if m = 0
- L_layer = L_hidn + L_attn, if M ≥ m > 0
- L_layer = L_pred, if m = M + 1
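The unified loss can be sketched as a dispatch over the layer index m (a sketch assuming index 0 for the embedding layer and M + 1 for the prediction layer, with the individual losses precomputed):

```python
def layer_loss(m: int, M: int, losses: dict) -> float:
    """Select the distillation loss for layer index m:
    embedding loss at m = 0, hidden + attention losses for the
    M Transformer layers, prediction loss at m = M + 1.
    `losses` maps 'embd', 'hidn', 'attn', 'pred' to precomputed values."""
    if m == 0:
        return losses["embd"]
    if 1 <= m <= M:
        return losses["hidn"] + losses["attn"]
    if m == M + 1:
        return losses["pred"]
    raise ValueError(f"layer index {m} out of range")
```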

# TinyBERT Learning

TinyBERT proposes a novel two-stage learning framework consisting of general distillation and task-specific distillation.

General distillation helps TinyBERT learn the rich knowledge embedded in pre-trained BERT, which plays an important role in improving the generalization capability of TinyBERT. The task-specific distillation further teaches TinyBERT the knowledge from the fine-tuned BERT.

**TinyBERT Settings**

TinyBERT4

- Student: TinyBERT4 (M=4, d=312, d′=1200, h=12) has a total of 14.5M parameters
- Teacher: BERT-Base (N=12, d=768, d′=3072, h=12) has a total of 109M parameters
- g(m) = 3m, λ = 1
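The reported parameter counts can be sanity-checked with a back-of-the-envelope count (a rough sketch that assumes BERT's 30,522-token vocabulary and 512 positions, and ignores the pooler and MLM head):

```python
def bert_params(L: int, d: int, d_ff: int, vocab: int = 30522, pos: int = 512) -> int:
    """Approximate parameter count of a BERT-style encoder with
    L layers, hidden size d, and feed-forward size d_ff."""
    emb = vocab * d + pos * d + 2 * d + 2 * d          # token/position/segment embeddings + LayerNorm
    attn = 4 * (d * d + d) + 2 * d                     # Q, K, V, output projections + LayerNorm
    ffn = (d * d_ff + d_ff) + (d_ff * d + d) + 2 * d   # two linear layers + LayerNorm
    return emb + L * (attn + ffn)

print(bert_params(4, 312, 1200) / 1e6)   # -> 14.252592 (close to the reported 14.5M)
print(bert_params(12, 768, 3072) / 1e6)  # -> 108.891648 (close to the reported 109M)
```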

TinyBERT6

- Student: TinyBERT6 (M=6, d=768, d′=3072, h=12) has a total of 67M parameters
- Teacher: BERT-Base (N=12, d=768, d′=3072, h=12) has a total of 109M parameters
- g(m) = 2m, λ = 1

# Results

# Paper

TinyBERT: Distilling BERT for Natural Language Understanding (arXiv: 1909.10351)
