Distilling a BERT model with Hugging Face

Josue Nascimento
IA em Saúde: NeuralMed
2 min read · Jul 17, 2021

BERT is a bidirectional transformer model, pre-trained on large amounts of unlabeled text to learn language representations that can then be fine-tuned for specific machine learning tasks.

The BERT base model is trained on BooksCorpus (800M words) and English Wikipedia (2,500M words), which results in a model with 110 million parameters.

To reduce the model size, [1] showed a way to shrink the pretrained BERT model while keeping performance close to the original. The resulting model is called DistilBERT, and it is trained with knowledge distillation between two models, a teacher and a student (for a deeper explanation, read the paper [1]). On downstream tasks, the smaller model retains on average 97% of the original model's performance.
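As a rough sketch of the idea (not the repository's exact code), the student is trained to match the teacher's softened output distribution through a KL divergence term; the temperature value below is only illustrative:

import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the student
    # towards the teacher with a KL divergence (scaled by T^2).
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)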

In this tutorial, I will show how to apply distillation to a BERT model using the Hugging Face library.

First, clone the Hugging Face Transformers repository:

https://github.com/huggingface/transformers.git

Then go to this path:

transformers/examples/research_projects/distillation/scripts

In this path we will use 3 scripts:

binarized_data.py, token_counts.py, and train.py
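Putting the setup together, it looks like this. Note that the commands below are run from the distillation folder (one level above scripts), which is why they call the scripts as scripts/binarized_data.py and so on:

git clone https://github.com/huggingface/transformers.git
cd transformers/examples/research_projects/distillation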

Binarized Data

We need a dump file with sentences from your domain, one sequence per line. To start the distillation process, we first binarize all the data in this dump, so we don't have to preprocess each sentence one at a time:

python scripts/binarized_data.py \
    --file_path data/dump.txt \
    --tokenizer_type bert \
    --tokenizer_name bert_pretrained \
    --dump_file data/binarized_text

tokenizer_type: the architecture of the model that will be distilled (here, bert).

tokenizer_name: the name or path of the pretrained model that will be distilled (here, the placeholder bert_pretrained). The script saves the binarized data as <dump_file>.<tokenizer_name>.pickle, in this case data/binarized_text.bert_pretrained.pickle, which is the file used in the next steps.
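For illustration, dump.txt is just plain text with one sequence per line, for example:

The patient was admitted with chest pain and shortness of breath.
The chest X-ray showed no signs of pneumonia.
The patient was discharged after three days of observation.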

Token counts

The masked language modeling loss used in the distillation process puts more emphasis on rare words, so let's count token occurrences first:

python scripts/token_counts.py \
    --data_file data/binarized_text.bert_pretrained.pickle \
    --token_counts_dump data/token_counts.pickle \
    --vocab_size 30522
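The counts themselves are just token frequencies over the binarized data. As a rough sketch of how they are used (not the script's exact code), rarer tokens end up with a higher probability of being selected for masking:

import pickle
import numpy as np

with open("data/token_counts.pickle", "rb") as f:
    counts = np.array(pickle.load(f), dtype=np.float64)

# Smoothed inverse frequency: rare tokens get a higher masking probability.
smoothing = 0.7  # illustrative value
probs = np.maximum(counts, 1.0) ** -smoothing
probs /= probs.sum()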

Train the distilled language model

python train.py \
    --student_type distilbert \
    --student_config training_configs/distilbert-base-uncased.json \
    --teacher_type bert \
    --teacher_name bert_pretrained \
    --alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --alpha_clm 0.0 --mlm \
    --data_file data/binarized_text.bert_pretrained.pickle \
    --token_counts data/token_counts.pickle \
    --dump_path dump_folder \
    --batch_size 1 --n_epoch 5

teacher_name: the pretrained language model to be distilled (the teacher).

data_file and token_counts: the binarized data and the token counts produced in the previous steps. The --mlm flag enables the masked language modeling objective used with a BERT teacher.

Here is the student config file used above:

{
    "activation": "gelu",
    "attention_dropout": 0.1,
    "dim": 768,
    "dropout": 0.1,
    "hidden_dim": 3072,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "n_heads": 12,
    "n_layers": 6,
    "sinusoidal_pos_embds": true,
    "tie_weights_": true,
    "vocab_size": 30522
}

This config can be changed as needed.

When the training process ends, the distilled model will be saved in dump_path.
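As a minimal sketch of how to use the result (assuming the final weights are written as pytorch_model.bin inside dump_path), the distilled student can be reloaded with the same JSON config used for training:

import torch
from transformers import DistilBertConfig, DistilBertForMaskedLM

# Rebuild the student architecture from the training config...
config = DistilBertConfig.from_json_file("training_configs/distilbert-base-uncased.json")
model = DistilBertForMaskedLM(config)

# ...and load the distilled weights produced by train.py (path is illustrative).
state_dict = torch.load("dump_folder/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()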

References:

[1] Sanh, V., Debut, L., Chaumond, J., Wolf, T. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." https://arxiv.org/abs/1910.01108

[2] Hugging Face Transformers repository: https://github.com/huggingface/transformers
