Published in Stories @ Hugging Face


Photo by Shubham Sharan on Unsplash

🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT

2019, October 3rd — Update: We are releasing our NeurIPS 2019 workshop paper describing our approach to DistilBERT, with improved results: 97% of BERT's performance on GLUE (the results in the paper supersede the results presented here). The approach is slightly different from the one explained in this blog post, so this post should be a good entry point to the paper! We applied the same method to GPT-2 and are releasing DistilGPT2! Training code and pre-trained weights for DistilBERT and DistilGPT2 are available here. 🤗

Some people in the community question the relevance of training larger and larger Transformers, especially when you take into account the financial and environmental cost of training. Here are some of the latest large models and their sizes in millions of parameters.

⚗️ Knowledge Distillation — Transferring generalization capabilities

The top 20 guesses from BERT (base) for the masked token. The language model identifies two highly probable tokens (day & life) followed by a long tail of valid tokens.

👯‍♂️ How can we copy this dark knowledge?

We are training the student to generalize the same way as the teacher by matching the output distribution.

The training objective is a cross-entropy over the softened output distributions:

L_ce = - Σ_i t_i · log(s_i), with softmax-temperature p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where t_i and s_i are the probabilities computed from the logits of the teacher and the student respectively, and T is the temperature parameter.
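This soft-target loss can be sketched in a few lines of PyTorch. A minimal illustration, not the exact training code; the tensor names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the temperature-softened teacher
    distribution and the student distribution."""
    # Soften both distributions with the temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)          # t_i
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)  # log s_i
    # L_ce = - sum_i t_i * log(s_i), averaged over the batch.
    return -(soft_teacher * log_soft_student).sum(dim=-1).mean()

# Example: a batch of 4 positions over a 30k-token vocabulary.
student = torch.randn(4, 30000)
teacher = torch.randn(4, 30000)
loss = distillation_loss(student, teacher, T=2.0)  # 0-dim tensor
```

A higher T flattens the teacher distribution, exposing more of the long tail of valid tokens (the "dark knowledge") to the student.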

🗜 Hands-on coding in PyTorch — Compressing BERT

A Knowledge distillation training step in PyTorch. Copy the gist from here.
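The gist itself is not reproduced here, but a distillation training step follows this general shape. A hedged sketch with toy modules; the function and variable names are assumptions, and the T² rescaling of the soft loss follows Hinton et al.'s original recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(teacher, student, inputs, labels, optimizer,
                      T=2.0, alpha=0.5):
    """One training step: mix the soft-target (distillation) loss
    with the usual hard-label cross-entropy."""
    with torch.no_grad():          # the teacher is frozen
        t_logits = teacher(inputs)
    s_logits = student(inputs)

    # Soft targets: cross-entropy against the temperature-softened teacher.
    soft_t = F.softmax(t_logits / T, dim=-1)
    log_s = F.log_softmax(s_logits / T, dim=-1)
    loss_soft = -(soft_t * log_s).sum(dim=-1).mean() * (T ** 2)

    # Hard targets: standard cross-entropy on the true labels.
    loss_hard = F.cross_entropy(s_logits, labels)

    loss = alpha * loss_soft + (1.0 - alpha) * loss_hard
    optimizer.zero_grad()
    loss.backward()                # gradients flow only into the student
    optimizer.step()
    return loss.item()

# Toy usage: distilling one linear "teacher" into a "student".
teacher = nn.Linear(16, 10)
student = nn.Linear(16, 10)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))
loss_value = distillation_step(teacher, student, x, y, opt)
```

In the real setup the teacher is BERT, the student is DistilBERT, and the hard-label term is the masked-language-modeling loss.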

— Why not reduce the hidden size as well?
Reducing it from 768 to 512 would cut the total number of parameters roughly in half. However, in modern frameworks most operations are highly optimized, and variations on the last dimension of the tensor (the hidden dimension) have a small impact on most of the operations used in the Transformer architecture (linear layers and layer normalization). In our experiments, the number of layers was the determining factor for inference time, more than the hidden size.
Smaller does not necessarily imply faster…
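The factor of ~2 comes from the fact that a Transformer layer's weight count scales roughly quadratically with the hidden size h: about 4h² for the attention projections and 8h² for the feed-forward block. A back-of-the-envelope check, ignoring biases and embeddings:

```python
def approx_layer_params(h, ffn_ratio=4):
    """Rough parameter count of one Transformer layer:
    4*h^2 for the Q, K, V and output projections,
    plus 2 * ffn_ratio * h^2 for the two feed-forward matrices."""
    return 4 * h**2 + 2 * ffn_ratio * h**2

ratio = approx_layer_params(768) / approx_layer_params(512)
print(round(ratio, 2))  # -> 2.25
```

So shrinking the hidden dimension from 768 to 512 roughly halves each layer, yet every matrix multiply stays on a well-optimized GPU kernel, which is why it barely speeds up inference.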

— Some works on distillation, like Tang et al., use the L2 distance as a distillation loss, directly on downstream tasks.
Our early experiments suggested that the cross-entropy loss leads to significantly better performance in our case. We hypothesize that in a language modeling setup, the output space (the vocabulary) is significantly larger than the output space of a downstream task, so the logits may compensate for each other in the L2 loss.
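The two options can be written side by side. A minimal sketch, not our training setup; it also shows one concrete difference: the cross-entropy only depends on the normalized distributions, while the L2 loss is sensitive to the raw logit values:

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 30000)  # large LM vocabulary
student_logits = torch.randn(8, 30000)

# Option 1: L2 distance directly on the logits (Tang et al. style).
loss_l2 = F.mse_loss(student_logits, teacher_logits)

# Option 2: cross-entropy against the teacher's output distribution.
loss_ce = -(F.softmax(teacher_logits, dim=-1)
            * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

# Shifting every student logit by a constant leaves the cross-entropy
# unchanged (softmax is shift-invariant) but inflates the L2 loss:
shifted = student_logits + 3.0
shifted_ce = -(F.softmax(teacher_logits, dim=-1)
               * F.log_softmax(shifted, dim=-1)).sum(dim=-1).mean()
assert torch.allclose(loss_ce, shifted_ce)
```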

🎢 Model performances — Testing DistilBERT

Comparison on the dev sets of the GLUE benchmark. ELMo results as reported by the authors. BERT and DistilBERT results are medians of 5 runs with different seeds.

🔮 Downstream task: Distillation & transfer-learning

Extract from the IMDB Review dataset — Source: Kaggle

As noted by the community, you can reach comparable or better scores on the IMDB benchmark with lighter methods (size-wise and inference-wise) like ULMFiT. We encourage you to compare on your own use case! In particular, DistilBERT can give a sensible lower bound on BERT's performance, with the advantage of faster training.

Here we are fine-tuning by distilling a question-answering model into a language model that was itself pre-trained with knowledge distillation! That's a lot of teachers and students 🎓

🙌 Less is more: smaller models also spark joy 🌟




Victor Sanh

Dog sitter by day, Scientist at @huggingface 🤗 by night | Into Natural Language Processing, started with Computer Vision
