Distill-BERT: Using BERT for Smarter Text Generation

Generate better stories, essays, answers and more using BERT’s knowledge

Rohit Pillai
The Startup
5 min read · Aug 7, 2020


Photo by MILKOVÍ on Unsplash

The field of natural language processing is now in an age where large-scale pretrained models are the first thing to try for almost any new task. Models like BERT, RoBERTa and ALBERT are so large and have been trained on so much data that they can generalize their pretrained knowledge to understand almost any downstream task you use them for. But that is all they can do: understand. If you wanted to answer a question that wasn't multiple choice, write a story or an essay, or do anything else that requires free-form writing, you'd be out of luck.

Now don't get me wrong: just because BERT-like models can't write stories doesn't mean there aren't models out there that can. Introducing the Sequence-to-Sequence (Seq2Seq) model. When we write a story, we write the next word, sentence or even paragraph based on what we've written so far. This is exactly what Seq2Seq models are designed to do. They predict the most likely next word based on all the words they've seen so far, modeling them as a time series, i.e. the order of the previous words matters.
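
Concretely, an autoregressive Seq2Seq decoder factorizes the probability of an output sequence into a product of next-word probabilities, each conditioned on the words generated so far (and on a source or context sequence x, if there is one). In rough notation, with y for the output words and θ for the model's parameters:

```latex
P_\theta(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} P_\theta(y_t \mid y_{<t}, x)
```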

Seq2Seq models have been around for a while, and there are several variants used for text generation tasks like summarization and translation from one language to another. The exploration of Seq2Seq models has culminated in the development of models like GPT-2 and GPT-3, which can complete news snippets, stories, essays and even investment strategies, all from a few sentences of context! Fair warning, though: not all of these generated pieces of text make very much sense when you read them; probability distributions over words can only take you so far.
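
As a quick illustration (my own example, not part of the paper), here is a minimal sketch of asking a pretrained GPT-2 to complete a prompt with the Hugging Face transformers library; the model name, prompt and sampling settings below are arbitrary choices:

```python
# Minimal sketch: completing a prompt with a pretrained GPT-2
# (example model, prompt and sampling settings; not the paper's setup).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The detective stepped into the abandoned warehouse and"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation one token at a time, each conditioned on everything so far.
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```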

A few of the fundamental building blocks used in designing these models are Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs) and Transformers (encoder-decoder architectures that learn numerical representations of words). Transformers, in particular, also form the backbone of BERT-like models and of GPT-2/3.

A natural question now arises: if the same Transformer backbone underlies both BERT-like models and GPT, why can't BERT-like models generate text? It's because they're trained in a way that considers both future and past context. During training, these models are given sentences with a few words masked out as input and are expected to predict the missing words. To predict a missing word, they need to know what the words before it mean as well as the words after it. In this spirit, there has been work on getting BERT-like models to help with text generation, such as Yang et al.'s CT-NMT.
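
To make that masked-word setup concrete, here is a small sketch (my own example, not from the paper) of BERT filling in a blank using the words on both sides of the mask, via the Hugging Face fill-mask pipeline:

```python
# Illustrative sketch: BERT predicting a masked word from bidirectional context
# (the sentence and model choice are arbitrary examples).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the words on BOTH sides of [MASK] when ranking candidate words.
for prediction in fill_mask("The chef [MASK] the soup before serving it."):
    print(f"{prediction['token_str']:>12s}  (score: {prediction['score']:.3f})")
```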

Another line of thought on using BERT-like models for text generation starts from this question: can the knowledge of future words that these models acquire during training help Seq2Seq models formulate more coherent sentences, rather than just predicting the most likely next word? This is exactly the question that the researchers at Microsoft Dynamics 365 AI Research try to answer with Distill-BERT.

They use knowledge distillation to transfer knowledge from a teacher BERT model to a student Seq2Seq model, while also maintaining the original Seq2Seq goal of predicting the most likely next word. This way, the student model retains the best of both worlds. A more formal description of this technique is given by the equations below.

[Equation: the objective used to train the student Seq2Seq model]
[Equation: the BERT (knowledge distillation) part of the objective]

Here, y_t denotes the probability distribution predicted by the teacher BERT over all vocabulary words at position t of the generated text.

[Equation: the original Seq2Seq objective]
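
Since the equation images are not reproduced here, the combined training objective has roughly the following shape (my own notation, not copied verbatim from the paper: φ is the frozen teacher BERT, θ the student, x the source sequence, y_{<t} the words generated so far, V the vocabulary, and α a weighting hyperparameter):

```latex
% Sketch of the combined objective (approximate notation)
\mathcal{L}(\theta) = \alpha \, \mathcal{L}_{\mathrm{bert}}(\theta)
                    + (1 - \alpha) \, \mathcal{L}_{\mathrm{seq2seq}}(\theta)

% BERT (distillation) part: match the teacher's soft word distribution at each position
\mathcal{L}_{\mathrm{bert}}(\theta) = - \sum_{t=1}^{T} \sum_{w \in V}
    P_{\phi}(y_t = w \mid y_{\setminus t}, x) \, \log P_{\theta}(y_t = w \mid y_{<t}, x)

% Original Seq2Seq part: standard next-word cross-entropy
\mathcal{L}_{\mathrm{seq2seq}}(\theta) = - \sum_{t=1}^{T} \log P_{\theta}(y_t \mid y_{<t}, x)
```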

Once the student model is trained, the teacher BERT is no longer needed and only the student is used to generate text, so Distill-BERT requires no additional resources at generation time. The technique is also teacher-agnostic: any BERT-like model, such as RoBERTa, ALBERT or BERT itself, can be used to distil knowledge into the student.
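
As a rough code-level illustration (a minimal PyTorch sketch with hypothetical tensor names and an arbitrary α, not the authors' implementation), notice that the student's loss only needs the teacher's per-position word distributions, which is exactly why any masked language model can play the teacher role:

```python
# Minimal PyTorch sketch of the distillation + next-word objective
# (hypothetical tensor names and alpha value; not the authors' implementation).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, target_ids, alpha=0.5):
    """
    student_logits: (batch, seq_len, vocab) raw scores from the Seq2Seq student
    teacher_probs:  (batch, seq_len, vocab) soft word distributions from the BERT-like teacher
    target_ids:     (batch, seq_len) ground-truth next-word indices
    """
    log_probs = F.log_softmax(student_logits, dim=-1)

    # BERT part: soft cross-entropy against the teacher's distribution at each position.
    kd_loss = -(teacher_probs * log_probs).sum(dim=-1).mean()

    # Original Seq2Seq part: standard next-word cross-entropy against the hard labels.
    ce_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        target_ids.reshape(-1),
    )

    return alpha * kd_loss + (1 - alpha) * ce_loss
```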

To prove that their method works, the researchers distil BERT's knowledge into a student Transformer and use it for German-to-English translation, English-to-German translation and summarization. The student Transformer shows significant improvement over a regular Transformer trained without BERT's knowledge and even achieves state-of-the-art performance on German-to-English translation.

They also distil BERT's knowledge into a student RNN, showing that the technique is student-agnostic as well. This RNN is applied to English-to-Vietnamese translation and likewise shows an improvement.

Here's a link to the paper if you want to know more about Distill-BERT, a link to the code if you want to try training your own Seq2Seq model, and click here to see more of our publications and other work.

References

  1. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, Language models are unsupervised multitask learners (2019), OpenAI Blog 1, no. 8: 9.
  2. Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al., Language models are few-shot learners (2020), arXiv preprint arXiv:2005.14165.
  3. Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil, Model compression (2006), In KDD.
  4. Yang, Jiacheng, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li, Towards making the most of BERT in neural machine translation (2019), arXiv preprint arXiv:1908.05672.
  5. Chen, Yen-Chun, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu, Distilling knowledge learned in BERT for text generation (2020), In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7893–7905.


I'm an engineer at Microsoft Dynamics 365 AI Research and I'll post our new NLP, CV and Multimodal research. Check out https://medium.com/@rohit.rameshp