BERT (Bidirectional Encoder Representations from Transformers) | one minute summary

It took the Transformer and transformed it to make it even more useful

Jeffrey Boschman
One Minute Machine Learning
2 min read · Jul 8, 2021


Example idea from https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/

BERT was introduced in the 2018 paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. (Google) to tackle the specific task of creating a good language representation model.

Prerequisite knowledge: Transformers, transfer learning, encoder-decoder architectures

  1. Why? For computer vision tasks, one can often take a pre-trained model like Inception or ResNet and then fine-tune it for a specific task. But for NLP, there had been no comparable generic pre-trained model. One reason is that, before Transformers, language models were largely unidirectional (i.e., they did not take into account the context of a sentence from both directions).
  2. What? BERT (Bidirectional Encoder Representations from Transformers) is a simple but powerful model that learns the context of unlabelled text from both the left and right sides, and can therefore be used as a good pre-trained NLP model.
  3. How? BERT is essentially the encoder part of a Transformer, but with an extra step first: randomly masking some of the input tokens and adding the objective of predicting each masked token based only on the surrounding tokens. Because a Transformer architecture is used, the entire token sequence is fed in all at once (i.e., the context of the input is learned from both the left and right sides at the same time). A short sketch of this masked prediction is shown after this list.
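
As a concrete illustration of point 3, here is a minimal sketch of masked-token prediction with a pre-trained BERT. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is named in the post; any masked-language-model checkpoint would behave the same way.

```python
# A minimal sketch of BERT's masked-token prediction, assuming the
# Hugging Face `transformers` library and the bert-base-uncased checkpoint.
from transformers import pipeline

# Load a pre-trained BERT with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the whole sentence at once, so its guesses for [MASK] are
# conditioned on the words both before and after the masked position.
for prediction in fill_mask("The doctor prescribed [MASK] for the infection."):
    print(f"{prediction['token_str']:>15}  (score: {prediction['score']:.3f})")
```

Because the predictions use context from both directions, this is exactly the pre-training signal that makes BERT a good starting point for fine-tuning on downstream NLP tasks.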
