BERT — A Practitioner’s Perspective
What is BERT?
BERT stands for “Bidirectional Encoder Representations from Transformers”. It is one of the most widely used language models today, and according to published results it (or its variants) has hit quite a few language benchmarks out of the park. If interested, you can read the paper here: Bidirectional Encoder Representations from Transformers. The appendix of the paper discusses the different types of language tasks that BERT was tested on.
Details
There are many articles online that explain what BERT does. I found most of them cumbersome and loaded with details that make it even harder to understand. Let me try to explain it in simple terms, i.e. not as a researcher who is planning to improve BERT but as a person who is interested in using BERT.
As a black-box
Let us first understand BERT as a black-box.
The above picture is taken from the paper. Let us first understand what the inputs and outputs for BERT are. You can see that there are multiple ways in which you can submit inputs to BERT.
What can you input?
- Single sentence — In the above figure you can see illustrations b and d, where you feed a single sentence to BERT. This is done for sentence-classification tasks (e.g. sentiment analysis) or sentence-tagging tasks (e.g. named entity recognition).
- Two sentences (separated by a marker) — In the above figure you can see illustrations a and c, where you feed two sentences separated by a marker ([SEP]). This is done for sentence-pair classification tasks such as sentence similarity, deciding whether sentence 2 follows sentence 1, or extracting answers to questions from a given paragraph of text.
What is the output?
- For single-sentence tasks — For single-sentence tasks (b and d in the figure) you either consume the class label (in b) for sentence-classification tasks (sentiment analysis) or a tag (in d) for sentence-tagging tasks (named entity recognition, etc.).
- For two-sentence tasks — For cases where you input two sentences (a and c in the figure), you either consume the class label (in a), which can be a score between 0 and 1 (for example, the probability that sentence 2 follows sentence 1), or you consume the second part of the output tags (in c), which is a representation of the answer you require (for the question-and-paragraph pair input to BERT).
Each output tag (C or T) is a vector in an H-dimensional space, where H is 768 for BERT-base (as per the paper); most implementations of BERT give you an embedding in 768 dimensions.
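If you want to see that H-dimensional output for yourself, here is a minimal sketch using the Hugging Face transformers library (an assumption about tooling; any BERT implementation exposes something similar). The bert-base-uncased checkpoint is just the standard publicly available one.

```python
# Minimal sketch: inspect the per-token output vectors of a pre-trained BERT.
# Assumes the `transformers` and `torch` packages are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT gives one vector per token.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, H) with H = 768 for BERT-base
print(outputs.last_hidden_state.shape)
```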
Cool! So far, so good. If you are just interested in using BERT then you are good to go. You can directly install a few open-source libraries and start playing with BERT. Examples are listed below.
Example
You can check the examples quoted here to see how straightforward it is to use BERT: probably not more than 10 lines of code. The links below have the full code, and a small sketch follows the list to give you a flavor.
- Extractive text summarizer — https://pypi.org/project/bert-extractive-summarizer/
- Sentence encoder — https://pypi.org/project/sentence-transformers/
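For instance, encoding sentences into vectors with the second library takes only a few lines. Below is a rough sketch; the model name is just one of the publicly shared checkpoints and is an assumption, not a requirement.

```python
# Minimal sketch using the sentence-transformers package linked above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained checkpoint works
sentences = ["BERT is easy to use.", "Transformers power modern NLP."]
embeddings = model.encode(sentences)             # one fixed-size vector per sentence
print(embeddings.shape)
```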
Advanced — Step 1 — Pre-training & Fine-tuning
Pre-training
The major motivation behind BERT is to build a model that is pre-trained on an existing corpus, so that the same model can then be fine-tuned for different tasks. For example, in the above figure we see the same BERT model being used for various tasks. What the research team did was build a BERT model and train it on English Wikipedia (2,500M words) and BooksCorpus (800M words) using two tasks. The learning tasks are also simple:
- Masked LM (MLM) — They select 15% of the input tokens (i.e. words) at random and make BERT predict them, training it accordingly. During this training they do not check whether the entire sentence is reproduced in order; they only make sure the model guesses the selected tokens correctly. To avoid creating a mismatch between pre-training and fine-tuning (where no [MASK] token appears), a chosen i-th token is handled as follows: (1) it is replaced with the [MASK] token 80% of the time, (2) replaced with a random token 10% of the time, and (3) left unchanged 10% of the time. The corresponding output Ti is then used to predict the original token with a cross-entropy loss. A toy sketch of this rule follows the list.
- Next Sentence Prediction (NSP) — They input two sentences A and B to BERT and make it predict whether B follows A. For 50% of the inputs B actually follows A, and for the other 50% it does not. Sentences A and B are fed to BERT with a [SEP] token inserted between them.
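To make the 80/10/10 rule concrete, here is a toy sketch of how the masked inputs and prediction targets could be constructed. This is purely illustrative and not the authors' code; the vocabulary and mask rate below are made up.

```python
# Toy illustration of the MLM masking rule described above.
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    labels = [None] * len(tokens)           # positions BERT must predict
    masked = list(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_rate:     # select ~15% of tokens
            labels[i] = token               # remember the original token
            r = random.random()
            if r < 0.8:                     # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                   # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: leave the token unchanged
    return masked, labels

print(mask_tokens("my dog is hairy".split(), vocab=["cat", "apple", "blue"]))
```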
Fine-tuning
So, in the previous step you have a BERT that is pre-trained on some corpus and on some learning tasks. You now have a model that outputs a set of tags, with each tag/output ‘T’ in an H-dimensional space (768 dimensions as per the paper). Cool! Now what you need to do is fine-tune the entire model for your use-case. You can attach the BERT output layer to another layer of your choice (e.g. a multi-label classifier). The paper lists 11 tasks that they fine-tuned the pre-trained BERT for, along with the results they obtained on those tasks. When they fine-tune, they make sure the weights are tuned across the entire model (and not just in the layers attached on top of BERT).
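As a concrete (and heavily simplified) illustration, here is roughly what fine-tuning the whole model for sentence classification can look like with the Hugging Face transformers library. The data, model name, label count and learning rate are placeholders, not recommendations from the paper.

```python
# Minimal fine-tuning sketch: all of BERT's weights receive gradients,
# not just the classification head. Assumes `transformers` and `torch`.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]       # toy labelled data (assumption)
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)        # loss is computed internally
outputs.loss.backward()                        # gradients flow through the entire model
optimizer.step()
print(float(outputs.loss))
```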
Advanced — Step 2 — Under the hood
Transformers
BERT basically leverages Transformers. The paper “Attention Is All You Need” introduces the Transformer, which is an encoder-decoder architecture. For full details, refer to this article: http://jalammar.github.io/illustrated-transformer/ (pictures are taken from this article). This article is also good: https://towardsdatascience.com/transformers-141e32e69591. Transformers are now the bleeding edge in NLP and have largely replaced LSTM-based RNN models.
From the above example you can see that the transformer is used for translating sentences (French to English in the example) or even for sentence prediction. When we peer inside the encoder and decoder, the layers at a superficial level are as follows:
Transformers, in turn, leverage the concept of attention. The idea behind attention is that the context of a word in a paragraph is captured, in some sense, by all the other words in that paragraph/document. You can refer to the links mentioned above for more details.
Attention
Attention can be achieved using multiple methods. Google’s attention paper mentioned above does not use an RNN/CNN on top of the attention layer. But there are other approaches, like the one presented in this video (not me, btw :) ), where attention is used in conjunction with RNNs:
Transformers contd…
With reference to the Google paper (which is a well-cited one), when you zoom into the encoding and decoding layers, the transformer architecture looks like this.
The fun part is that instead of a single attention mechanism, this one uses multi-head attention. What is the difference?
Single-head attention brings focus to one area of the picture/corpus at a time, whereas multi-head attention ensures that multiple areas of the corpus/image are attended to at the same time.
More details can be found here: NLP — Bert & Transformer
And the Google paper defines its multi-head attention mechanism like this:
Now, if you are still interested in digging into what those V, K and Q vectors are and what linear operations are applied to them, the best place to get those details is the paper. The article [NLP — Bert & Transformer] also has a very good description of these details. If I covered them here, this article would become super long and boring. For this article it is sufficient to understand that there is something called attention and that it is used in a transformer.
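That said, the core operation each attention head performs fits in a few lines: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Here is a toy numpy sketch of that scaled dot-product attention; the shapes are arbitrary, and multi-head attention simply runs several of these in parallel on linearly projected Q, K and V before concatenating the results.

```python
# Toy sketch of scaled dot-product attention (not tied to any library).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how much each query attends to each key
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                     # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))   # 4 query positions, d_k = 64
K = rng.standard_normal((6, 64))   # 6 key/value positions
V = rng.standard_normal((6, 64))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```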
Uses of a Transformer
The Transformer is used in BERT, so that is one major use. As I said already, transformers are replacing LSTM-based RNNs, so whenever you consider an RNN for your project/problem, you should also give the Transformer a thought and see if you can train a transformer instead of an RNN for your use-case. It is an encoder-decoder architecture, and you should be able to use a Transformer in such use-cases. Some typical problems tackled with a transformer are listed below (a minimal PyTorch sketch follows the list):
- Next-Sentence prediction
- Question Answering
- Reading comprehension
- Sentiment analysis (and)
- Text summarization
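If you want to experiment with such use-cases, PyTorch ships an encoder-decoder Transformer module out of the box. Below is a minimal sketch with random tensors standing in for embedded source and target sequences; choosing nn.Transformer is my assumption about tooling, not something the paper prescribes.

```python
# Minimal sketch of an encoder-decoder Transformer using PyTorch's built-in module.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)   # (source length, batch size, embedding dim)
tgt = torch.rand(9, 32, 512)    # (target length, batch size, embedding dim)

out = model(src, tgt)           # decoder output: one vector per target position
print(out.shape)                # torch.Size([9, 32, 512])
```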
BERT
Now coming to BERT, we already said that it builds on top of transformers. BERT also builds on the knowledge gained from models released a few years earlier, e.g. ELMo and OpenAI GPT. The major difference is that BERT is deeply bidirectional: in every layer, each token can attend to both its left and right context, whereas OpenAI GPT attends only to the left context and ELMo merely concatenates separately trained left-to-right and right-to-left models. This deep bidirectionality seems to ensure that attention is properly distributed across the entire input, thereby ensuring that the model learns the language.
Why does it work?
The paper talks about the reasons behind the apparent success of BERT over other models. But no one really knows the precise reasons behind the success of these models (or, for that matter, of any deep-learning model). This is an active area of research, and you may want to consider it for your thesis :) . It also means that DL modeling is more art than science. Still, the BERT paper deserves credit: the authors are clearly aware of many other architectures and their inner workings, and have a good hunch about what may or may not work. They also addressed three key things which make BERT outright appealing and interesting:
- Unsupervised learning for pre-training
- Using a readily available corpus like Wikipedia for pre-training
- Reducing the effort required to fine-tune for specific tasks
How to use BERT?
There are many libraries available for BERT. Some notable ones that I have come across (and mentioned earlier in this article) are:
- Hugging Face transformers — https://pypi.org/project/transformers/
- Sentence encoder — https://pypi.org/project/sentence-transformers/
- Extractive text summarizer — https://pypi.org/project/bert-extractive-summarizer/
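The quickest way to check that your install works is the transformers pipeline API. Here is a minimal sketch doing masked-word prediction (the MLM task BERT was pre-trained on); the checkpoint name is just the standard pre-trained one.

```python
# Minimal sketch: masked-word prediction with a pre-trained BERT.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("BERT is a [MASK] model."):
    print(prediction["token_str"], round(prediction["score"], 3))
```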
Conclusion
In conclusion, if you are just consuming pre-trained BERT then it is pretty straightforward. Hugging Face also hosts fine-tuned models that others have shared with the community. If you wish to fine-tune BERT for your own use-cases and you have some tagged data, you can use Hugging Face transformers and PyTorch to fine-tune a pre-trained BERT for your use-case. So, what are you waiting for? Just install and explore BERT.