Image by Cedric Yong from Pixabay


Explaining BERT Simply Using Sketches

BERT, introduced by Google in 2018, was one of the most influential papers in NLP. But it is still hard to understand.

Rahul Agarwal
11 min read · Apr 14, 2021


In my last series of posts on Transformers, I talked about how a transformer works and how to implement one yourself for a translation task.

In this post, I will go a step further and try to explain BERT, one of the most popular NLP models, which uses a Transformer at its core and which achieved state-of-the-art performance on many NLP tasks, including classification, question answering, and NER tagging, when it was first introduced.

Specifically, unlike other posts on the same topic, I will go through the highly influential BERT paper, Pre-training of Deep Bidirectional Transformers for Language Understanding, while keeping the jargon to a minimum, and try to explain how BERT works through sketches.

So, what is BERT?

In simple words, BERT is an architecture that can be used for a lot of downstream tasks such as question answering, classification, and NER tagging. One can think of a pre-trained BERT as a black box that provides us with H = 768-dimensional vectors for each input token (word) in a sequence. Here, the sequence can be a single sentence or a pair of sentences separated by…
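To make that black-box view concrete, here is a minimal sketch (mine, not from the paper) that loads a pre-trained BERT with the Hugging Face transformers library and inspects the per-token hidden vectors. The model name bert-base-uncased and the example sentence are purely illustrative choices.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a pre-trained BERT base model (12 layers, hidden size H = 768).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# A single sentence; a sentence pair could be passed as two arguments.
inputs = tokenizer("BERT turns every token into a vector.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token (including [CLS] and [SEP]).
hidden_states = outputs.last_hidden_state
print(hidden_states.shape)  # e.g. torch.Size([1, seq_len, 768])
```

These per-token vectors are what the downstream task heads (classification, question answering, NER tagging) are built on top of.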
