BERT for Dummies: State-of-the-art Model from Google

Skillcate AI
8 min read · Oct 1, 2022


Exceeds human performance on language understanding benchmark

Understanding language has always been a difficult affair for computers. Sure, computers can collect, store, and read textual inputs, but they lack basic language context.

Then came Natural Language Processing (NLP), a field of AI aimed at enabling computers to read, analyze, interpret, and derive meaning from text and spoken words, just as we humans do. NLP combines linguistics, statistics, and Machine Learning to help computers ‘understand’ human language.

Over the years, individual NLP tasks were solved by custom models built for each specific task. For example, in my sentiment analysis project tutorial, we built a completely independent model with its own language understanding, because there was no way to borrow that understanding from an external source and transfer it to our use case.

And this changed with BERT 💪

Brief on this learning series

Well, this article is actually the second instalment of my three-part learning series, where we are:

  1. Understanding the intuition behind Transfer Learning,
  2. Deep-diving into Google’s BERT Model, which has achieved superhuman performance in language understanding, and finally
  3. Training (actually, fine-tuning) a Fake News Detection Model by transferring learning from the pre-trained BERT model

Now, let’s continue with this second part: BERT for dummies.

Watch the video tutorial instead

If you are more of a video person, go ahead and watch it on YouTube, instead. Make sure to subscribe to my channel to get access to all of my latest content.

BERT capabilities

BERT serves as a Swiss Army knife solution for 11+ of the most common language tasks. It has an impressive understanding of 70+ global languages, including English.

So, with Transfer Learning, we can use BERT as our base model and fine-tune it for our specific NLP problem, like sentiment analysis, article summarization, or fake news detection. With this, we have a production-ready, highly accurate model in no time.
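
To make this concrete, here’s a minimal sketch of what that borrowing-and-fine-tuning setup looks like with the Hugging Face transformers library. The checkpoint name and the two-label setup are just illustrative assumptions; the actual fine-tuning happens in part 3.

```python
# A minimal transfer-learning sketch: load pre-trained BERT weights and
# attach a fresh classification head for a 2-class task (e.g. fake vs. real news).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # the "borrowed" language understanding
    num_labels=2           # our task-specific output layer
)

# Tokenize a sample headline and run it through the model
inputs = tokenizer("Scientists discover water on Mars", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)      # raw scores for the 2 classes (head is untrained, so random-ish)
```

Only the small classification head starts from scratch; everything else comes pre-trained, which is why fine-tuning is so fast.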

How was BERT trained?

BERT, which stands for Bidirectional Encoder Representations from Transformers, is an open source machine learning framework for natural language processing (NLP), developed by researchers at Google in 2018.

  • The BERT framework was pre-trained on Wikipedia (~2.5B words) and BooksCorpus (~800M words). These large informational datasets contributed to BERT’s deep knowledge not only of the English language but also of our world!
  • Training on a dataset this large took a long time. BERT’s training was made possible thanks to the novel Transformer architecture and sped up by using Tensor Processing Units (TPUs), Google’s custom circuits built specifically for large ML models.
  • With ~64 of these TPUs, BERT training took around 4 days.

Originally, Google released two BERT models: BERT-Large and the smaller BERT-Base (which has slightly lower accuracy, but is still comparable to other state-of-the-art models on performance). We shall be using BERT-Base for our hands-on in this tutorial.

BERT Architecture

Here’s the visualization of the BERT network created by Devlin and his research team at Google AI Language, as presented in their paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. For further reading, here’s the link to the paper: https://arxiv.org/abs/1810.04805.

BERT is bidirectional

Historically, language models could only read text input sequentially — either left-to-right or right-to-left — but couldn’t do both at the same time.

BERT is different that way, as it is designed to read in both directions at once, thanks to its Transformer Architecture. This capability, enabled by the introduction of Transformers, is known as bidirectionality.

Using this bidirectional capability, BERT is pre-trained on two different, but related, NLP tasks:

  • Masked Language Modeling and
  • Next Sentence Prediction.

Masked Language Modeling

The objective of Masked Language Model (MLM) training is to hide a word in a sentence and then have the program predict what word has been hidden (masked) based on the hidden word’s context.
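
As a rough illustration of what this objective looks like, here’s a toy sketch in plain Python. The sentence, the whitespace tokenization, and the masked position are simplified assumptions; real BERT uses WordPiece sub-word tokens and masks about 15% of tokens at random.

```python
# Toy illustration of the MLM pre-training objective:
# hide a token and ask the model to recover it from its context.
sentence = "BERT reads the whole sentence in both directions at once"
tokens = sentence.split()          # real BERT uses WordPiece sub-word tokens

mask_position = 4                  # hide the word "sentence"
hidden_word = tokens[mask_position]
tokens[mask_position] = "[MASK]"

print(" ".join(tokens))
# -> "BERT reads the whole [MASK] in both directions at once"
# Training task: predict the hidden word ("sentence") from everything around it.
```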

Next Sentence Prediction

The objective of Next Sentence Prediction training is to have the program predict whether two given sentences have a logical, sequential connection or whether their relationship is simply random.
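
You can probe this objective yourself through the NSP head that ships with the pre-trained checkpoint. Here’s a short sketch using the transformers library; the example sentences are my own made-up assumptions.

```python
# Minimal sketch of querying the Next Sentence Prediction (NSP) head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "I went to the bakery this morning."
sentence_b = "I bought two fresh croissants."          # logically follows sentence_a
sentence_c = "Penguins are flightless birds."          # random, unrelated sentence

for follow_up in (sentence_b, sentence_c):
    inputs = tokenizer(sentence_a, follow_up, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # index 0 = "is the next sentence", index 1 = "is a random sentence"
    is_next = logits[0, 0] > logits[0, 1]
    print(f"{follow_up!r} -> follows logically: {bool(is_next)}")
```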

As a fact, BERT is trained on both MLM and NSP at the same time; the two objectives are optimized together, and for NSP, half of the sentence pairs are true next sentences while the other half are random.

Brief on Transformers

As I said a while back, BERT’s training was made possible thanks to the novel Transformer architecture, which was first introduced by Google in 2017.

The Transformer processes any given word in relation to all the other words in a sentence, rather than processing them one at a time. Take the sample sentence in example #1: The animal didn’t cross the street because ‘it’ was too wide, where ‘it’ is masked. By looking at all the surrounding words, the Transformer allows the BERT model to understand the full context and work out what the masked word should be.

Contrast this with the traditional approach to language processing, word embeddings, in which earlier models like GloVe and word2vec map every single word to a single vector that represents only that word’s meaning. Because of this, those techniques fail at context-heavy use cases: every word is, in effect, fixed to one vector and one meaning.

BERT is also the first NLP technique to rely solely on the self-attention mechanism, made possible by the bidirectional Transformers at the center of BERT’s design. This is significant because a word often changes meaning as a sentence develops: each word added augments the overall meaning of the word the NLP algorithm is focusing on.

For example, in example #2 above, when I change the context from an animal crossing the street to a man (named Harry) crossing the river, the model understands the context change and predicts the masked word as ‘he’.
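
To see this difference between fixed word vectors and BERT’s contextual ones in code, here’s a small sketch that compares the vector BERT assigns to the word “bank” in different sentences. The checkpoint name and example sentences are my own assumptions.

```python
# Sketch: the same word gets a different vector depending on context,
# unlike static embeddings (word2vec / GloVe) where one word = one vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return BERT's contextual vector for `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embedding_of("he sat on the bank of the river", "bank")
money = embedding_of("she deposited cash at the bank", "bank")
other = embedding_of("he walked along the bank of the stream", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))   # typically lower: two different senses of "bank"
print(cos(river, other, dim=0))   # typically higher: same "river bank" sense
```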

Chances are, you’re using it already!!

BERT has helped Google better surface English results for nearly all searches since November of 2020. Here’s an example of how BERT helps Google better understand specific searches.

Pre-BERT, Google surfaced information about getting a prescription filled. Post-BERT, Google understands that “for someone” relates to picking up a prescription for someone else, and the search results now help answer that.

BERT excels at several language-understanding functions that together make this possible.

The best part about BERT is that it’s open source, meaning anyone can use it. And so will we.

Seeing BERT in action

Now, without further delay, let’s go straight into getting hands-on with BERT. I will show you a couple of ways of doing this.

  • First one is Hugging Face. Go to their website, and on the homepage itself you will find the hosted Inference API widget to query BERT.
  • Second option is to run the following code in a Jupyter Notebook to load BERT and start querying (a minimal sketch is shown below):
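
The original code block isn’t reproduced here, but a minimal version of it, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, would look something like this:

```python
# Load a pre-trained BERT and query it with masked sentences.
# Run `pip install transformers torch` first if needed.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Replace the word you want BERT to guess with the special [MASK] token.
for prediction in unmasker("Hope you are having [MASK]!"):
    print(prediction["token_str"], round(prediction["score"], 3))
```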

Alright, now we are all set to start querying BERT. The idea is to type in a sentence and keep one of the words masked, like this: [MASK].

So, if you try: Hope you are having [MASK]!, BERT knows you are having fun 😛

Next, if I try: My name is Gopal and I live in New Delhi, [MASK]., BERT predicts the country New Delhi is in. Cool, right?!

  • BERT also carries the gender biases present in its human-written training data. If I try: The man worked as a [MASK]. and then change it to: The woman worked as a [MASK]., we get different, gender-specific results.

So, as you may see, the model predicts job roles stereotypically associated with men and women, respectively.

Let me also freak you out a bit with this: To save planet earth, humans must [MASK].

Although the confidence of these predictions is low, it’s still not the kind of response we would want from an AI model.

Conclusion

With this, we have come to the end of part #2 of our ongoing Transfer Learning series. Hope you are liking it so far. Do share your feedback or any queries you may have in the comments section below, and I’ll be more than happy to answer.

In the third part, we shall train a Fake News Detection Model with the pre-trained BERT model as the base, using Transfer Learning.

Brief about Skillcate

At Skillcate, we are on a mission to bring you application-based machine learning education. We launch new machine learning projects every week. So, make sure to subscribe to our YouTube channel and hit that bell icon, so you get notified when our new ML projects go live.

Shall be back soon with a new ML project. Until then, happy learning 🤗!!

Reference Links

  1. https://huggingface.co/blog/bert-101?text=Earth+can+be+saved+if+humans+%5BMASK%5D.
  2. https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
  3. https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model
