
BERT for Everyone

How Google’s State-of-the-Art Natural Language Processing AI works, in layman’s terms

Andre Ye
Feb 28

Google unveiled its Bidirectional Encoder Representations from Transformers (BERT) model in late 2018, and in late 2019 announced that BERT would power its search engine, where the company said it would impact 1 in 10 queries. Its neural network-based technique for natural language processing (NLP) broke several records on NLP benchmarks. There’s been a lot of activity in the machine learning community around BERT, but I’ve yet to come across a tutorial that isn’t jam-packed with intricate technicalities.

With today’s programming tools, it isn’t necessary to understand the technical mathematics behind a technique in order to use it, even one as complex as BERT. This article will explain the gist of how BERT works intuitively, without subjecting the reader to an onslaught of equations.

Let’s get started!


NLP models need to be pretrained. It takes a human several years to get a solid grasp of any language, and even with the speedup computers offer, a model can’t learn a language from scratch in a few minutes or even a day. Before BERT, pretraining was mostly limited to word embeddings, which map each word to a vector capturing some aspects of its meaning; the vector for ‘watermelon’, for example, ends up near those for ‘green’, ‘fruit’, and ‘seed’. The embeddings are trained on a massive unlabeled body of text, like all of Wikipedia, and then stored in a library or package for use in a downstream model, say, one that recognizes sentiment. This lets models gain knowledge from far larger datasets without the training time.
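As a concrete illustration, here is a minimal sketch of loading and querying pretrained word embeddings. It assumes the gensim library and its downloadable ‘glove-wiki-gigaword-50’ vectors, which are just one common choice rather than anything specific to this article. Note that every word gets a single, fixed vector no matter what sentence it appears in.

# A minimal sketch of using pretrained word embeddings (assumes gensim).
import gensim.downloader as api

# GloVe vectors pretrained on Wikipedia and Gigaword (downloads on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Words whose vectors sit closest to 'watermelon'.
print(vectors.most_similar("watermelon", topn=5))

# 'bank' is represented by exactly one 50-dimensional vector, context or not.
print(vectors["bank"].shape)  # (50,)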

However, word embedding models are, in general, not very powerful. Available word embeddings are trained on very shallow language modelling tasks and assign each word a single vector, so they cannot capture combinations of words or context. The word ‘bank’, for instance, gets exactly the same representation in ‘rob the bank’ as in ‘the water overflowed the riverbank’, even though it means something completely different in each.

Language modelling, at its core, is estimating the probability of a word occurring in a given context. Given “it’s raining cats and ___”, for instance, a language model would assign ‘dogs’ the highest probability.

Language models are usually trained in the manner we read — from left to right. They are given a sequence of words and must predict the next word.

For example, if the network is given the starter “She walked her”, it may complete it to form the sentence “She walked her dog”. This way, language models get a feel for how text is written, which is particularly helpful when generating sentences.
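Here is a minimal sketch of that left-to-right, next-word style of prediction. It assumes the Hugging Face transformers library and the GPT-2 model, which is used purely as a convenient example of a standard left-to-right language model (it is not part of BERT):

# A sketch of left-to-right next-word prediction with a standard language
# model (GPT-2), via the transformers library.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("She walked her", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, sequence_length, vocab_size)

# Probability distribution over the next word, given the words so far.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_word_probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))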

Some more complex language models, such as bidirectional LSTMs, learn to read both forward and backward: predicting ‘dog’ from “She walked her …” when reading left to right, and predicting ‘She’ from “… walked her dog” when reading right to left. While this strengthens the model’s familiarity with the text, the two directions are trained separately and only combined afterwards, so the model never uses the context before and after a word at the same time.
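To make that concrete, here is a minimal PyTorch sketch (an assumed library choice, not something the article uses): a bidirectional LSTM simply runs one LSTM left to right, another right to left, and glues their outputs together.

# A minimal sketch of a bidirectional LSTM in PyTorch.
import torch
import torch.nn as nn

# 8-dimensional word vectors in, 16 hidden units per direction.
lstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

words = torch.randn(1, 5, 8)   # a batch containing one 5-word "sentence"
outputs, _ = lstm(words)

# Each position's output is the forward and backward states concatenated:
# 16 (left-to-right) + 16 (right-to-left) = 32 numbers per word.
print(outputs.shape)           # torch.Size([1, 5, 32])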

However, BERT models language in a new way. There is no need to train a language model from left to right when you don’t need it to generate sentences, so instead of predicting the next word after a sequence of words like standard language models do, BERT randomly masks (covers up) words in a sentence and predicts them.

This way, instead of only predicting the word ahead based on previous context, BERT learns more complex aspects of the language.

For example, given the sentence “It’s raining cats and dogs”, BERT could learn from each of the following masked versions of it:

  • “[MASK] raining cats and dogs” → It’s
  • “It’s [MASK] cats and dogs” → raining
  • “It’s raining [MASK] and dogs” → cats
  • “It’s raining cats [MASK] dogs” → and
  • “It’s raining cats and [MASK]” → dogs

This method forces the model to learn the context of the entire sentence.
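Here is what that masked-word prediction looks like in practice, in a minimal sketch assuming the Hugging Face transformers library and the publicly released ‘bert-base-uncased’ checkpoint:

# A minimal sketch of BERT's masked-word prediction using the transformers
# library's fill-mask pipeline and the public 'bert-base-uncased' checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the whole sentence with one word covered up and predicts it.
for prediction in fill_mask("It's raining cats and [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))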

BERT also uses a next sentence prediction task to pretrain the model for tasks in which understanding the relationship between two sentences is necessary, such as question answering. BERT is fed two sentences: 50% of the time the second sentence actually follows the first in the original text, and the other 50% of the time it has been randomly sampled, and the model must predict which of the two cases it is looking at.
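For the curious, here is a minimal sketch of that next sentence prediction head in use, again assuming the transformers library and the ‘bert-base-uncased’ checkpoint (the two example sentences are my own, purely for illustration):

# A sketch of BERT's next sentence prediction head via the transformers library.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

first = "She walked her dog in the park."
second = "The dog chased a squirrel up a tree."   # a plausible next sentence

inputs = tokenizer(first, second, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # shape (1, 2)

# Index 0 = "second sentence really follows the first", index 1 = "random".
print(torch.softmax(logits, dim=-1))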

After pre-training, BERT is used much like an ordinary neural network: it passes words through a stack of transformer layers (each playing a role analogous to a hidden layer) and returns an output vector for every word, which can then be fine-tuned for a specific task.
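As a rough sketch of what that output looks like, and to contrast it with the fixed word embeddings from earlier, the snippet below (again assuming the transformers library and ‘bert-base-uncased’, with my own example sentences) pulls out BERT’s vector for ‘bank’ in two different contexts. Unlike a static embedding, the two vectors are not the same.

# A sketch of BERT's per-word output vectors via the transformers library.
# Unlike a static word embedding, the vector for 'bank' depends on its context.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

money = bank_vector("They plan to rob the bank tomorrow.")
river = bank_vector("The water overflowed the bank of the river.")

# Cosine similarity well below 1: the two 'bank' vectors differ with context.
print(torch.cosine_similarity(money, river, dim=0))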


BERT’s innovative method has outperformed the previous best NLP models on the following tasks:

  • Language Understanding
  • Natural Language Inference
  • Paraphrase Detection
  • Sentiment Analysis
  • Linguistic Acceptability Analysis
  • Semantic Similarity Analysis
  • Textual Entailment

BERT not only outperforms traditional word-embedding approaches but also newer contextual methods like ELMo.


I hope this article has given you a general understanding of how BERT is able to perform so much better than other NLP models. Thanks for reading!
