Welcome BERT - State-of-the-Art Language Model for NLP by Google

Nikola Basta
Published in Arteos AI
May 31, 2020

One of the most significant breakthroughs in the search engine world since Google released RankBrain in 2015.

What is BERT?

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art model. It can be used in a wide variety of NLP tasks such as:

  1. Classification: sentiment analysis by adding a classification layer on top of the Transformer output.
  2. Q&A: the software receives a question and a passage and is required to mark the answer. Using BERT, the model marks the beginning and the end of the answer span.
  3. Named Entity Recognition (NER): the software receives a text and is required to mark the various types of entities, such as Person, Organization, Date, etc., that appear in the text.
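As a quick illustration of tasks 1 and 3, here is a minimal sketch using the Hugging Face transformers library (the library, its pipelines, and the default models it downloads are assumptions for illustration; they are not mentioned in the original post, though the defaults are BERT-family encoders with task-specific heads).

```python
# Minimal sketch: BERT-family models applied to classification and NER
# via Hugging Face pipelines (library choice is an assumption, not from the article).
from transformers import pipeline

# 1. Classification: sentiment analysis with a classification layer on top of the encoder output
classifier = pipeline("sentiment-analysis")
print(classifier("BERT makes search results noticeably better."))

# 3. Named Entity Recognition: mark Person, Organization, Date, etc. in a text
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Google open-sourced BERT in 2018 in Mountain View."))
```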

To dig deeper into BERT’s definition: it is Google’s latest neural network-based model for NLP pre-training, which was open-sourced for everyone last year. More information about this topic can be found on the Google A.I. blog.

BERT can help computers understand language a bit more like humans do.

Applying bidirectional training to language modeling is BERT’s critical technical innovation and, in contrast to previous models, a huge leap forward. A novel technique named Masked LM (MLM) makes this training possible and gives the model a deeper sense of language context and flow than single-direction language models.

How does BERT work?

The core of BERT is the use of the Transformer, an attention mechanism that learns contextual relations between the words in a text. The Transformer includes two separate mechanisms:

  1. an encoder that reads the text input, and
  2. a decoder that produces a prediction for the task.

Since generating a language model is BERT’s main goal, only the encoder mechanism is necessary and has been developed so far, which means that BERT on its own is still not suitable for creating chatbots and similar generative applications. To learn more about the Transformer, take a look at the original research paper by Google.

Legacy directional models read the text input sequentially, left-to-right or right-to-left. The BERT encoder reads the entire sequence of words at once, which is why it is considered bidirectional.

The input to the neural network is a sequence of tokens. The tokens are first embedded into vectors and then processed by the network.

The output of the neural network is a sequence of vectors of size H. Each output vector corresponds to the input token at the same position.
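A minimal sketch of these shapes, assuming PyTorch and the Hugging Face transformers library (neither is referenced in the original post); for the base BERT model the hidden size H is 768.

```python
# Minimal sketch: one output vector of size H per input token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The father is playing basketball with his son", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The token sequence includes the special [CLS] and [SEP] tokens.
print(inputs["input_ids"].shape)        # e.g. torch.Size([1, 10])
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 10, 768]), H = 768
```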

The biggest challenge in training the language model is defining a prediction goal. Almost all models before BERT predict the next word in a sequence (e.g., “The father is playing basketball with his ___”).

Context learning is limited by the directional approach.

BERT found a way to overcome these challenges with two training strategies:

  1. Masked LM (MLM)
  2. Next Sentence Prediction (NSP)

Masked LM

Masked LM starts by replacing 15% of the words in each sequence, picked at random, with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence. In technical terms, the prediction requires the following:

  1. Adding a classification layer on top of the encoder output
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
  3. Calculating the probability of each word in the vocabulary with softmax.
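A minimal sketch of Masked LM prediction with a pretrained BERT, using the Hugging Face fill-mask pipeline (the library is an assumption for illustration; the original post does not name any implementation).

```python
# Minimal sketch: predict the original value of a [MASK] token from context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the top candidates for the masked position together
# with their softmax probabilities over the vocabulary.
for prediction in fill_mask("The father is playing [MASK] with his son."):
    print(prediction["token_str"], round(prediction["score"], 3))
```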

The BERT loss function takes into consideration only the prediction of the masked values and ignores the rest. As a consequence, the model converges more slowly than directional models.

Next Sentence Prediction

In the training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair actually follows the first in the original document. For 50% of the inputs, the second sentence is the actual subsequent sentence from the original document, while for the rest it is a random sentence from the corpus. The assumption is that the model will learn to detect when the random sentence is disconnected from the first sentence.

How should the model distinguish between the two sentences? The input is processed in the following way:

  1. A [CLS] token is inserted at the beginning of the first sentence
  2. A [SEP] token is inserted at the end of each sentence
  3. A sentence embedding indicating Sentence A or Sentence B is added to each token
  4. Finally, a positional embedding is added to each token to indicate its position in the sequence
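A minimal sketch of how a sentence pair is packed into a single input, assuming the Hugging Face BertTokenizer (an illustrative choice, not part of the original post); the positional embeddings are added inside the model itself.

```python
# Minimal sketch: [CLS]/[SEP] special tokens and Sentence A/B segment ids.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("The father is playing basketball.", "His son is watching him.")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'the', 'father', ..., '[SEP]', 'his', 'son', ..., '[SEP]']
print(encoding["token_type_ids"])  # 0 for Sentence A tokens, 1 for Sentence B tokens
```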

To predict whether the second sentence is actually connected to the first, the following steps are performed:

  1. The entire input sequence goes through the Transformer model
  2. The [CLS] token’s output is transformed into a 2×1 shaped vector using a simple classification layer
  3. The probability of IsNextSequence is calculated with softmax
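A minimal sketch of these steps with a pretrained NSP head, assuming the Hugging Face BertForNextSentencePrediction class (the library and example sentences are assumptions added for illustration).

```python
# Minimal sketch: Next Sentence Prediction from the [CLS] output.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The father is playing basketball with his son."
sentence_b = "They go to the court every Sunday morning."

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 2], produced from the [CLS] output

probs = torch.softmax(logits, dim=-1)
print(probs)  # index 0: probability that sentence B actually follows sentence A
```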

During training, Masked LM and Next Sentence Prediction are trained together, minimizing the combined loss function of the two strategies.

Let’s take a look at some examples.

Google said BERT helps Search better understand the context of words in queries and match those queries with more relevant results.

Example 1:

Query: “2019 brazil traveler to usa need a visa”

The word “to” and its relationship to the other words in the query are important for understanding the meaning. Previously, Google wouldn’t understand the importance of this connection and would return results about U.S. citizens traveling to Brazil.

With this new model, Search can grasp this and know that the very common word “to” actually matters a lot here, so it can provide a much more relevant result for this query.

Example 2:

Query: “do estheticians stand a lot at work”

Previously, it would have matched the term “stand-alone” with the word “stand” used in the query. BERT models can understand that “stand” is related to the concept of the physical demands of a job, and display a more useful response.

Example 3:

Query: “Can you get medicine for someone pharmacy”

With BERT, Google can understand this query more like a human would and show a more relevant result.

Featured snippet example:

Query: “parking on a hill with no curb”

Google said: “We placed too much importance on the word ‘curb’ and ignored the word ‘no’, not understanding how critical that word was to appropriately responding to this query.”

Why should we care about this?

Why should we, really, and how will this affect our lives?

*BERT affects SEO*

This huge shift will definitely have an impact on SEO. Keep in mind, though, that BERT will not help any poorly written website. After this change, content creators should focus more on long-tail terms in their content, but not exclusively; they should mix short and long keywords for the best results. Work on quality and context, and treat the length of your keywords as secondary. Next, create in-depth content that explores topics and details with high added value for the audience. This way, BERT will have more information to process and can better determine the context of your site.

*BERT affects Content Marketing*

Delivering high content quality is now more relevant than ever. Targeting your readers with quality information is being put to the test by Google and BERT, which work hard to understand the context and intent of your message and of user search queries. Content writers must write for people and not for search engines, as it should have been from the start. Content should be “more human”, as Google stated, using the right terms, tone, and language that closely match your audience.

*BERT affects your Website*

Your organic traffic will slump, especially when it comes to featured snippets and voice search. Voice search is the new must-have in SEO and digital marketing, and as technology advances further it will become even more important. With BERT and Google’s push to nurture voice search, we should all become content- and user-intent-centric. Your efforts should go toward content quality improvements: more current, relevant, and actionable.

*BERT affects Advertisers*

After SEO and content marketing, we can expect this influence to extend to pay-per-click advertising as well. After all, your current strategy relies heavily on keywords and user intent. PPC advertisers should now shift their focus to Dynamic Search Ads, with the idea of leveraging Google’s solutions to create better and more relevant ads. Over time, with more data, you will have more relevant dynamic ads that work well and are better recognized by BERT, which will influence your ROI and PPC campaigns. Apart from this, the model can be applied to other languages as well: if you have a successful PPC campaign for the US market, the model will learn and improve featured snippet results in other countries and languages.

Also, keep in mind that BERT didn’t replace the older RankBrain; rather, it is an additional method for understanding content and user queries. Google explained that there are many ways it can understand what the language in your query means, how it relates to content on the web, and what you are looking for.

Conclusion

Google is getting smarter.

Without a doubt, BERT is a breakthrough in the use of machine learning for natural language processing, and since it is open-sourced and allows fast fine-tuning, the range of practical applications is endless. If you want to dive deeper into the tech and code, take a look at the BERT source code, which comes with pretrained models covering more than 100 languages.

We have to be aware that we are living in the age of global automation and voice search optimization. Marketers need to respond quickly and follow the innovations, or they will be left in the dust by competitors. Maybe for the first time in human and machine history, an algorithm has a chance to understand what people mean, even when they use complex and often confusing phrases. A great change for users, and new milestones and targets to reach for content creators, SEO professionals, and PPC advertisers.

A customer-centric attitude, with more useful, relevant web pages and dynamic paid ads focused on delivering the right information, is the future we are heading toward.

Until next time,

Nikola Basta

For more interesting topics, follow me on LinkedIn.
