BERT: Google’s bidirectional transformer

Jan Perina
knowledge-engineering-seminar
11 min read · May 21, 2020

It is commonly known that machine translation, text summarization and other natural language processing (NLP) tasks are quite challenging in the field of machine learning, largely because of the ambiguity of text and the different ways languages have evolved (syntax).

[Figure 1] Recurrent neural network

For a long time, approaches such as convolutional neural networks (CNN) and bidirectional recurrent neural networks (BRNN) were used, but nowadays even more sophisticated methods are employed for such tasks, and one of them is the attention mechanism.

[Figure 2] Convolutional neural network for classification

Attention (!)

Attention, as described in the Attention is all you need paper, maps a query and a set of key-value pairs to an output. The paper builds the mechanism out of two pieces: Scaled dot-product attention and Multi-head attention.

Scaled dot-product

This method takes queries and keys of dimension dₖ and values of dimension dᵥ. The attention weights are computed from the dot products of the queries with the keys, scaled by √dₖ, and are then used to form a weighted sum of the values.

[Figure 3] Scaled dot-product
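
Concretely, the paper defines the output as softmax(QKᵀ/√dₖ)V. A minimal NumPy sketch of that formula (variable names are my own) might look like this:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # compare every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the values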

Multi-head attention

Multi-head attention, on the other hand, runs h Scaled dot-product attentions in parallel over learned linear projections of the input queries, keys and values, and then concatenates and linearly projects their outputs.

[Figure 4] Multi-head attention
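
Building on the function above, a rough sketch of multi-head attention could look like this, with random matrices standing in for the learned projection weights:

def multi_head_attention(Q, K, V, h=8, d_model=512):
    rng = np.random.default_rng(0)
    d_k = d_v = d_model // h
    heads = []
    for _ in range(h):
        # learned projection matrices in a real model; random stand-ins here
        W_q = rng.normal(size=(Q.shape[-1], d_k))
        W_k = rng.normal(size=(K.shape[-1], d_k))
        W_v = rng.normal(size=(V.shape[-1], d_v))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.normal(size=(h * d_v, d_model))         # final output projection
    return np.concatenate(heads, axis=-1) @ W_o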

BERT

BERT stands for Bidirectional Encoder Representations from Transformers, and by default comes in two versions:

BASE version has:
• 12 layers (Transformer blocks)
• a hidden size of 768
• 12 self-attention heads
• 110M parameters in total

LARGE version has:
• 24 layers (Transformer blocks)
• a hidden size of 1024
• 16 self-attention heads
• 340M parameters in total
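
If you want to double-check those parameter counts yourself, a quick sanity check with the Hugging Face transformers library (assuming it and PyTorch are installed) looks roughly like this:

from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")  # roughly 110M and 340M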

The Transformer is a model used primarily in natural language processing. It is designed for sequence-to-sequence tasks and learns representations of both the input and the output without relying on recurrence or convolution. The Transformer consists of a stack of encoders and a stack of decoders.

[Figure 5] 1 stacked transformer

The encoder is composed of N identical layers (N = 6 in the original Transformer; BERT stacks 12 encoder layers in BASE and 24 in LARGE), where each layer has two sub-layers: a multi-head self-attention, followed by a position-wise feed-forward network.

The decoder is likewise composed of N identical layers with the same two sub-layers, plus a third sub-layer that performs multi-head attention over the output of the encoder stack. (BERT itself uses only the encoder stack.)

Self-attention is multi-head attention in which the queries, keys and values all come from the same sequence, so every token can attend to every other token in it.
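
In code terms, self-attention just means passing the same sequence in as the queries, keys and values, e.g. with the multi_head_attention sketch from above:

x = np.random.default_rng(1).normal(size=(10, 512))  # 10 tokens, d_model = 512
self_attended = multi_head_attention(x, x, x)        # every token attends to every token
print(self_attended.shape)                           # (10, 512)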

The model takes an input sequence (x₁, x₂, …, xₙ) and transforms it via its encoders into a continuous sequence representation z, which is then fed into the decoder. Thanks to the attention mechanism, the model no longer encodes the whole input into one fixed-size vector, but rather works with parts of the input at each step. The model is also auto-regressive, which means that each prediction is based on both the input and the previous predictions. The combination of all those methods provides more solid results, because context plays a much bigger role here.

[Figure 6] BERT architecture

Since BERT's layers utilize bidirectional connections (see the figure above), it is not restricted to seeing only one side of a token's context, which may be the biggest bottleneck of other pre-trained language models; this makes its results more accurate.

There are many BERT offspring, such as ALBERT: A Lite BERT, RoBERTa: A Robustly Optimized BERT Pretraining Approach, and others.

Training

BERT is trained in quite an unusual way: it uses the masked language modelling (MLM) objective, which takes the input sequence and masks out some of its tokens. The input looks something like this:

[Figure 7] Masked input sentence from the original BERT paper

Thanks to that, BERT can exploit its bidirectional attention even more, making it more robust in tasks where the input sentence may be incomplete.
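
As a small illustration, a masked token can be filled in with a pre-trained BERT from the Hugging Face transformers library. This is only a rough sketch; the checkpoint name and the sentence (which loosely echoes the paper's example) are my own choices:

import tensorflow as tf
from transformers import BertTokenizer, TFBertForMaskedLM

mlm_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm_model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

masked = "The man went to the [MASK] to buy a gallon of milk."
enc = mlm_tokenizer.encode_plus(masked, return_tensors="tf")
logits = mlm_model(enc)[0]                           # (1, sequence_length, vocab_size)
mask_pos = int(tf.where(enc["input_ids"][0] == mlm_tokenizer.mask_token_id)[0][0])
predicted_id = int(tf.argmax(logits[0, mask_pos]))
print(mlm_tokenizer.decode([predicted_id]))          # a plausible word, e.g. "store"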

After the base pre-training ends, the model can be "fine-tuned" for other specific tasks, such as POS-tagging, the tasks from the GLUE benchmark, and many others.

POS-tagging stands for Part-Of-Speech tagging. In practice, the model classifies each token based on its context and determines whether the token is a noun, adjective, conjunction, verb or something else…
Some taggers can even determine plurality, the type of an adverb and many other features.

Fine-tuning essentially means taking the originally trained weights of a model and reusing them in a new model with a different output layer, so the model can adapt to domain-specific tasks. Thanks to this method, training for such a task requires a significantly smaller dataset, since the model is not forced to learn "from scratch".
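
With the transformers library, fine-tuning for a two-class task such as SST-2 could start roughly like this: the encoder weights come from the pre-trained checkpoint and only the small classification head on top is initialised from scratch (the hyper-parameters below are illustrative):

import tensorflow as tf
from transformers import TFBertForSequenceClassification

# loads pre-trained encoder weights and adds a freshly initialised 2-class output layer
clf = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
clf.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # small learning rate, as usual for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# clf.fit(train_dataset, epochs=3)  # train_dataset: tokenized (input, label) pairs, not shown here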

Measuring the success

As written above, BERT is able to provide more accurate results than some other methods. But how are the results measured?

One way to measure the "greatness" of a model is the General Language Understanding Evaluation (GLUE) benchmark, which is a set of nine challenging NLP tasks. Let's take a look at them.

The final score is computed as an average score of all disciplines below.

  • The Semantic Textual Similarity Benchmark (STS-B)
    This is the only regression task here. The goal is to predict how similar the two input sentences are on a scale from 1 to 5. Those scores are then used to compute the Pearson and Spearman correlation coefficients, which represent the similarity. You can find more about it here.
  • The Corpus of Linguistic Acceptability (CoLA)
    This task is based on classifying the acceptability (grammaticality) of an input sentence. The whole dataset consists of 40657 sentences from 23 publications, but a smaller public subset, split into 9594 training sentences and 1063 test sentences, is used for the evaluation. The result is calculated as the Matthews correlation coefficient (MCC), which is based on the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); see the sketch after this list.
[Figure 8] MCC formula based on (True/False) Negatives and (True/False) Positives
  • The Stanford Sentiment Treebank (SST2)
    This task is about classifying whether the sentiment of a sentence is positive or not. The dataset contains mostly movie and restaurant reviews and the final score is calculated as accuracy.
  • The Microsoft Research Paraphrase Corpus (MRPC)
    This task is about classifying whether or not two sentences are semantically equivalent. Since the classes in the dataset are imbalanced, the result combines accuracy and the F1-score (also sketched after this list).
[Figure 9] F1-score formula
  • The Quora Question Pairs (QQP)
    The goal here is to determine whether provided questions have the same meaning. Since the negative class is dominant, the result is also accuracy and F1-score.
  • The Multi-Genre Natural Language Inference Corpus (MNLI)
    This task uses a crowd-sourced collection of sentence pairs gathered from various sources, such as transcribed speech, fiction and government reports. Given two sentences (a premise and a hypothesis), the task is to predict whether the premise entails the hypothesis, contradicts it, or neither. Results in this and all subsequent tasks are measured as accuracy.
  • The Stanford Question Answering Dataset (QNLI)
    Here, GLUE turns question answering into a classification task: given a question and a sentence, decide whether the sentence contains the answer to the question.
    I find this particular discipline quite interesting since it combines understanding the text with deciding whether the answer is present at all.
  • The Winograd Schema Challenge (WNLI)
    This task uses pairs of sentences: the first contains an ambiguous pronoun, and the second is created by replacing that pronoun with each of the nouns from a pre-defined list. The task is to predict whether the original sentence entails the modified one.
  • The Recognizing Textual Entailment (RTE)
    The task is to decide whether the first sentence entails the second, testing for textual entailment rather than logical entailment.
    Textual entailment has a more relaxed definition: it only asks whether a human reading the first sentence would infer that the second one is most likely true.
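
As promised in the CoLA and MRPC items above, here is a minimal sketch of the two non-accuracy metrics, computed directly from the confusion-matrix counts (the numbers in the last line are made up, just to show the call):

import math

def mcc(tp, tn, fp, fn):
    '''Matthews correlation coefficient, used for CoLA.'''
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

def f1(tp, fp, fn):
    '''F1-score, the harmonic mean of precision and recall, used for MRPC and QQP.'''
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(mcc(tp=50, tn=30, fp=10, fn=10), f1(tp=50, fp=10, fn=10))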

In 2019, the BERT model achieved a higher score than any other state-of-the-art model.

[Figure 10] GLUE score comparison from original BERT paper

The current GLUE leaderboard can be found here.

Example Time

[Figure 11] Example time inspired by Adventure time logo

Since BERT scored high in the question-answering tasks, I decided to create my own BERT-based personal answering bot in Python.

So, how can we create such a thing?

First of all, we need to figure out how we can obtain information related to our question. What do we do when we're looking something up?

[Figure 12] Google search

That’s right, we use a search engine…

Search engines constantly crawl all kinds of websites, at best the whole Internet, and index their content, so we can use them to query our question and get back related websites that might contain the answer.

[Figure 13] Average count of clicks on Google search result based on a position

From an analysis of 5 million Google search results, we can see that the first ten result pages are probably the most relevant to the query. So at first I decided to work with those ten pages, but since there were a lot of paragraphs, the computation was quite slow. I decreased the number of pages to five, which improved the speed of my answering engine considerably.

''' Obtaining question-related pages '''
from googlesearch import search

question = "What does the fox say?"

urls = [uri for uri in search(question,  # input text
                              tld='com',  # top-level domain
                              lang='en',  # language to query by
                              start=0,    # index of the first result returned
                              stop=5)]    # index of the last result returned

In the next step, we need to extract the text from the pages, but we can't simply strip the HTML tags and use the whole text as is. The main problem is that web pages contain a lot of irrelevant text that might distract our answering engine.

One such example is a cookie-policy pop-up, which contains a lot of words and is completely useless to us.

So, given that, I decided to extract only the paragraphs of each website as content and then filter out paragraphs with fewer than 100 characters. That gives us a sufficient amount of text to search for an answer in.

''' Extracting paragraphs '''
from lxml import html
import re
import requests

text = []
for i in range(len(urls)):
    content = requests.get(urls[i]).text
    paragraphs = html.fromstring(content).findall('.//p')  # all <p> elements
    text += [  # removing non-word characters using a regexp
        re.sub(r"\w*[^0-9a-zA-Z.,;' ]\w*", "", p.text_content())
        for p in paragraphs if len(p.text_content()) > 100][:5]

Since the questions I’m querying are not complex, I was able to obtain the answer in first few paragraphs. I’ve also limited the number of paragraphs used, so the engine works even faster.

In the next step, BERT helps us find the answer in the text.

[Figure 14] BERT for question answering

I used a pre-trained large BERT which had also been fine-tuned on SQuAD in English.

This model takes a pair consisting of the question and a reference text, which should contain the answer, and returns vectors of start and end scores for each token.

The higher the score, the higher the probability that the token is the starting/ending token of the answer. So, after BERT evaluates the text, we just find the token with the highest score in each vector and take the text between them.

from transformers import BertTokenizer, TFBertForQuestionAnswering
import tensorflow as tf

# pre-trained large BERT fine-tuned on SQuAD (checkpoint name assumed here)
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForQuestionAnswering.from_pretrained(model_name)

# encode input (question: the query string, text: one reference paragraph)
inputs = tokenizer.encode_plus(question,
                               text,
                               add_special_tokens=True,
                               return_tensors="tf")
# get inputs as array
input_ids = inputs["input_ids"].numpy()[0]
# convert input into tokens
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
# evaluate scores
answer_start_scores, answer_end_scores = model(inputs)
# get indexes of most-likely beginning and end
answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
answer_end = (tf.argmax(answer_end_scores, axis=1) + 1).numpy()[0]
# detokenize answer
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(
        input_ids[answer_start:answer_end]
    ))
print(f"Question: {question}")
print(f"Answer: {answer}\n")

The only limitation is that BERT can process inputs with a maximal length of 512 tokens. I solved this with a batch-like approach: I simply evaluated each paragraph, took the best answer and its respective start and end scores, and then compared it to the others to return the one with the best score.

def get_score(question, text):
    '''
    BERT evaluation from the cell above
    '''
    start = tf.math.reduce_max(answer_start_scores, axis=1).numpy()[0]
    end = tf.math.reduce_max(answer_end_scores, axis=1).numpy()[0]

    return start, end, answer

This is used together with the following function:

import progressbar

def ask(question, results=1):
    urls = [uri for uri in
            search(question,
                   tld='com',
                   lang='en',
                   start=0,
                   stop=5)]
    text = []
    scores = []
    print("Searching the internet...")

    with progressbar.ProgressBar(max_value=10) as bar:
        '''
        paragraph extraction
        '''
        bar.update(i)

    print("Looking for an answer...")

    for p in text:
        scores.append(get_score(question, p))

    ranked = sorted(scores, key=lambda x: x[1], reverse=True)[:results]
    answers = [x[2] for x in ranked]
    return answers

And the final result looks like this:

[Figure 15] The example in action

The whole example in the form of a notebook can be found here.

I also thought about some improvements that might help the evaluation speed. For example, early stopping: evaluate the paragraphs from the first page, save the results, then evaluate the next page and check whether it yields answers with a better score, stopping once it doesn't. Nevertheless, for example purposes I like the naivety of the current solution much more.
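
For the curious, that early-stopping idea could be sketched roughly like this, reusing get_score from above (the pages structure and the score threshold are hypothetical):

def ask_with_early_stopping(question, pages, min_score=5.0):
    '''pages: list of lists of paragraphs, one inner list per search result.'''
    best = (float("-inf"), None)           # (end score, answer)
    for paragraphs in pages:
        for p in paragraphs:
            start, end, answer = get_score(question, p)
            if end > best[0]:
                best = (end, answer)
        if best[0] >= min_score:           # good enough answer found, skip the remaining pages
            break
    return best[1]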

Final thoughts

I personally think that including bidirectional context helped to solve some NLP tasks and significantly changed how they are approached, making them more efficient to solve, but at the same time some tasks seem harder or nearly impossible; for example, text generation is still a big problem for BERT.

Sources

How do transformers work in NLP?
BERT paper
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
BERT explanation video
Attention is all you need

