State-of-the-art AI solutions: (1) Google BERT, an AI model that understands language better than humans

AI Network Dev Team
Published in AI Network
Feb 1, 2019 · 8 min read

Recently, artificial intelligence has increasingly been applied to business and daily life, most notably in finance, healthcare, HR, music, and publishing (link). Its success in these areas has largely been driven by innovative AI solutions to real-world problems.

To help you keep up to date with the latest trends in AI solutions, we here at AI Network are working on a series of blog posts entitled “State-Of-The-Art AI Solutions.” Today, in the first post of the series, we will be talking about BERT (Bidirectional Encoder Representations from Transformers), an AI language model developed by Google, which has recently gained public attention by showing better accuracy than humans in some performance evaluations (link).

The BERT dev team first published their original paper on October 11, 2018. About three weeks later, they released their source code and pre-trained model as promised (link). Interestingly, NLP developers and researchers alike expressed a mix of excitement and concern over BERT’s initial study.

There was excitement surrounding BERT’s impressive performance metrics, but BERT also raised concern in the NLP community because of the dev team’s unconventional approach. Whereas conventional NLP models prioritize either flexibility or performance, the BERT dev team used a general-purpose model architecture and universal training data to increase both simultaneously, achieving this by investing as many machine resources as possible in their model. To reach BERT-level performance, it seemed that those studying AI language models would naturally have to re-examine their approach and reconsider how to set up their research environments.

In this article, we will briefly review the approaches, ideas, and performance evaluation results of BERT. We will also examine what messages these results send to the natural language processing community and AI developers, and how this relates to AI Network.

What kind of solution is BERT?

Approach

The BERT developers’ approach, which can be read in this paper (link) and this Reddit post (link), can be summarized as (1) designing a general-purpose solution, (2) implementing it in a scalable way, and (3) building models with as many machine resources as possible to maximize performance.

In the BERT paper, 11 NLP tasks were used to evaluate performance, and only one pre-trained model was used across all of them. This deviates from the common practice before BERT, where a task-specific model was created for each task. Instead, the pre-trained BERT model was fine-tuned before performing each new task.
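To make the pre-train/fine-tune workflow concrete, here is a minimal sketch of fine-tuning a pre-trained BERT model for a two-class sentence classification task. It uses the third-party Hugging Face transformers library rather than the BERT team’s original TensorFlow release, and the texts, labels, and hyperparameters are placeholder values, so treat it as an illustration of the idea rather than the official recipe.

```python
# Minimal fine-tuning sketch: one pre-trained BERT encoder plus a small task head.
# Assumes the third-party Hugging Face `transformers` library and a toy dataset.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy task data (placeholder); in practice this would be a GLUE-style dataset.
texts = ["the movie was great", "the movie was terrible"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)  # small LR: only nudge the pre-trained weights

model.train()
for _ in range(3):  # a few epochs of fine-tuning is typically enough
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
```

Only the small classification head is new; all other weights start from the pre-trained checkpoint, which is why fine-tuning is cheap compared with pre-training.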

The BERT team’s model architecture is based on the Transformer, a general-purpose deep-learning architecture that Google released in 2017. One of the biggest advantages of the Transformer is its increased training speed, achieved through parallel processing (link). The sizes of the BERT models used in the paper were as follows: BERT-Base has 12 Transformer layers, a hidden size of 768, and 12 attention heads (about 110M parameters), while BERT-Large has 24 layers, a hidden size of 1,024, and 16 attention heads (about 340M parameters).

An ANN of hundreds of millions (110M, 340M) of parameters is huge in comparison to other popular solutions. To put these numbers in context, AlphaGo’s Policy Network (link) has about 4.6M parameters, and ResNet-50 (link), a popular ANN for image recognition, is known to consist of about 25M parameters.
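As a rough sanity check on those parameter counts, the numbers can be approximated from the standard Transformer encoder layout: roughly 4H² weights in each layer’s self-attention and 8H² in its feed-forward block, plus a token-embedding matrix of vocabulary size times H. The sketch below is a back-of-the-envelope estimate only; it ignores biases, layer norms, and positional/segment embeddings, which is why it lands slightly below the official 110M and 340M figures.

```python
# Back-of-the-envelope parameter estimate for a Transformer encoder stack.
# Ignores biases, layer norms, and positional/segment embeddings.
VOCAB_SIZE = 30_522  # WordPiece vocabulary used by BERT

def approx_params(num_layers: int, hidden: int) -> int:
    per_layer = 4 * hidden**2 + 8 * hidden**2   # self-attention + feed-forward weights
    embeddings = VOCAB_SIZE * hidden            # token embedding matrix
    return num_layers * per_layer + embeddings

print(f"BERT-Base  ~{approx_params(12, 768) / 1e6:.0f}M parameters")   # ~108M
print(f"BERT-Large ~{approx_params(24, 1024) / 1e6:.0f}M parameters")  # ~333M
```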

On top of this architecture, they used two well-known universal corpora as pre-training data: BooksCorpus (800M words) and English Wikipedia (2,500M words). If task-specific data were also used for each task, we could expect much better results than those published in the paper.

The size of the machine resources used will be discussed further in the Source & Resource section.

* The material presented in this article refers mostly to the BERT paper (link) and Reddit post (link), including figures and tables. For more information, please refer to the original texts.

Key Features

Bidirectional Model

Fig. 1 (excerpt from the paper) shows the pre-training model architectures used in BERT, OpenAI GPT, and ELMo. While OpenAI GPT uses a left-to-right Transformer and ELMo combines independently trained left-to-right and right-to-left models, BERT uses a single bidirectional model that sees both the left and right context simultaneously.
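One way to picture the difference is through the self-attention mask each model uses: a left-to-right model only lets position i attend to positions up to i, while BERT’s encoder lets every position attend to every other position. The toy snippet below just prints the two mask shapes for a 5-token sequence; it is an illustration, not code from any of the three systems.

```python
import numpy as np

seq_len = 5

# Left-to-right (causal) mask: position i may only attend to positions <= i.
left_to_right_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Bidirectional mask (BERT): every position may attend to every position.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

print("left-to-right:\n", left_to_right_mask)
print("bidirectional:\n", bidirectional_mask)
```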

Masked LM

Fig 2. Masked LM (excerpt from the paper).

If only the deep bidirectional model described above were used, each word could indirectly “see itself” through the two-sided context, creating a kind of cycle inside the model and making training itself meaningless. Generally, when a network has enough hidden capacity relative to its inputs, the learned result is likely to collapse into an identity function, i.e., the model simply outputs the same values as its input values. To address this problem, the BERT developers devised a technique called Masked LM (see Fig. 2), inspired by denoising autoencoders, which corrupt the input data on purpose (link). Concretely, 15% of the tokens in each input sequence are randomly selected for prediction; each selected token is replaced with [MASK] with a probability of 80%, replaced with a random word with a probability of 10%, or left unchanged with the remaining probability of 10%.
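The corruption rule itself is easy to express in code. Below is a minimal sketch of the 15% selection and 80/10/10 replacement scheme described above, written against a plain Python token list and a toy vocabulary; it illustrates the scheme and is not the BERT team’s actual preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Apply the Masked LM corruption rule to a list of tokens.

    Returns the corrupted tokens plus (position, original_token) targets
    that the model must learn to predict.
    """
    corrupted, targets = list(tokens), []
    for i, token in enumerate(tokens):
        if random.random() >= select_prob:
            continue                              # token not selected for prediction
        targets.append((i, token))                # model must recover the original
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN             # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

# Toy usage
vocab = ["my", "dog", "is", "hairy", "cute", "the"]
print(mask_tokens(["my", "dog", "is", "hairy"], vocab))
```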

Next Sentence Prediction

Fig 3. Next sentence prediction (excerpt from the paper).

Many of the tasks used for performance evaluation require an understanding of the relationship between two sentences. To capture this, the BERT developers added a second pre-training task that predicts such relationships. As shown in Fig. 3, sentence pairs were extracted from the monolingual corpus: pairs whose sentences actually appear consecutively were labelled IsNext, while pairs whose second sentence was drawn at random were labelled NotNext.
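Building the training pairs for this task is similarly simple. The sketch below constructs a roughly 50/50 mix of IsNext and NotNext examples from a list of documents (each a list of sentences); it is illustrative only and does not mirror the data pipeline in the released BERT code.

```python
import random

def make_nsp_examples(documents):
    """Build next-sentence-prediction pairs from a list of documents.

    Each document is a list of sentences. For every consecutive sentence
    pair, emit either the true continuation (IsNext) or a random sentence
    from a different document (NotNext), with 50/50 probability.
    Assumes at least two documents.
    """
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if random.random() < 0.5:
                examples.append((sentence_a, doc[i + 1], "IsNext"))
            else:
                other_doc = random.choice([d for d in documents if d is not doc])
                examples.append((sentence_a, random.choice(other_doc), "NotNext"))
    return examples

# Toy usage
docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live mostly in the southern hemisphere"],
]
for example in make_nsp_examples(docs):
    print(example)
```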

Performance Evaluation

GLUE

The GLUE (General Language Understanding Evaluation) dataset is a collection of various natural language processing tasks designed to objectively compare and evaluate natural language processing solutions.

Table 1 (excerpt from the paper) shows evaluation results for the GLUE dataset. In summary, BERT-Base and BERT-Large showed better results than the best known solutions across all tasks, improving the average performance by 4.4% and 6.7%, respectively.

SQuAD

The Stanford Question Answering Dataset (SQuAD) consists of over 100k question/answer pairs collected through crowdsourcing. Given a paragraph and a set of questions, the task is to find each answer as a span of text within the paragraph.
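To make the task format concrete, here is a simplified, made-up example in the spirit of SQuAD; the field names are abbreviated from the actual SQuAD JSON schema, and the model’s job is to predict the start and end positions of the answer span inside the context.

```python
# A simplified SQuAD-style example: the answer is a span of the paragraph.
# Field names are abbreviated from the actual SQuAD JSON schema.
example = {
    "context": (
        "BERT was released by Google in 2018. It is pre-trained on "
        "BooksCorpus and English Wikipedia."
    ),
    "question": "What is BERT pre-trained on?",
    "answer": {
        "text": "BooksCorpus and English Wikipedia",
        "answer_start": 58,  # character offset of the answer span in the context
    },
}

# The model predicts a start and an end position inside the context;
# the text between them is returned as the answer.
start = example["answer"]["answer_start"]
end = start + len(example["answer"]["text"])
assert example["context"][start:end] == example["answer"]["text"]
```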

Table 2 (excerpt from the paper) shows the evaluation results for SQuAD. In summary, BERT outperformed the best known systems in both the ensemble and single-model settings. BERT also outperformed humans.

CoNLL-2003

The CoNLL-2003 dataset consists of 200k training words, each annotated as Person, Organization, Location, Miscellaneous, or Other (not a named entity).

Table 3 (excerpt from the paper) shows the results of CoNLL-2003. You can see that BERT-Large showed better results than other systems.

SWAG

The Situations With Adversarial Generations (SWAG) dataset consists of 113k sentence-pair completion examples. Given an initial sentence, the task is to pick the most plausible continuation from four choices.

Table 4 (excerpt from the paper) shows the evaluation results for the SWAG dataset. The results show that BERT-Large outperforms ESIM + ELMo by 27.1%. BERT-Large also outperformed human experts.

Training Steps and Performance

To see the effect of the number of training steps on performance, the BERT dev team compared accuracy across various numbers of pre-training steps for the BERT-Base model, using the Masked LM and Left-to-Right objectives respectively.

The above graph (excerpt from the paper) shows their findings. The implications of this experiment can be summarized as follows:

  1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? (See the quick calculation after this list for a sense of that scale.)
    Answer: Yes (as you can see in the convergence graph).
  2. Question: Does the pre-training of a Masked LM model converge more slowly than the pre-training of a Left-to-Right model?
    Answer: Yes (see the graph). However, in terms of absolute performance, Masked LM always outperformed the Left-to-Right model, except at the very beginning of training.
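For a sense of scale, the quoted pre-training budget of 128,000 words per batch over 1,000,000 steps works out to roughly 128 billion words, or just under 40 passes over the combined 3.3-billion-word BooksCorpus + English Wikipedia corpus:

```python
# Rough scale of BERT's pre-training, using the figures quoted above.
words_per_batch = 128_000
steps = 1_000_000
corpus_words = 800e6 + 2_500e6               # BooksCorpus + English Wikipedia

total_words_seen = words_per_batch * steps   # 1.28e11 words
epochs = total_words_seen / corpus_words     # ~39 passes over the corpus
print(f"~{total_words_seen / 1e9:.0f}B words seen, ~{epochs:.0f} epochs")
```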

Source & Resource

The BERT dev team said that it took about four days to complete model pre-training. The machine specs they used are as follows:

  • BERT-Base: 4 Cloud TPUs (16 TPU chips total)
  • BERT-Large: 16 Cloud TPUs (64 TPU chips total)

This is a huge amount of hardware, and the BERT team says that pre-training could have taken more than a year if they had used only GPUs like the Tesla P100.

As mentioned earlier, the basic philosophy of the BERT developers is:

  • designing a general-purpose solution
  • implementing it in a scalable fashion
  • building models with as many machine resources as possible to maximize performance.

The success of this approach implies an important shift in the NLP research and development field: developing NLP AI solutions no longer belongs exclusively to NLP experts. Instead, it is now open to anyone with the right coding skills and the ability to manage machine resources.

Another noteworthy point is the positive relationship between model size (or the amount of resources used) and performance, which is easy to see in the gap between BERT-Base and BERT-Large. It indicates that AI solutions are becoming increasingly dependent on machine resources, and this dependence will only grow more critical to the competitiveness of research organizations and companies going forward.

As mentioned earlier, since the BERT dev team have already released their source code and pre-trained model to the public (link), anyone who is interested can now use BERT. However, this doesn’t mean that anyone can train a new BERT model; only those who have adequate resources can do so.

This is where AI Network would like to contribute. The ultimate goal of AI Network is to (1) provide everyone interested with cost-effective resources and (2) provide a platform (the Open Resource Platform) that connects users’ source code to well-fitting runtime environments. For a more detailed introduction to Open Resource, read the following post:

That’s it for today’s post. We hope you stay tuned to this channel. Thank you.

Links
