NLP Language Models BERT, GPT-2/3, T-NLG: Changing the Rules of the Game

Vineet Jaiswal
Published in Analytics Vidhya · 7 min read · Aug 19, 2020

Summary: key concepts behind the capabilities of popular language models

We are all aware of the current revolution in the field of Artificial Intelligence (AI), and Natural Language Processing (NLP) is one of its major contributors.

For NLP-related tasks, where we build techniques for human-computer interaction, we first develop a language-specific understanding in our machine so it can extract some context out of the training data. This mirrors the first basic step in parenting: our babies first understand the language, and only then do we gradually give them more complex tasks.

In the conventional world, we need to nurture each baby individually. On the other hand, if you take the example of any subject like Physics, a lot of people have contributed so far, and we have a predefined ecosystem of books and universities to pass the earned knowledge on to the next person. Our conventional NLP language models were like the first case: everyone needed to develop their own language understanding using some technique, and no one could leverage the work of others. The computer vision side of AI already achieved this reuse using the ImageNet object dataset. This concept is called Transfer Learning. As per Wikipedia:

Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

To reduce a lot of repetitive, time-intensive, costly and compute-intensive work, many major companies were working on such language models, where others could leverage their language understanding. BERT from Google was the major defining moment that changed this industry; before that, the popular ones were ELMo and GPT.

Before going into these language models, we must understand a few key concepts.

Embedding: we know that most of our algorithms can't understand native languages, so we need to provide some numerical representation, and embedding does exactly that: it creates numerical representations of the text. It can be a simple count-based embedding like TF-IDF, a prediction-based one, or a context-based one. Here we are only focused on context-based embeddings.
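As a quick illustration of the count-based side, here is a minimal TF-IDF sketch using scikit-learn (my choice of library, not something from the article); note how the same word always gets the same representation, regardless of context:

```python
# Minimal count-based embedding (TF-IDF) with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this painting is pretty ugly",
    "this watch is pretty",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf.toarray())                     # one static vector per document
# "pretty" gets the same weight pattern in both sentences, which is exactly
# the limitation that context-based embeddings (ELMo, BERT) address.
```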

Number of parameters of a neural network: all our language models use this term as a capacity metric, and a higher number of parameters is generally assumed to mean a more accurate model. The parameters are typically the weights of the connections, learned during the training stage.
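As a small illustration (assuming PyTorch, which is my choice here), the parameter count is simply the number of learnable weights in the network:

```python
# Counting trainable parameters of a tiny model in PyTorch.
import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(30000, 768),  # token embedding table
    nn.Linear(768, 768),       # one dense layer
    nn.Linear(768, 2),         # classification head
)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params:,} trainable parameters")  # roughly 23.6 million here
```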

Transformers

This is where the story changed. This is not a CNN or an RNN; this is something totally different. Let's go into more detail.

Suppose your model goes through this tweet:

https://twitter.com/narendramodi/status/1234500451850018818

Now, your model may be confused about whether Narendra Modi is telling these social media companies something like 'I quit' or updating his followers about his decision. Even a person with basic English knowledge can be confused if they don't pay attention, I repeat, 'attention'. This is the key concept from which the Transformer architecture evolved: it creates a 'self-attention' layer while reading the whole corpus.
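To make 'self-attention' a bit more concrete, here is a toy NumPy sketch of the scaled dot-product attention the Transformer is built on (the shapes and values are purely illustrative, not from the paper):

```python
# Toy scaled dot-product self-attention: every token builds query, key and
# value vectors and attends to every other token in the sentence.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # query, key, value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to the others
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                        # context-aware vector for each token

d = 8                                         # tiny embedding size for the demo
X = np.random.rand(4, d)                      # 4 token embeddings, e.g. "I", "quit", and two platform names
Wq, Wk, Wv = [np.random.rand(d, d) for _ in range(3)]
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8): one contextual vector per token
```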

If you have any background in electronics or software encryption, you know the words encoder and decoder. The first changes the original input into some encoded form, and the second does the reverse, i.e., encoded back to original.

https://arxiv.org/pdf/1706.03762v5.pdf

This is the diagram from the original paper, and here are the key steps (a small code sketch follows the list):

  1. The first block is the encoder, which has a multi-head attention layer followed by a feed-forward neural network
  2. The second is the decoder, which has one additional layer, 'masked multi-head attention'
  3. Nx denotes the number of layers for both the encoder and the decoder
  4. First we have a stack of encoder layers, where the output of one layer works as the input of the next
  5. The attention layer of the encoder checks the context using query, key and value vectors
  6. Each encoder then passes its understanding to the next layer, and so on
  7. The final encoder output is passed to every decoder layer as the key and value vectors (the queries come from the decoder itself)
  8. The decoder first predicts the first word as final output, then takes that word as input and predicts the next word
  9. This process is repeated until the last word is predicted
  10. Terminate the loop :) remember your early programming days
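A rough sketch of this encoder-decoder loop, using torch.nn.Transformer as a stand-in (real models add token embeddings, positional encodings and a proper vocabulary projection, which are simplified away here):

```python
# Encoder-decoder generation loop, heavily simplified for illustration.
import torch
import torch.nn as nn

d_model, vocab_size = 64, 100
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)  # Nx = 2 here
to_vocab = nn.Linear(d_model, vocab_size)       # map decoder output to word scores

src = torch.rand(10, 1, d_model)                # "encoded" source sentence: 10 tokens
tgt = torch.rand(1, 1, d_model)                 # start token only

for _ in range(5):                              # repeat until the last word (fixed 5 steps here)
    out = model(src, tgt)                       # encoder output feeds every decoder layer
    next_word = to_vocab(out[-1]).argmax(-1)    # greedily pick the most likely next word
    # a real model would embed next_word and append it; we append a random
    # vector just to keep the shape of the loop visible
    tgt = torch.cat([tgt, torch.rand(1, 1, d_model)], dim=0)

print(tgt.shape)                                # grows by one token per step: (6, 1, 64)
```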

Now we can discuss the popular language models.

BERT

Bidirectional Encoder Representations from Transformers, Google

This was an actual breakthrough in the field of NLP pre-trained models, and it can understand context, like the difference between 'this painting is pretty ugly' and 'this watch is pretty'. Both sentences have the word 'pretty', but BERT can understand the different context between the two. As per the official documentation:

BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.

BERT mainly has two keywords: a) bidirectional and b) transformers. Transformers are already explained above; bidirectionality is implemented by masking out some of the words in the input and then conditioning on both directions around each word to predict the masked words. This is not a new concept, but BERT is the one that successfully implemented it at scale before anyone else.
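A short sketch of this fill-in-the-masked-word behaviour, using the Hugging Face transformers library (my tooling choice, not the official BERT repository's scripts):

```python
# BERT predicting a masked word from both its left and right context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# the whole sentence on both sides of [MASK] is visible to the model at once
for prediction in fill_mask("This painting is pretty [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```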

Before BERT, ELMo was the technique used for context-based learning, but BERT has the key advantage here.

What to do with BERT

  1. Pre-training: this is very compute intensive and is only needed if you want to train from scratch for a new language. Google has already trained and provides two models: a) BERT-Base and b) BERT-Large
  2. Fine-tuning: this is the task-specific work where you fine-tune the model. TensorFlow is the default supported framework, and non-official support for PyTorch and Chainer is also available (see the sketch after this list)
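A condensed sketch of the fine-tuning path, assuming the Hugging Face transformers and datasets libraries rather than the original TensorFlow scripts (the dataset name and hyperparameters are placeholders):

```python
# Fine-tuning a pre-trained BERT checkpoint on a small labelled text dataset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

dataset = load_dataset("imdb")  # any labelled text dataset would do
encoded = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(1000)),  # tiny slice for the sketch
    tokenizer=tokenizer,  # lets the Trainer pad batches dynamically
)
trainer.train()  # adapts the pre-trained weights to the new task
```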

You can find the implementation code here, and you can also execute the notebook directly.

BERT is basically designed for fill-in-the-blank kinds of activity, and its larger variant has about 340 million parameters.

Major adoptions of BERT

RoBERTa, FairSeq team, Facebook

This was released in PyTorch, and as per their official documentation:

RoBERTa builds on BERT’s language masking strategy and modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective, and training with much larger mini-batches and learning rates.
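Following the PyTorch Hub page linked in the references, loading RoBERTa looks roughly like this (the checkpoint download is large):

```python
# Loading the pre-trained RoBERTa model through PyTorch Hub (fairseq).
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout for inference

# RoBERTa keeps BERT's masked-word objective, so it can also fill in blanks
print(roberta.fill_mask('The weather today is <mask>.', topk=3))
```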

AzureML-BERT, Microsoft

It is a cloud-based adoption of BERT where the Azure cloud can perform the end-to-end process; as per their website, it has better metrics than Google's native implementation.

https://azure.microsoft.com/en-in/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/

ALBERT: A Lite BERT, Google

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

GPT-2/3

Generative Pretrained Transformer, OpenAI

OpenAI, an initiative co-founded by Elon Musk, also received a $1 billion investment from Microsoft. GPT has the word 'generative' in its name because it was trained to predict the next token based on a sequence of tokens, using unsupervised techniques.

Considering its content-generation capabilities, at the time it was released the management said they were not releasing its full version as open source; they feared it would be dangerous if used for fake-news creation.

This pretrained model mainly used content from the internet, Wikipedia and Reddit, and it is basically developed for content writing or generating new text. It is a unidirectional language model. Unlike BERT, it mainly uses the decoder side and generates new text word by word.
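A minimal sketch of this left-to-right generation, using the small public GPT-2 checkpoint through the Hugging Face pipeline (again, my tooling choice):

```python
# GPT-2 generating a continuation one token at a time, left to right.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# the model only sees tokens to the left of the position it is predicting
result = generator("Artificial intelligence will", max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])
```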

OpenAI also released a music generation module that uses the same GPT-2 approach to model music.

GPT-2 is basically designed for essay-writing kinds of activity, and its largest version has 1.5 billion parameters. GPT-3 has also been announced, with more advanced capabilities, and it is a really big discussion topic around the world.

T-NLG

Turing Natural Language Generation, Microsoft

Considering the recent developments in the field of language models, this is Microsoft's bid to solve NLP tasks like conversation, language understanding, question answering, summarization, etc. As per their claim, it is a 17-billion-parameter language model which needs a Microsoft-developed optimizer called ZeRO and a deep learning optimization library called DeepSpeed. Using both, the model can be trained across multiple GPUs.
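DeepSpeed training is driven by a JSON configuration; a rough sketch (written here as a Python dict, with illustrative values rather than T-NLG's actual settings) looks like this:

```python
# A minimal DeepSpeed/ZeRO configuration sketch; in practice the dict is saved
# as a JSON file and passed to the `deepspeed` launcher alongside the training script.
import json

ds_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},       # mixed precision to fit a larger model per GPU
    "zero_optimization": {
        "stage": 2,                  # ZeRO partitions optimizer states and gradients across GPUs
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# then, roughly: deepspeed train.py --deepspeed --deepspeed_config ds_config.json
```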

This model naturally solves the question-and-direct-answer problem, which is very useful for AI-enabled assistants. It can also answer without a context passage; in that case, the model relies on knowledge gained during pre-training to generate an answer.

It supports abstractive summarization, like a human would write, not extractive summarization, which only reduces the number of sentences. It can summarize multiple kinds of documents like emails, Excel files, etc.
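T-NLG itself is not publicly downloadable, so as a stand-in, here is what an abstractive-summarization call looks like with a public model through the Hugging Face pipeline (the default model is picked by the library, not by Microsoft):

```python
# Abstractive summarization with a publicly available model, for illustration only.
from transformers import pipeline

summarizer = pipeline("summarization")  # downloads a default public summarization model

email = (
    "Hi team, the quarterly numbers are in. Revenue grew 12 percent, "
    "mostly driven by the new subscription plan, while support costs "
    "stayed flat. We will review the details in Friday's meeting."
)
print(summarizer(email, max_length=30, min_length=10)[0]["summary_text"])
```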

Microsoft has not made most of the details public here, so for this section I have taken most of the content from the source website.

This is the USP from their website:

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

Commercial vs Research

You have seen the commercial models, but have a look at the research benchmark too, i.e., the GLUE leaderboard as of 19th July 2020:

https://gluebenchmark.com/leaderboard

Conclusion

This space is very interesting, and it will change the way the world currently communicates with machines. Watch this space.

PS: Transformer may learn ‘this space’ as ‘transfer learning in NLP’ :D

Reference

  1. https://github.com/google-research/bert
  2. https://allennlp.org/elmo
  3. https://openai.com/blog/better-language-models/
  4. https://arxiv.org/pdf/1706.03762v5.pdf
  5. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
  6. https://arxiv.org/abs/1810.04805
  7. https://pytorch.org/hub/pytorch_fairseq_roberta/
  8. https://github.com/microsoft/AzureML-BERT
  9. https://azure.microsoft.com/en-in/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
  10. https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html
  11. https://arxiv.org/pdf/1910.01108.pdf
  12. https://www.latimes.com/business/story/2019-07-22/microsoft-openai
  13. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
  14. https://github.com/microsoft/DeepSpeed
  15. https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/
  16. https://gluebenchmark.com/leaderboard

Please share your feedback and let me know if I have missed anything.
