From BERT to ALBERT: Pre-trained Language Models

TeqnoVerse · DoxaStar · Oct 18, 2019 · 5 min read
Fig. 1: Pre-trained language models, from BERT to ALBERT (source)

In the last two years, transfer learning has significantly improved the state of the art on many NLP tasks, thanks to pre-trained language models, especially those based on the Transformer.

In June 2017, the Transformer network was introduced. Since then, many Transformer-based models have been proposed for pre-training language models that can be fine-tuned on many different NLP tasks. The Transformer was designed for sequence-to-sequence tasks such as translation from one language to another, but its attention mechanism is so efficient that one can use its encoder alone (bidirectional LM) or its decoder alone (unidirectional LM) for language modelling.

The emergence of these pre-trained models has marked a turning point in the way we do deep NLP, leading many researchers to say that NLP has reached its ImageNet moment. These pre-trained models have the same kind of impact that pre-trained ImageNet models had in computer vision, with the advantage that they don't need labeled data.

Static Word Embeddings as History

Static word embeddings, which had been essential components of the success of deep learning in NLP, are on their way out. word2vec (2013), GloVe (2014) and fastText (2015) are no longer needed, as they give a single static representation for each word and disregard its context.

If you have these two sentences:

"Apple stock is rising." and "I don't want this apple."

Static word embeddings give the same representation of "apple" regardless of its different contexts and meanings. Taking context into account implies that we can't use a static dictionary of word vectors; instead, we need a function (a model) that predicts the representation of each word in its context.
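As a rough illustration of the difference (my own sketch, assuming the Hugging Face transformers and torch packages are installed), the snippet below extracts BERT's contextual vector for "apple" in each of the two sentences and compares them; a static embedding table would always return identical vectors.

```python
# Minimal sketch: contextual embeddings of "apple" in two different sentences.
# Assumes transformers and PyTorch are installed, and that "apple" is a single
# token in the bert-base-uncased vocabulary.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the last-layer contextual vector of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = embed_word("Apple stock is rising.", "apple")
v2 = embed_word("I don't want this apple.", "apple")
print(f"cosine similarity: {torch.cosine_similarity(v1, v2, dim=0).item():.3f}")
# A static lookup would give exactly 1.0; BERT gives a noticeably lower value.
```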

ELMo was among the first efforts to produce contextualized word embeddings, followed by ULMFiT, which proposed pre-training a language model and fine-tuning it for downstream tasks. ULMFiT was based on the state-of-the-art language model of the time, an LSTM-based model.

The Transformer has proven to be more efficient and faster than LSTMs or CNNs for language modelling, so the subsequent advances in this domain rely on this architecture.

Transformer-Based Pre-trained Models

GPT (the OpenAI Transformer) was the first Transformer-based pre-trained language model. It uses the decoder of the Transformer to model language autoregressively: the model predicts the next word from its previous context. GPT showed good performance on many downstream tasks.
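To make the autoregressive setup concrete, here is a small sketch of my own (not from the GPT paper), using GPT-2 from the transformers library as a stand-in for the original GPT: the distribution over the next token depends only on the tokens that came before it.

```python
# Minimal sketch of autoregressive language modelling, using GPT-2 as a
# stand-in for the original GPT (assumes transformers and torch are installed).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "Transfer learning in NLP has significantly improved the"
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# The prediction at the last position is conditioned only on the left context.
next_token_id = int(logits[0, -1].argmax())
print(context + tokenizer.decode([next_token_id]))
```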

Fig. 2 shows the timeline along which almost all of these pre-trained language models emerged.

Fig. 2: Timeline of the emergence of pre-trained language models (source)

A few months later, BERT from Google was introduced. It uses the encoder of the Transformer to model both the left and the right context of each word, so it is supposed to reach a better understanding of the context. The problem is that this bidirectionality cannot be trained with the standard next-word language-modelling objective, so a new task called masked language modelling was proposed: instead of predicting the next word, a few words are masked and the model is asked to predict only those masked words. BERT was also trained to predict whether one sentence follows another (a binary classification task).
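As a quick illustration of the masked-language objective (a minimal sketch, assuming the transformers fill-mask pipeline is available), one word is replaced with [MASK] and BERT is asked to predict only that position:

```python
# Minimal sketch of BERT's masked language modelling objective
# (assumes the Hugging Face transformers library is installed).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Apple [MASK] is rising."):
    print(prediction["token_str"], round(prediction["score"], 3))
```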

Three months later, XLM from Facebook extended BERT into a cross-lingual language model. A month after that, GPT-2 from OpenAI scaled GPT up in terms of model parameters and training data; it showed such an ability to generate human-like text that OpenAI did not release the largest model.

In the following months, MT-DNN from Microsoft outperformed BERT on several NLP tasks. MT-DNN is a BERT-based model that extends BERT with multi-task training.

In June 2019, Carnegie Mellon University and Google Brain introduced XLNet, which replaces BERT's masked-language task with a new objective that still captures bidirectional context: permutation language modelling, in which the factorization order of each sentence is permuted so that both the left and the right context are taken into consideration.
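The idea is easier to see with a toy example (my own illustration, not XLNet's actual implementation): sample a permutation of the token positions and predict each token from the tokens that precede it in that permuted order, so the visible context for a given word can contain tokens from both of its sides.

```python
# Toy illustration of permutation language modelling (not XLNet's real code):
# tokens are predicted in a randomly permuted factorization order, so the
# conditioning set for a word can include tokens from either side of it.
import random

tokens = ["Apple", "stock", "is", "rising"]
order = list(range(len(tokens)))
random.shuffle(order)  # one sampled factorization order, e.g. [2, 0, 3, 1]

for step, position in enumerate(order):
    visible = [tokens[p] for p in order[:step]]
    print(f"predict '{tokens[position]}' at position {position} given {visible}")
```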

In July 2019, RoBERTa was introduced. It keeps BERT's architecture but optimizes the pre-training recipe, training longer on more data with dynamic masking and dropping the next-sentence-prediction task, which yields better performance.

In September 2019, Megatron-LM (NVIDIA), ALBERT (Google), StructBERT (Alibaba), CTRL (Salesforce) and DistilBERT (Hugging Face) arrived.

DistilBERT reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. Megatron-LM is a scaled-up Transformer, roughly 24 times the size of BERT. CTRL is a 1.63 billion-parameter conditional generative Transformer language model. StructBERT extends BERT by incorporating language structures into pre-training.
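The DistilBERT size claim is easy to sanity-check; here is a small sketch (assuming transformers is installed) that simply counts the parameters of the two publicly released checkpoints.

```python
# Rough sanity check of DistilBERT's size reduction relative to BERT base
# (assumes the Hugging Face transformers library is installed).
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")
distil = AutoModel.from_pretrained("distilbert-base-uncased")

n_bert = sum(p.numel() for p in bert.parameters())
n_distil = sum(p.numel() for p in distil.parameters())
print(f"BERT base:  {n_bert / 1e6:.0f}M parameters")
print(f"DistilBERT: {n_distil / 1e6:.0f}M parameters ({1 - n_distil / n_bert:.0%} smaller)")
```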

ALBERT, a lite version of BERT, establishes new state-of-the-art results while having far fewer parameters than BERT, thanks to factorized embedding parameterization and cross-layer parameter sharing.

Conclusion

Five pre-trained language models were introduced in September 2019 alone, and most big tech companies are working in this space. While Megatron-LM (NVIDIA) scales the model up to achieve good performance, ALBERT (Google) reduces the model size while achieving better results, which makes me think that small groups and startups can also play a role in developing such models.

In early 2021, Google Brain introduced the Switch Transformer.

References

1. Attention Is All You Need

2. Deep contextualized word representations

3. Universal Language Model Fine-tuning for Text Classification

4. Improving Language Understanding by Generative Pre-Training

5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

6. Better Language Models and Their Implications

7. Multi-Task Deep Neural Networks for Natural Language Understanding

8. XLNet: Generalized Autoregressive Pretraining for Language Understanding

9. Cross-lingual Language Model Pretraining

10. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

11. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

12. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

13. RoBERTa: An optimized method for pretraining self-supervised NLP systems

14. CTRL: A Conditional Transformer Language Model for Controllable Generation

15. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
