Understanding BERT Variants: Part 1

An overview of ALBERT, RoBERTa & ELECTRA

Mehul Gupta
Data Science in your pocket


It was one hell of a ride discussing Transformers & BERT in the past 2 blogs. Since the inception of BERT, a number of variants have come up trying to address a few issues found with the original BERT:

  • Consumes a lot of time & resources to train
  • Humongous in size due to 110 million parameters
  • High inference time
  • Some researchers consider BERT ‘less efficiently’ trained & believe better training techniques could have been used

Note: If any terminology looks ghostly, refer to my previous posts on the Transformer & BERT

Researchers soon came out with variants matching the original BERT's performance with a few tweaks. We will be exploring some significant versions today, starting off with

ALBERT

Most of the problems in BERT were due to the huge number of parameters to be trained, making it slow & bulky. ALBERT tried slicing the total of 110 million parameters down to about 12 million (~1/10th of the original BERT's size), making it better for real-world deployment & faster, by following the below strategies.

Cross-Layer Parameter Sharing

If you remember, we have 12 encoder blocks in BERT (BERT-base), with each block consisting of:

  • Multi-Head Attention layer
  • Normalization
  • Feed-Forward layer
  • Normalization

Now, while training BERT, can the parameters of one of these blocks be shared with the others (hence avoiding training the other 11 blocks separately)?

This is what cross-layer parameter sharing in ALBERT does: the same set of parameters is shared across all 12 blocks, drastically reducing the number of parameters to be trained & therefore giving a smaller size, faster training & lower inference time.

This sharing can be done in a few ways (a rough sketch of the ‘all shared’ option follows the list below):

  • All shared: Weights for both the Feed-Forward Network & the Attention layer are shared across all blocks
  • Only Feed-Forward Network: Here, weights for only the FFN are shared & the attention layer is trained separately for each block
  • Only Attention: Weights for only the Attention layer are shared & the FFN is trained separately for each block
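
Here is that sketch: a minimal PyTorch-style toy (my own code, not the official ALBERT implementation) where a single encoder block's weights are reused for all 12 layers, compared against a stack of 12 independent blocks.

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Toy 'all shared' ALBERT-style encoder: one block reused for every layer."""
    def __init__(self, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        # A single Transformer encoder block (attention + FFN + norms)...
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_layers = n_layers

    def forward(self, x):
        # ...applied n_layers times, so only one block's weights exist / get trained
        for _ in range(self.n_layers):
            x = self.block(x)
        return x

shared = SharedEncoder()
# BERT-style stack: 12 independent copies of the block
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 12, batch_first=True), num_layers=12
)
print(sum(p.numel() for p in shared.parameters()))    # weights of ~1 block
print(sum(p.numel() for p in unshared.parameters()))  # roughly 12x more
```

Counting the parameters of both models shows the shared version holds roughly 1/12th of the encoder weights, which is exactly where ALBERT's size savings come from.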

Factorized Embedding Layer Parameterization

The diagram from the previous post, showing the learned input embedding feeding the encoder block, should look familiar by now!

Since d_model = 768 for BERT (the embedding dimension the encoder block takes as input & produces as output, check out my previous post on BERT), the ‘learned’ input embedding must have that same dimension, as its output becomes the input for the encoder block. And as the vocabulary used in BERT has about 30,000 WordPiece tokens, this input embedding matrix is 30,000 x 768 (one embedding per token).

Now, if I wish to change d_model from 768 to, say, 1500, this also blows up the input embedding into a 30,000 x 1500 matrix that has to be trained, even though these token-level embeddings are context-independent & don't really need to be as large as the encoder's hidden states.

Can this be avoided somehow?

By matrix factorization. The idea is simple. In BERT, what we do is:

Input Embedding (30,000 x 768) → Encoder → Output (30,000 x 768)

Now, in ALBERT, we would be doing instead

IE (30,000 x N) x Temp (N x 768) → Encoder → Output (30,000 x 768)

Hence, breaking the 30,000 x 768 matrix into 2 parts, 30,000 x N & N x 768, reduces the input embedding parameters that have to be learned. This N can be kept small, say 128, i.e. N << 768: the original 30,000 x 768 ≈ 23M parameters shrink to 30,000 x 128 + 128 x 768 ≈ 3.9M.
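
Here's a minimal sketch of the same factorization in PyTorch (module names like `albert_style` are mine, just for illustration): the big V x H table becomes a V x E table followed by an E x H projection.

```python
import torch.nn as nn

V, H, E = 30_000, 768, 128   # vocab size, hidden size, small embedding size

# BERT-style: one big V x H embedding table
bert_style = nn.Embedding(V, H)              # 30,000 * 768 ≈ 23.0M params

# ALBERT-style: factorize into V x E and E x H
albert_style = nn.Sequential(
    nn.Embedding(V, E),                      # 30,000 * 128 ≈ 3.84M params
    nn.Linear(E, H, bias=False),             # 128 * 768   ≈ 0.10M params
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(bert_style), count(albert_style))   # ~23M vs ~3.9M
```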

In this way, using cross-layer parameter sharing & factorized embedding parameterization, ALBERT becomes almost 1/10th of the size of BERT. But it isn't trained exactly like BERT either.

Training ALBERT

Recollecting from the last post, BERT was pre-trained on 2 problems:

  • Masked Language Modelling (i.e. MLM, predicting randomly masked tokens in the input sequence)
  • NSP (Next Sentence Prediction; a binary classifier predicting whether the 2nd sentence actually follows the 1st in the input pair)

Though ALBERT follows MLM, it ignores NSP as the researchers believed that

  • It is an easy task compared to MLM & hence not a great add-on in pre-training
  • It actually mixes 2 tasks together: topic prediction (the model can guess the label from the theme of the two sentences, irrespective of their order, and swapping the sentences keeps the theme the same) + coherence prediction (how well the 2nd sentence follows the 1st), which is a lot for a single binary label. So it may fail on inputs like ‘He plays so well. He is a footballer’, where the sentences are swapped but the topic is unchanged

ALBERT instead uses Sentence Order Prediction (SOP), again a binary classifier, this time aiming to detect whether the order of the two given sentences is correct or swapped. This is useful because it depends purely on inter-sentence coherence, removing the topic-prediction shortcut. For example:

He took an exam. It was tough

For this case, the model should predict a positive as the 2 sentences make sense in this order.

It was tough. He took an exam

Here, as the order got swapped, the model should predict a negative.

Now, just as we prepared data for NSP, SOP needs its own labelled pairs. The dataset is easy to prepare: from a document, extract consecutive sentence pairs & swap the order of some of them.
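
As a rough illustration (my own helper function, not taken from the ALBERT paper), SOP pairs can be generated from consecutive sentences like this:

```python
import random

def make_sop_pairs(sentences, swap_prob=0.5):
    """Build (sentence_a, sentence_b, label) triples from consecutive sentences.
    label = 1 -> correct order, label = 0 -> swapped order."""
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        if random.random() < swap_prob:
            pairs.append((b, a, 0))   # swapped -> negative example
        else:
            pairs.append((a, b, 1))   # original order -> positive example
    return pairs

doc = ["He took an exam.", "It was tough.", "He still passed."]
print(make_sop_pairs(doc))
```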

Moving on to,

RoBERTa

It stands for Robustly Optimized BERT Pre-training Approach. The researchers who brought it out believed BERT was ‘under-trained’ & could be improved with the below changes to pre-training:

Dynamic Masking of tokens

So take a step back & try remembering how MLM training data was prepared for BERT: we randomly ‘mask’ a few tokens within the input sequence, replace a few with random tokens & keep some as they are. Right?

But in BERT this masking is done once during data preparation, so when training for, say, 10 or 100 epochs, the sequence (once adulterated) remains the same for every epoch. In RoBERTa, the aim is to mask different tokens in the input sequence for every epoch. Hence, the input doesn't remain constant across epochs & the model can learn much better. So, if the input sequence is

We arrived at the airport in time

then we can prepare 10 (or maybe more) differently masked versions of it, for example:

We [MASK] at the airport in time
We arrived at the [MASK] in time
We arrived at the airport in [MASK]

and feed a different version to the model in each epoch.
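
A simplified sketch of dynamic masking (I only mask tokens here, skipping BERT's 80/10/10 mask/replace/keep split), where a fresh mask is drawn every time the sequence is seen:

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Return a freshly masked copy of `tokens`; called anew every epoch."""
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

tokens = "We arrived at the airport in time".split()
for epoch in range(3):
    # A different random mask each epoch, unlike BERT's one-off static masking
    print(f"epoch {epoch}:", dynamic_mask(tokens))
```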

  • The next major change was pre-training only on MLM & dropping NSP completely, as the researchers also believed it doesn't add much value
  • A much bigger corpus was used: BERT was trained on 16 GB of text, which was scaled up to 160 GB for RoBERTa
  • A bigger batch size was used while training
  • A byte-level BPE (Byte-Pair Encoding) tokenizer is used instead of BERT's WordPiece (a quick comparison is sketched below). You can read about both WordPiece & byte-level BPE here
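
If you have the Hugging Face transformers library installed, here's a quick way to eyeball the difference between the two tokenizers (purely illustrative; it downloads the pretrained vocabularies):

```python
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")    # WordPiece
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")   # byte-level BPE

text = "Tokenization is fun"
print(bert_tok.tokenize(text))     # WordPiece pieces, e.g. ['token', '##ization', ...]
print(roberta_tok.tokenize(text))  # byte-level BPE pieces (with 'Ġ' space markers)
```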

That’s all, moving on to

ELECTRA

ELECTRA stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately.

To be honest, these are just fancy names !!

So what ELECTRA does is replace the MLM training task with a ‘Replaced Token Detection’ task.

What the heck is that?

It is similar to MLM, except that instead of masking tokens, we replace some tokens in the sequence with other plausible tokens & train the model to detect which tokens were replaced.

Why is this required?

As discussed in the last post, BERT has 2 training phases i.e. pre-training (learning the language) & fine-tuning (training for a specific task, say a Q&A system). Now, we never see [MASK] tokens in most downstream tasks during fine-tuning, even though we pre-train BERT on MLM. Because of this mismatch, fine-tuning may take longer / the model may not learn properly. So MLM was replaced by ‘Replaced Token Detection’, and there is no NSP task either (again, it looked unnecessary to the researchers). Hence, pre-training is done just on the replaced token detection problem !!

How do we replace tokens for preparing the training set?

This isn’t as straightforward as you might think.

The Generator-Discriminator Trade-Off

So, training ELECTRA involves a Generator, which:

1. Follows the MLM approach used in BERT pre-training
2. Randomly masks a few tokens within the input sequence; this sequence is then passed to a BERT-style model that outputs, for each masked position, a probability distribution over the most apt replacement words

Now, this new input sequence (with predicted tokens in place of the masked tokens) is fed to a Discriminator, which takes the sequence output by the Generator & classifies each token as replaced / original.

We must remember that both the Generator & the Discriminator are BERT-style encoders. Also, once training is done, the Generator part is discarded. The Discriminator is what we call ELECTRA.

An example is always great to have !!

Input Sequence: The chef cooked the meal

Step 1: Mask a few tokens = [ [MASK] , chef, [MASK] ,the, meal]

Step 2: Feed to Generator (BERT for detecting Masked tokens)

Step 3: Replace the masked tokens with the highest-probability words output by the Generator. Let’s assume it generated ‘a’ for ‘the’ & ‘ate’ for ‘cooked’.

Now, our sequence becomes: ‘a chef ate the meal’. This becomes the input for the Discriminator, which tries to flag each token as original/replaced; here it should flag ‘a’ & ‘ate’ as replaced & the rest as original.
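
To tie the example together, here's a toy sketch of how the Discriminator's per-token labels would be built (the Generator's outputs are hard-coded here; in reality they come from a small BERT-style MLM):

```python
original = ["The", "chef", "cooked", "the", "meal"]
masked   = ["[MASK]", "chef", "[MASK]", "the", "meal"]

# Pretend generator output for the masked positions (hard-coded for illustration)
generator_out = ["a", "chef", "ate", "the", "meal"]

# Discriminator targets: 1 = replaced, 0 = original, decided per token
labels = [int(gen != orig) for gen, orig in zip(generator_out, original)]

print(list(zip(generator_out, labels)))
# [('a', 1), ('chef', 0), ('ate', 1), ('the', 0), ('meal', 0)]
```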

But what advantage does it give over BERT?

In MLM pre-training, BERT gets a learning signal only from the ~15% of tokens that were masked & learns nothing from the rest, so the training signal is restricted. Replaced token detection, on the other hand, classifies every token, so the training signal comes from all positions & the model learns more from the same data.

I guess we should wrap it up for now !! I will continue with the 2nd family of BERT variants, based on the idea of Knowledge Distillation, in my next post.
