ELECTRA — Addressing the flaws of BERT’s pre-training process
BERT and its relatives use a pre-training procedure that does not utilize the data to its full extent. This wastes computational resources while leaving a lot of performance to be gained.
ELECTRA introduces a pre-training framework that matches BERT-small’s GLUE performance with a model of the same size and 12x less compute. It even matches RoBERTa’s state-of-the-art results with 4x less compute. Could this efficient pre-training process become the new default?
Since its introduction, BERT has been the default model for NLP tasks. Its success can be attributed to two factors: its transformer-encoder architecture (enabling deep, bi-directional language understanding) and its pre-training process. If we examine the latter in detail, which in part consists of Masked Language Modelling (MLM), it does not take long to find some areas of potential improvement.
- MLM replaces 15% of tokens with the special [MASK] token and trains BERT to reconstruct the original sequence from the remaining, unmasked context. This approach severely limits the token efficiency, that is, the amount of language understanding gained per token during the pre-training phase. It is, however, not possible to address this inefficiency by simply increasing the mask-token ratio. Devlin et al. presumably chose 15% because it performed well in general. It will, nonetheless, never be possible to achieve 100% token efficiency with this approach, since it depends on the unmasked context to predict the masked tokens.
- The [MASK] token is only present during pre-training, which precedes the fine-tuning step. This creates a mismatch between the token distributions of the two stages, even though Devlin et al. take measures to limit its negative impact.
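To make the first point concrete, here is a minimal sketch of MLM-style masking in plain Python. It is a simplification and not BERT's actual implementation: the function name `mask_tokens`, the whitespace tokenization, and the example sentence are all assumptions made for illustration, and the sketch always inserts [MASK] rather than applying the mitigation measures Devlin et al. use.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Mask a random subset of tokens; return (masked sequence, masked positions)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    chosen = set(positions)
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, positions

# Hypothetical example sentence, split on whitespace instead of WordPiece.
sentence = ("the chef cooked the meal and served it to the hungry guests "
            "who had waited for nearly two hours outside")
tokens = sentence.split()
masked, positions = mask_tokens(tokens)

# Only the masked positions contribute to the MLM loss.
signal_fraction = len(positions) / len(tokens)
print(masked)
print(f"positions with a learning signal: {len(positions)}/{len(tokens)}")
```

Out of the 20 tokens in this toy sentence, only 3 are masked, so the model receives a prediction target for 3 positions and nothing for the other 17.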
Addressing these limitations would require an entirely new pre-training approach. If only there were someone who could help us! 😉 That someone is Clark et al., whose solution is presented in ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. Let’s explore the main ideas of this paper!
ELECTRA’s pre-training process
The two flaws mentioned above revolve around the [MASK] token, which BERT is trained to replace with the original token during the pre-training phase. This makes BERT a generative model, predicting words rather than labels or classes, which would have made it a discriminative model. The framework presented by Clark et al. uses both kinds of model during pre-training: a generator and a discriminator. To get a sense of this setup, study the following schematic:
As shown in the figure above, the pre-training framework uses a small BERT model, trained through MLM, to construct a corrupted sequence (the one found in the middle). This is achieved by sampling a replacement token from the generator’s output distribution at each position where the original token was masked. (The probability of each token is given by the softmax activations at that position; even after training is complete, the generator only picks the correct token about 65% of the time.)
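The corruption step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code: the function `corrupt` and the hand-written `generator_probs` table are hypothetical stand-ins for a trained generator's softmax output, and the replacement is drawn from that distribution with `random.choices`.

```python
import random

def corrupt(tokens, masked_positions, generator_probs, seed=0):
    """Fill each masked position with a token sampled from the generator's distribution."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for pos in masked_positions:
        candidates, weights = zip(*generator_probs[pos].items())
        # Sample one token in proportion to its (softmax) probability.
        corrupted[pos] = rng.choices(candidates, weights=weights, k=1)[0]
    return corrupted

original = ["the", "chef", "cooked", "the", "meal"]
masked_positions = [0, 2]

# Hypothetical generator output: token -> probability at each masked position.
generator_probs = {
    0: {"the": 0.9, "a": 0.1},
    2: {"cooked": 0.6, "ate": 0.3, "served": 0.1},
}

corrupted = corrupt(original, masked_positions, generator_probs)
print(corrupted)
```

Because the replacement is sampled, the generator sometimes reproduces the original token and sometimes inserts a plausible alternative, which is what makes the discriminator's job non-trivial.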
The corrupted sequence, consisting of replaced tokens and original ones, is the input to the discriminator, ELECTRA (which, by the way, is short for Efficiently Learn an Encoder that Classifies Token Replacements Accurately). This model’s task, as the acronym subtly suggests, is to predict for each token whether or not it has been replaced by the generator. Formulating the pre-training task this way allows ELECTRA to learn from every token in the input sequence instead of only 15% of them! And since the discriminator never sees the [MASK] token, training the model in this way addresses both previously mentioned issues.
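A rough sketch of the discriminator's training signal, assuming a per-token binary cross-entropy loss: the helper names `replaced_labels` and `discriminator_loss` are made up for this example, and the probabilities passed in stand in for a real model's outputs. Note that a position where the generator happens to reproduce the original token is labelled as original.

```python
import math

def replaced_labels(original, corrupted):
    """1 where the token differs from the original, 0 where it is unchanged."""
    return [int(o != c) for o, c in zip(original, corrupted)]

def discriminator_loss(probs_replaced, labels, eps=1e-12):
    """Mean binary cross-entropy, averaged over every position in the sequence."""
    losses = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs_replaced, labels)
    ]
    return sum(losses) / len(losses)

original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]

# Hypothetical discriminator outputs: probability that each token was replaced.
uncertain = discriminator_loss([0.5] * 5, labels)
confident = discriminator_loss([0.05, 0.05, 0.9, 0.05, 0.05], labels)
print(uncertain, confident)
```

Every one of the five positions contributes a term to the loss, in contrast to MLM where only the masked positions do.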
After pre-training the generator and ELECTRA jointly through the process described above, the generator can be thanked for its service and discarded. What we are left with is the discriminator, which can be thought of as a pre-trained BERT model. From here, this model can be fine-tuned on task-specific data, just like we would with BERT.
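For reference, the joint training can be summarised as a single combined objective. The notation below follows my reading of Clark et al.: θ_G and θ_D are the generator and discriminator parameters, X is the text corpus, and λ is a weighting factor that balances the two losses.

```latex
\min_{\theta_G,\,\theta_D} \sum_{x \in \mathcal{X}}
    \mathcal{L}_{\text{MLM}}(x, \theta_G) + \lambda \, \mathcal{L}_{\text{Disc}}(x, \theta_D)
```

The generator is trained with the usual MLM loss, while the discriminator is trained on the replaced-token detection loss over the sequences the generator produces.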
Before moving on to the results, let’s mention model size in this setup. Clark et al. found that the best performance was achieved when the generator was smaller than the discriminator. They argue that the corrupted sequence would be too difficult for the discriminator to decipher if the size relation were the other way around.
Another implication worth mentioning is the increased pre-training cost compared to BERT for the same model size. Training an ELECTRA model of the same size as BERT-base requires us to also train a smaller generator. Clark et al. analyze this effect in detail and account for it in their comparison, where performance per FLOP allows ELECTRA to be compared with other models. Remember, what they want to achieve is a more efficient pre-training algorithm in terms of absolute computational cost, not necessarily in terms of training steps, which is the more common way to compare models.
The higher token efficiency allows for faster learning (better performance per unit of compute), which enables training even with limited computing resources. Clark et al. found that a discriminator equivalent in size to BERT-small achieves similar GLUE performance (75.1 GLUE points) with 12x less compute than BERT-small. To put that in perspective, a training process that previously took 4 days to complete on a single V100 GPU can now be finished in just 9 hours!
Training this BERT-small-equivalent ELECTRA with the same amount of compute results in a significantly higher GLUE score: 79.9 compared to 75.1. This puts the ELECTRA model close to BERT-base’s performance (82.2 GLUE points) while using 45x less compute.
The figures below neatly summarise ELECTRA’s improvements over BERT. On the left, we find that the pre-training framework explained above has the biggest impact on smaller models but should not be disregarded for larger ones. Plotting performance against computational cost, as shown on the right, highlights the speed at which ELECTRA is able to learn.
The story is much the same even in the state-of-the-art regime. ELECTRA is able to achieve comparable results to RoBERTa and XLNet on GLUE with 4x less compute and better results with the same amount of compute.
While improving upon state-of-the-art results truly is an impressive feat, I find the improvements on smaller models much more exciting. I feel this way because these are improvements that mere mortals, who do not have access to stacks of cloud TPUs, can benefit from!