Published in Analytics Vidhya

What happens after BERT? A summary of the ideas behind it


The variety of models and ideas that followed BERT can be dizzying. What are they trying to tell us? I hope this article makes things clearer.

We will cover the following:

  • Increase coverage to improve MaskedLM
  • NextSentencePrediction 👎 ?
  • Will other pre-training tasks do better?
  • Make it small
  • Multi-language
  • Bigger models, better results?
  • Multitasking


The reason BERT is so remarkable is that it changed the way NLP models are trained.

It uses a large-scale corpus to train a semantic model, and then uses this model for downstream tasks such as reading comprehension, sentiment classification, and NER.

Yann LeCun calls this self-supervised learning.

BERT is a multi-task model based on the Transformer encoder. It captures semantics through two pre-training tasks: MaskedLM and NextSentencePrediction (NSP).

Increase coverage to improve MaskedLM
In MaskedLM, masking is applied to individual pieces after WordPiece tokenization.

Guessing ‘tok’ when ‘##eni’ and ‘##zation’ are provided is not difficult, compared with guessing the whole word from its context.

Because the association between the pieces of one word differs from the association between a word and other words, BERT may fail to learn the relationships between words.

Predicting part of a word teaches the model little; predicting the whole word teaches it more about semantics. It therefore makes sense to expand the coverage of masking:

  • Whole-word masking: wwm
  • Phrase-level masking: ERNIE
  • Masking spans up to a certain length: N-gram masking / span masking
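
To make the difference between piece-level and whole-word masking concrete, here is a minimal sketch. It assumes the usual WordPiece convention that continuation pieces start with `##`. The helper names and the choice to apply the mask ratio to words rather than tokens are simplifications for illustration, not the implementation used by any of the papers.

```python
import random

def group_wordpieces(tokens):
    """Group token indices so that '##' continuation pieces stay with their word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def whole_word_mask(tokens, mask_ratio=0.15, seed=0):
    """Mask whole words: every piece of a chosen word becomes [MASK]."""
    rng = random.Random(seed)
    groups = group_wordpieces(tokens)
    # Assumption: the ratio is applied to words, and at least one word is masked.
    n_words = min(len(groups), max(1, round(len(groups) * mask_ratio)))
    masked = list(tokens)
    for group in rng.sample(groups, n_words):
        for i in group:
            masked[i] = "[MASK]"
    return masked

tokens = ["the", "tok", "##eni", "##zation", "step", "is", "simple"]
print(group_wordpieces(tokens))
print(whole_word_mask(tokens, mask_ratio=0.3))
```

With this grouping, ‘tok’, ‘##eni’ and ‘##zation’ are always masked together or not at all, so the model can never recover the word from its own leftover pieces.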

Phrase-level masking requires a corresponding phrase list. Supplying such hand-crafted information may disturb the model and bias it. Masking longer spans seems to be a better solution, so T5 tried different span lengths and reached this conclusion:

It can be seen that increasing the length is effective, but longer is not always better. SpanBERT has a better solution: it uses probability sampling to reduce the chance of masking overly long spans.
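
The probability sampling can be sketched as follows. The SpanBERT paper describes drawing span lengths from a geometric distribution with p = 0.2, capped at 10 tokens; those two numbers come from the paper, not from this article. Renormalizing the weights over lengths 1 to 10 is my assumption about how to handle the cap, and the function names are made up for the example.

```python
import random

def span_length_weights(p=0.2, max_len=10):
    """Geometric weights p * (1 - p)^(k - 1) for k = 1..max_len, renormalized to sum to 1."""
    raw = [p * (1 - p) ** (k - 1) for k in range(1, max_len + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_span_lengths(n, p=0.2, max_len=10, seed=0):
    """Draw n span lengths; short spans are much more likely than long ones."""
    rng = random.Random(seed)
    lengths = list(range(1, max_len + 1))
    return rng.choices(lengths, weights=span_length_weights(p, max_len), k=n)

weights = span_length_weights()
mean_length = sum(k * w for k, w in zip(range(1, 11), weights))
print([round(w, 3) for w in weights])
print(round(mean_length, 2))
print(sample_span_lengths(10))
```

Because each extra token multiplies the probability by 0.8, a span of length 10 is far less likely than a span of length 1, which is how overly long masks are kept rare without being forbidden.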

SpanBERT’s experimental results:

Change the proportion of masked tokens
Google’s T5 tried different masking ratios to find the best setting. Surprisingly, BERT’s original setting (15%) is the best:

NextSentencePrediction 👎?
NSP learns sentence-level information by predicting whether two sentences are consecutive. According to the experimental results, it did not bring much improvement, and performance even dropped on some tasks.

NSP doesn’t seem to work well! It became a target for everyone to attack, and the following papers all took aim at it: XLNet / RoBERTa / ALBERT



These papers found that NSP does more harm than good! This may be due to the unreasonable design of the NSP task: negative samples are drawn from other documents and are easy to distinguish, so the task teaches little and adds noise. In addition, NSP splits the input into two separate segments, and the resulting lack of long-sequence samples makes BERT perform poorly on long texts.
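
A minimal sketch of how NSP training pairs are built shows why the negatives are so easy. The 50/50 split between positive and negative pairs follows the BERT paper; the function name and the toy documents are made up for illustration.

```python
import random

def make_nsp_example(documents, doc_idx, sent_idx, rng):
    """Build one (sentence_a, sentence_b, is_next) triple.

    documents is a list of documents, each a list of sentences.
    sent_idx must not point at the last sentence of its document.
    """
    sentence_a = documents[doc_idx][sent_idx]
    if rng.random() < 0.5:
        # Positive pair: the sentence that really follows in the same document.
        return sentence_a, documents[doc_idx][sent_idx + 1], True
    # Negative pair: a random sentence from a different document.
    other_docs = [i for i in range(len(documents)) if i != doc_idx]
    negative_doc = documents[rng.choice(other_docs)]
    return sentence_a, rng.choice(negative_doc), False

documents = [
    ["The cat sat on the mat.", "Then it fell asleep."],
    ["Stock prices rose today.", "Analysts were surprised."],
]
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_example(documents, 0, 0, rng))
```

In the toy data the negative sentence is about a completely different topic, so the model can tell the pairs apart from topic alone, without learning anything about sentence order or coherence.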

Other pre-training tasks
NSP is mediocre, so is there a better way to pre-train? People have tried a variety of approaches, and I think the best summaries of the various pre-training tasks are Google’s T5 and FB’s BART.
What T5 tried:

What BART tried:

A language model is usually used as the baseline.

  • Cover some tokens, predict what is covered
  • Shuffle the order of sentences and predict the correct order
  • Delete some tokens, predict where they were deleted
  • Randomly pick a token, rotate the text so that it starts with that token, and predict where the correct beginning is
  • Add some tokens and predict which ones should be deleted
  • Replace some tokens and predict where they have been replaced
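
Three of the corruptions listed above (deleting tokens, shuffling sentences, rotating the text) can be sketched in a few lines. This is only an illustration of the input side: the function names are mine, and the real models pair these corruptions with a decoder or classifier that reconstructs the original text.

```python
import random

def delete_tokens(tokens, ratio, rng):
    """Token deletion: drop each token with probability `ratio`."""
    return [tok for tok in tokens if rng.random() >= ratio]

def permute_sentences(sentences, rng):
    """Sentence permutation: shuffle the sentence order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def rotate_document(tokens, rng):
    """Document rotation: pick a token and make the text start there."""
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
sentences = ["A first sentence.", "A second one.", "And a third."]
print(delete_tokens(tokens, 0.3, rng))
print(permute_sentences(sentences, rng))
print(rotate_document(tokens, rng))
```

In every case the training target is the original, uncorrupted text, so the tasks differ only in what kind of damage the model has to undo.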

The results of the experiment are as follows:

These experiments found that MaskedLM is the best pre-training method. For better results, longer masks and longer input sequences seem to be more effective. To avoid leaking how many words were masked, you can replace a whole span with a single mask token and predict one or more words for it.
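
The single-mask-per-span idea can be sketched like this. The sentence and the masked spans follow the example figure in the T5 paper; the sentinel names `<extra_id_0>`, `<extra_id_1>`, ... and the function name are assumptions for illustration, and the spans are passed in by hand rather than sampled.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with one sentinel token.

    spans must be sorted and non-overlapping; end is exclusive.
    Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)           # one mask, however long the span is
        targets.append(sentinel)
        targets.extend(tokens[start:end])  # the words hidden behind that mask
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The input shows one sentinel for "for inviting" and one for "last", so the model cannot read the span length off the number of mask tokens.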


Make it small
BERT’s model is very large. To make inference faster, another direction is to make the model lighter.
All The Ways You Can Compress BERT covers this in detail.
The directions are:

  • Pruning: delete parts of the model, such as some layers or some heads
  • Matrix factorization: factorize the vocabulary embedding / parameter matrices
  • Knowledge distillation: transfer what BERT has learned to a smaller model
  • Parameter sharing: share the same weights between layers
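
A little arithmetic shows how much the factorization and sharing items in this list can save. The sizes below (a 30,000-word vocabulary, hidden size 768, a 128-dimensional bottleneck, 12 layers, and the per-layer parameter count) are hypothetical round numbers chosen for illustration, not figures from this article.

```python
def embedding_params(vocab_size, hidden_size):
    """A plain embedding table: one hidden-size vector per vocabulary entry."""
    return vocab_size * hidden_size

def factorized_embedding_params(vocab_size, embed_size, hidden_size):
    """Factorized table: a small embedding followed by a projection to hidden size."""
    return vocab_size * embed_size + embed_size * hidden_size

def encoder_params(params_per_layer, num_layers, share_layers):
    """With cross-layer sharing, one set of layer weights is reused by every layer."""
    return params_per_layer if share_layers else params_per_layer * num_layers

V, H, E = 30000, 768, 128  # hypothetical sizes
full = embedding_params(V, H)
small = factorized_embedding_params(V, E, H)
print(full, small, round(full / small, 1))

layer = 7_000_000  # hypothetical parameter count of one encoder layer
print(encoder_params(layer, 12, share_layers=False))
print(encoder_params(layer, 12, share_layers=True))
```

Neither trick reduces the amount of computation per layer by itself; they shrink the number of stored parameters, which is why the other directions in the list (pruning, distillation) are still needed for speed.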

For the models and their results, refer to the original article.


Multi-language
Data is very unevenly distributed across languages. There are usually many English datasets and relatively little data for other languages; for Traditional Chinese the problem is even worse. Since BERT’s pre-training method has no language restrictions, the hope is that putting data from more languages into one pre-trained model will give better results on downstream tasks.

BERT-Multilingual, released by Google, is one example. On downstream tasks it achieves results close to those of the Chinese model without any additional Chinese data.

The paper Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model found that a multilingual BERT fine-tuned on SQuAD (an English reading comprehension task) and evaluated on DRCD (a Chinese reading comprehension task) can achieve results close to QANet. The multilingual model also does not need the data translated into a single language, and it performs better than the translation-based approach!

The above results show that BERT has learned to link data in different languages, either in the embeddings or in the Transformer encoder.
Emerging Cross-lingual Structure in Pretrained Language Models tries to understand how BERT connects different languages.
First, it connects different languages in the same pre-trained model using TLM:

Then it tries to figure out which part affects the results most, by sharing or not sharing each component.

Parameter sharing between models is the key to success

This is because BERT learns the distribution of a word and the context around it. For words with the same meaning in different languages, the context distributions should be close.

BERT’s parameters learn these distributions, which is what produces such an amazing effect in multilingual transfer.

Bigger models, better results?

Although BERT already uses a large model, intuition says that the more data and the larger the model, the better the results should be. This may also be a key to improvement:

T5 used TPUs and the magic of money to push it to the top.

A larger model does not seem to bring much improvement.

Therefore, simply enlarging the model is not the most effective method. Using different training methods and objectives is also a way to improve the results.
For example, ELECTRA uses a new training method in which every token participates, so that the model can learn representations more efficiently.
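
A rough sketch of the ELECTRA idea, replaced-token detection, is below. In the paper a small generator network proposes the replacements; here a random pick from a toy vocabulary stands in for it, which is a deliberate simplification. The function name, the vocabulary and the 15% replacement ratio are assumptions for illustration.

```python
import random

def make_rtd_example(tokens, vocab, replace_ratio=0.15, seed=0):
    """Corrupt some tokens and label EVERY position as original (0) or replaced (1)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_ratio:
            new_tok = rng.choice(vocab)  # stand-in for a small generator network
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # If the replacement happens to equal the original, it counts as original.
        labels.append(int(new_tok != tok))
    return corrupted, labels

tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "meal", "ate", "car"]
corrupted, labels = make_rtd_example(tokens, vocab, replace_ratio=0.4)
print(corrupted)
print(labels)
```

The point to notice is that `labels` has one entry per token, so every position contributes to the loss, whereas MaskedLM only learns from the small fraction of positions that were masked.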

ALBERT uses parameter sharing to reduce the number of parameters without a significant drop in performance.


Multitasking
BERT uses multiple tasks for pre-training. Beyond that, we can also use multiple tasks during fine-tuning. Multi-Task Deep Neural Networks for Natural Language Understanding (MTDNN) does exactly that.

Compared with MTDNN, GPT2 is more radical: it uses an extremely large language model to capture everything without fine-tuning. Just give it a signal indicating the task, and it handles the rest. It is impressive, but still far from success.

T5 strikes a balance

Google’s T5 is similar to GPT2 in that it trains a generative model to produce every answer as text. Like MTDNN, it also lets the model know during training which task it is solving, and it follows the pre-train / fine-tune approach.
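
Here is a minimal sketch of what "letting the model know which task it is solving" looks like as data. The prefixes and sentences follow the style of the examples in the T5 paper; the helper name and the dictionary layout are made up for illustration.

```python
def to_text_to_text(prefix, text, target):
    """Cast any task as text in, text out, with the task named in a prefix."""
    return {"input": f"{prefix}: {text}", "target": target}

examples = [
    # Translation: the answer is the translated sentence.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    # Classification: the answer is the class label written out as text.
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
]
for example in examples:
    print(example["input"], "->", example["target"])
```

Since every task has the same shape, one model with one loss can be trained on all of them, and the prefix is the only signal telling it which task it is solving.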

There are two problems that such a large-scale pre-training model needs to deal with: handling imbalanced data and choosing a training strategy.

Handle imbalanced data

The amount of data differs between tasks, which causes the model to perform poorly on tasks with little data.
One solution is to sample less from large datasets and more from small ones. The way BERT is trained on multiple languages is one example:

To balance these two factors, we performed exponentially smoothed weighting of the data during pre-training data creation (and WordPiece vocab creation). In other words, let’s say that the probability of a language is P(L), e.g., P(English) = 0.21 means that after concatenating all of the Wikipedias together, 21% of our data is English. We exponentiate each probability by some factor S and then re-normalize, and sample from that distribution. In our case we use S = 0.7. So, high-resource languages like English will be under-sampled, and low-resource languages like Icelandic will be over-sampled. E.g., in the original distribution English would be sampled 1000x more than Icelandic, but after smoothing it’s only sampled 100x more.
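
The smoothing described in this passage is short enough to write out. The sketch below uses S = 0.7 from the quote, but the three-way distribution is a made-up toy, chosen only so that English starts out 1000x more likely than Icelandic; the function name is also mine.

```python
def smooth_distribution(probs, s=0.7):
    """Raise each probability to the power s, then re-normalize to sum to 1."""
    powered = {name: p ** s for name, p in probs.items()}
    total = sum(powered.values())
    return {name: value / total for name, value in powered.items()}

# Hypothetical toy distribution: English is 1000x more frequent than Icelandic.
probs = {"english": 0.9, "other": 0.0991, "icelandic": 0.0009}
smoothed = smooth_distribution(probs)

ratio_before = probs["english"] / probs["icelandic"]
ratio_after = smoothed["english"] / smoothed["icelandic"]
print({name: round(p, 4) for name, p in smoothed.items()})
print(round(ratio_before), round(ratio_after))
```

The ratio between two languages is simply raised to the power S, so a 1000x gap becomes 1000^0.7, about 126x, which is the same order of magnitude as the "100x" in the quote.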

Training strategy

  • Unsupervised pre-training + fine-tuning: pre-train T5, then fine-tune on each task and evaluate
  • Multi-task training: train on T5’s pre-training objective and all tasks together, then evaluate directly on each task
  • Multi-task pre-training + fine-tuning: train on the pre-training objective and all tasks together, then fine-tune on each task’s training data and evaluate
  • Leave-one-out multi-task training: multi-task train on the pre-training objective and every task except the target task, then fine-tune on the target task’s dataset and evaluate
  • Supervised multi-task pre-training: multi-task train directly on all the supervised data, then fine-tune on each task

It can be seen that after pre-training on a large amount of data, fine-tuning on task-specific data can alleviate the data imbalance problem that arises during large-scale pre-training.


References
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
SpanBERT: Improving Pre-training by Representing and Predicting Spans
Cross-lingual Language Model Pretraining
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
All The Ways You Can Compress BERT
BERT Multilingual
Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Emerging Cross-lingual Structure in Pretrained Language Models


