ERNIE 2.0: A CONTINUAL PRE-TRAINING FRAMEWORK FOR LANGUAGE UNDERSTANDING

Royal Sequeira · Published in PsyAI · Sep 7, 2019

Existing language representation models like word2vec, BERT, and XLNet are based on the co-occurrence of tokens or sequences. Language, however, is more complex than that: it contains features that are not captured by co-occurrence counts alone. ERNIE 2.0 provides a framework to learn richer language representations by using a suite of lexical, syntactic, and semantic pre-training tasks.

Background:

Multitask Learning

One of the important goals in NLP is to learn a generalized representation of language that can be used for all downstream tasks (such as Sentiment Analysis or Named Entity Recognition). Representations learned on only one task, say text classification, may not be useful for other tasks such as text-pair similarity. One way to attain such generalization is multitask learning.

Multitask learning means learning across a large amount of data from several tasks in order to represent language better. In a typical setup, an encoder is shared across tasks: each task has its own task-specific architecture on top, but all of them use the same shared encoder to represent language. During training, the first task trains on a mini-batch of its own data, and the shared encoder is updated based on the first task's loss. When the second task trains, it uses this shared encoder to represent the input sequences from its own mini-batch, learns on them, and then updates the shared encoder based on its loss. Essentially, all tasks take turns learning on their mini-batches of data and updating the shared encoder. The trouble with this approach to multitask learning is that the weights in the shared encoder can be disrupted by the introduction of new tasks, leading to catastrophic forgetting (the updates learned from previous tasks are drastically forgotten).
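To make the turn-taking update concrete, here is a minimal PyTorch-style sketch. Everything in it is illustrative rather than the actual ERNIE setup: the tiny Transformer encoder, the two linear task heads, and the dimensions are assumptions. The point it shows is that each task updates the shared encoder with only its own loss, which is exactly where later tasks can overwrite what earlier tasks learned.

```python
import torch
import torch.nn as nn

# Illustrative shared encoder and per-task heads (dimensions are arbitrary;
# the real model is a large Transformer encoder).
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)
task_heads = nn.ModuleList([nn.Linear(128, 2), nn.Linear(128, 5)])  # two toy tasks

optimizer = torch.optim.Adam(
    list(shared_encoder.parameters()) + list(task_heads.parameters()), lr=1e-4
)
loss_fn = nn.CrossEntropyLoss()

def round_robin_step(batches):
    """Each task takes its turn: forward its own mini-batch through the shared
    encoder, compute its own loss, and update the shared weights. Later tasks
    can overwrite what earlier tasks wrote into the encoder."""
    for task_id, (x, y) in enumerate(batches):         # one (x, y) mini-batch per task
        hidden = shared_encoder(x)                     # shared representation
        logits = task_heads[task_id](hidden[:, 0, :])  # first token as sequence summary
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()                                # gradients flow into the shared encoder
        optimizer.step()

# Toy usage: batches of 8 sequences of length 16 with d_model=128 features.
toy_batches = [
    (torch.randn(8, 16, 128), torch.randint(0, 2, (8,))),
    (torch.randn(8, 16, 128), torch.randint(0, 5, (8,))),
]
round_robin_step(toy_batches)
```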

The objective, however, is to improve on new tasks without forgetting what is learned from previous tasks — just like human beings would not forget skating when they learn how to ski. This is known as Continual Learning.

Multitask learning is advantageous for two reasons: 1) it acts as a regularizer by forcing the shared encoder to learn representations that generalize across all tasks, and 2) it can make use of a lot of data across tasks.

ERNIE 2.0 Framework

The way ERNIE 2.0 addresses the problem of catastrophic forgetting is to train several tasks in parallel and update the shared encoder based on the average loss from all the tasks:

[Figure: Sequential Multitask Learning]

Step 1: Only Task 1 is trained in this step. The input sequence is represented through the shared encoder, and the weights of the encoder are updated based on the loss from Task 1.

Step 2: The shared encoder is initialized with the weights from the previous step, and Task 1 and Task 2 are both trained simultaneously. Since the shared encoder has already learned Task 1, the updates will be driven mostly by Task 2's loss, thereby retaining what was learned from the previous task.

Step 3: Once both tasks have computed their loss on a mini-batch, the average of the two losses is calculated, and the shared encoder is updated based on it. Both tasks continue training until all the mini-batches are exhausted.

Step 4: The shared encoder is initialized with the weights from the previous step, and Task 1, Task 2, and Task 3 are trained in parallel. Since the encoder has already learned Tasks 1 and 2, the updates will be driven mostly by Task 3, helping the encoder learn to represent data from all three tasks.
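Here is a sketch of the staged schedule those four steps describe. The names (`shared_encoder`, `task_heads`, `task_loaders`) are assumptions for illustration, and the real framework schedules training iterations per task more carefully than this; the sketch only captures the core idea: at each stage one more task is added, every active task contributes a loss on its own mini-batch, and the shared encoder is updated once from the average of those losses while carrying its weights over from the previous stage.

```python
import torch
import torch.nn as nn

def continual_multitask_pretrain(shared_encoder, task_heads, task_loaders, lr=1e-4):
    """Stage k trains tasks 0..k together: each active task computes a loss on
    its own mini-batch, the losses are averaged, and the shared encoder gets a
    single update per step. The encoder keeps its weights between stages, so
    earlier tasks keep constraining it."""
    loss_fn = nn.CrossEntropyLoss()
    for stage in range(len(task_heads)):  # stage 0: task 0; stage 1: tasks 0-1; ...
        active = list(range(stage + 1))
        params = list(shared_encoder.parameters())
        for t in active:
            params += list(task_heads[t].parameters())
        optimizer = torch.optim.Adam(params, lr=lr)

        # zip stops at the shortest loader; one mini-batch per active task per step.
        for batches in zip(*[task_loaders[t] for t in active]):
            losses = []
            for t, (x, y) in zip(active, batches):
                hidden = shared_encoder(x)               # shared representation
                logits = task_heads[t](hidden[:, 0, :])  # per-task classification head
                losses.append(loss_fn(logits, y))
            loss = torch.stack(losses).mean()            # average loss across active tasks
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```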

This is the major contribution of ERNIE 2.0: it provides a framework for continual, incremental multitask learning in which you can add your own set of pre-training tasks.

Pre-training generally uses a large amount of data from which unsupervised or weakly-supervised tasks are constructed. This weakly-supervised data is then used to learn the model's language representation.

ERNIE 2.0 also provides a set of pre-training tasks by cleverly constructing a labeled dataset from a large amount of unlabeled data. All the tasks are classification tasks and are framed as follows:

Word-aware Pre-training Tasks: These tasks help the model learn better word-level representations. There are three tasks under this setting:

  • Knowledge Masking Task: A word, named entity, or phrase is masked, and the model predicts the masked token(s) from the vocabulary.
  • Capitalization Prediction Task: The model predicts whether a word is capitalized or not. This is helpful for downstream tasks like Named Entity Recognition.
  • Token-Document Relation Prediction Task: For each word in a segment, say, “A meme is an idea, or a behaviour, ….”, the model tries to predict whether the word appears in other segments of the original document. The assumption is that if a word does appear in other segments, it is likely an important word that captures the theme of the document. This helps the model learn how important a word is for describing the document. (A toy label-construction sketch follows this list.)
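As promised above, here is a toy sketch of how labels for two of the word-aware tasks could be derived from raw, unlabeled text. The whitespace tokenization and helper names are my own simplifications; real pre-training would operate on the model's subword tokens.

```python
def capitalization_labels(tokens):
    """Capitalization Prediction: 1 if the surface form starts with an
    uppercase letter, else 0."""
    return [int(tok[:1].isupper()) for tok in tokens]

def token_document_relation_labels(segment_tokens, other_segments):
    """Token-Document Relation: 1 if a token from this segment also appears
    in another segment of the same document, else 0."""
    rest = {tok.lower() for seg in other_segments for tok in seg}
    return [int(tok.lower() in rest) for tok in segment_tokens]

# Toy usage with whitespace tokenization.
doc = ["A meme is an idea or a behaviour".split(),
       "The idea spreads from person to person".split()]
print(capitalization_labels(doc[0]))                    # [1, 0, 0, 0, 0, 0, 0, 0]
print(token_document_relation_labels(doc[0], doc[1:]))  # [0, 0, 0, 0, 1, 0, 0, 0]  ("idea" recurs)
```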

Structure-aware Pre-training Tasks: This set of tasks helps the model learn the relationships between sentences in a document.

  • Sentence Reordering Task: A paragraph is broken into several segments, which are then shuffled. The model reorders the shuffled segments into the original paragraph.
  • Sentence Distance Task: The model classifies whether two sentences are adjacent to each other, appear in the same document but are not adjacent, or come from two different documents (see the sketch after this list).
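A sketch of how the structure-aware training examples could be built from unlabeled documents. This is purely illustrative: `docs` is assumed to be a list of documents, each a list of sentence strings, and the sampling heuristics are mine, not the paper's exact recipe.

```python
import random

def sentence_reordering_example(segments):
    """Sentence Reordering: shuffle a paragraph's segments; the permutation
    used is the target the model has to recover."""
    order = list(range(len(segments)))
    random.shuffle(order)
    return [segments[i] for i in order], order

def sentence_distance_pairs(docs, n_pairs=1000, seed=0):
    """Sentence Distance: build (sent_a, sent_b, label) triples where
    0 = adjacent, 1 = same document but not adjacent, 2 = different documents."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        doc = rng.choice(docs)
        label = rng.choice([0, 1, 2])
        if label == 0 and len(doc) >= 2:
            i = rng.randrange(len(doc) - 1)
            pairs.append((doc[i], doc[i + 1], 0))
        elif label == 1 and len(doc) >= 3:
            i, j = sorted(rng.sample(range(len(doc)), 2))
            if j - i > 1:
                pairs.append((doc[i], doc[j], 1))
        elif label == 2 and len(docs) >= 2:
            other = rng.choice([d for d in docs if d is not doc])
            pairs.append((rng.choice(doc), rng.choice(other), 2))
    return pairs
```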

Semantic-aware Pre-training Tasks: These tasks help the model learn sentence-level semantics.

  • Discourse Relation Task: Consider this pair of sentences: “I took my umbrella this morning. [because] The forecast for rain was in the afternoon”. Here, because is the discourse marker. The task is to identify the discourse marker; in the process, the model has to learn the relationship between the two sentences, and hence their semantics.
  • IR Relevance Task: Given a pair consisting of a user query and a document, the task is to identify whether the document is strongly relevant, weakly relevant, or irrelevant to the query. The dataset is constructed from Baidu’s search engine logs: if a user clicked on a document whose title contained the query words, that implies strong relevance; if a user did not click on a document even though its title contained the query words, that implies weak relevance; and if the document title did not contain the query words, it is treated as irrelevant. This task helps the model learn the semantics of the query with respect to the document title. (A small sketch of this labelling heuristic follows this list.)
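Finally, a toy version of the IR relevance labelling heuristic described in the last bullet. The function name and fields are hypothetical; the actual labels come from Baidu's proprietary search logs and title matching is more sophisticated than a whitespace word overlap.

```python
def ir_relevance_label(query, title, clicked):
    """Heuristic from the blog post: clicked result whose title shares query
    words -> strongly relevant (2); shown-but-not-clicked result whose title
    shares query words -> weakly relevant (1); no query words in the title
    -> irrelevant (0)."""
    query_terms = set(query.lower().split())
    title_terms = set(title.lower().split())
    if not query_terms & title_terms:
        return 0
    return 2 if clicked else 1

# Toy usage
print(ir_relevance_label("ernie pretraining", "ERNIE 2.0 pretraining framework", clicked=True))   # 2
print(ir_relevance_label("ernie pretraining", "ERNIE 2.0 pretraining framework", clicked=False))  # 1
print(ir_relevance_label("ernie pretraining", "BERT language model", clicked=False))              # 0
```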

Experiments

The experiments were conducted on English and Chinese datasets. In order to compare with BERT, two versions of ERNIE 2.0 were built: base and large. The base and large models were trained on 48 and 64 NVIDIA V100 GPU cards, respectively.

For English, tasks from the GLUE benchmark were used. The Chinese evaluation included 9 different tasks (Reading Comprehension, Sentiment Analysis, etc.). Refer to the paper for specific details about the experiments.

Results

ERNIE 2.0 outperforms BERT and XLNet on GLUE, improving the GLUE score by about 3 points over BERT to reach 83.6. On the Chinese side, it also outperforms BERT and ERNIE 1.0 on all 9 tasks, attaining state-of-the-art results across the board.

Recently, I gave a presentation on ERNIE 2.0 at AISC, Toronto. This blog post is based on the paper. Here are the links to the slides and the presentation.
