Transfer Learning in NLP: A Survey

Er Nupur · Published in Analytics Vidhya · Nov 2, 2020

The limitations of deep learning models, such as the need for large amounts of training data and huge computing resources, have pushed research toward knowledge transfer. Many large DL models are now emerging, which further increases the need for transfer learning. This survey discusses recent advances in using transfer learning in Natural Language Processing (NLP). Most of the material in this post is taken from "A Survey of Transfer Learning in Natural Language Processing" (Alyafeai et al. 2020).

Models used for NLP

Let’s start with the different models used for NLP, grouped by their three main architectures.

  1. Recurrent-Based Models

Recurrent neural networks (RNNs) process sequential data. In RNNs we pass the previous hidden state along with each input, so the model learns the sequential context. RNNs have worked well for many tasks such as speech recognition, translation, text generation, time-series classification, and biological modeling. Unfortunately, because of their sequential nature and the use of backpropagation through time, RNNs suffer from the vanishing gradient problem: the error decays severely as it travels back through the recurrent layers. Many ideas have been proposed to overcome this issue, such as using the Rectified Linear Unit (ReLU) activation function, the Long Short-Term Memory (LSTM) architecture, bidirectional LSTMs, and Gated Recurrent Units (GRUs). GRUs are faster than LSTMs and can beat them on some tasks, such as automatically capturing the grammatical properties of input sentences.
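As a quick illustration of the recurrent family, here is a minimal PyTorch sketch of a bidirectional GRU text classifier. The model, the dimensions, and the toy inputs are my own choices, not something taken from the survey; swapping nn.GRU for nn.LSTM gives the LSTM variant.

```python
# A minimal sketch (not from the survey) of a bidirectional GRU text classifier,
# illustrating how the recurrent state carries the sequential context.
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True gives a BiGRU; swap nn.GRU for nn.LSTM to compare the two.
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)    # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)          # hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states as the sentence summary.
        summary = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(summary)

model = GRUClassifier(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 20)))   # 4 toy sentences of 20 tokens each
print(logits.shape)                                  # torch.Size([4, 2])
```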

2. Attention-Based Models

In addition to the above problems, an RNN gives the same weight to every word in the sequence with respect to the word currently being processed. It also aggregates the sequence activations into a single vector, which causes the learning process to forget words that were fed in much earlier. Attention-based models, on the other hand, attend to each input word differently based on a similarity score. Attention can be applied between different sequences, or within the same sequence, which is called self-attention.
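To make this concrete, here is a minimal sketch of single-head scaled dot-product self-attention. It is a generic illustration, not the attention mechanism of any particular model in the survey, and the projection matrices and dimensions are my own assumptions.

```python
# A minimal sketch of (single-head) scaled dot-product self-attention:
# every token is compared with every other token via a similarity score,
# so no information has to be squeezed into a single recurrent state.
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise similarities
    weights = F.softmax(scores, dim=-1)                       # attention distribution
    return weights @ v                                        # weighted sum of values

d_model, d_k = 64, 64
x = torch.randn(2, 10, d_model)                # 2 toy sentences, 10 tokens each
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 10, 64])
```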

3. CNN-Based Models

Convolutional neural networks (CNNs) were initially proposed for image recognition tasks such as character recognition. They use convolutional layers to extract features and max-pooling layers to reduce the spatial size of the extracted features. In NLP, CNNs have been used successfully for sentence classification tasks such as movie-review classification and question classification, and character-level CNNs have been used for text classification. CNNs have also been used in language modeling, where gated convolutional layers preserve larger contexts and can be parallelized, unlike traditional recurrent neural networks.
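As a concrete illustration, here is a rough PyTorch sketch of a CNN sentence classifier in the spirit described above. The filter sizes, dimensions, and toy inputs are my own assumptions, not taken from any specific paper in the survey.

```python
# A rough sketch of a CNN for sentence classification: 1D convolutions over word
# embeddings act as n-gram feature detectors, and max-pooling keeps the strongest
# feature per filter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Each conv detects n-gram patterns; max-pooling over time keeps the strongest one.
        pooled = [F.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=-1))

model = TextCNN(vocab_size=10_000)
print(model(torch.randint(0, 10_000, (4, 30))).shape)  # torch.Size([4, 2])
```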

Language Models

Language modeling is the learning of a probability distribution over a set of tokens taken from a fixed vocabulary. The following are the different approaches in the literature.

  1. Unidirectional LM: In this technique, each token can only attend to tokens on one side of it in the current context, either to the left or to the right. It is also known as auto-regressive encoding.
  2. Bidirectional LM: In this technique, each token can attend to any other token in the current context. With this setup, plain next-word prediction becomes trivial, since any token can already see the next word. To overcome that we generally use masked language models.
  3. Masked LM: This technique is generally used in bidirectional LMs: we randomly mask some tokens in the current context and then predict those masked tokens. It is also called denoising auto-encoding. (See the sketch after this list.)
  4. Sequence-to-sequence LM: This technique splits the input into two parts. In the first part, every token can see the context of any other token in that part, but in the second part, every token can only attend to tokens to its left.
  5. Permutation LM: This language model combines the benefits of both the auto-regressive and auto-encoding objectives.
  6. Encoder-Decoder LM: In contrast to the other approaches, which use a single stack of encoder or decoder blocks, this approach uses both.
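To make the masked LM objective concrete, here is a minimal sketch using the Hugging Face transformers library. The bert-base-uncased checkpoint and the example sentence are my own illustrative choices, not something prescribed by the survey.

```python
# A minimal sketch of masked language modeling with Hugging Face transformers.
# The checkpoint name is an illustrative choice, not one mandated by the survey.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was pre-trained to recover randomly masked tokens from bidirectional context.
for prediction in fill_mask("Transfer learning has changed the [MASK] of NLP."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```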

The following table shows a comparison between various pre-trained models in the literature. Source: https://arxiv.org/pdf/2007.04239.pdf

Datasets

There are many datasets that have been used for NLP tasks in the past. The following table provides a summary of some of the datasets.

Summary of datasets used for transfer learning. Source: https://arxiv.org/pdf/2007.04239.pdf

Transfer Learning

Now, we are going to discuss transfer learning in NLP. Given a source domain-task pair (Ds, Ts) and a different target domain-task pair (Dt, Tt), transfer learning is the process of using the source domain and task in the learning process of the target domain and task. In mathematical terms, the objective of transfer learning is to learn the target conditional probability distribution P(Yt|Xt) in Dt with the information gained from Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt. The following table compares the different scenarios in which the domain pair or the task pair differs.

All possible combinations of domain and task pair. Source: https://arxiv.org/pdf/2007.04239.pdf

Types of Transfer Learning

Transfer Learning in NLP can be broadly categorized into two:

  1. Transductive Transfer Learning
  2. Inductive Transfer Learning

Transductive Transfer Learning

Transductive Transfer Learning applies when the source and target tasks are the same, but the target domain has no labeled data or only very few labeled samples. It can further be divided into the following sub-categories:

A. Domain Adaptation: This involves learning a different data distribution in the target domain. It is useful when the new task to train on has a different distribution or when labeled data is scarce. In one recent work, researchers applied a teacher-student model in an unsupervised approach to transfer knowledge from multiple source domains to a single target domain. To measure domain similarity they used three measures: Renyi divergence, Jensen-Shannon divergence, and Maximum Mean Discrepancy. The model achieved state-of-the-art results on 8 of the 12 domain pairs for single-source unsupervised domain adaptation.
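To make one of those similarity measures concrete, here is a small sketch (my own illustration, not the paper's code) that compares two domains via the Jensen-Shannon divergence of their term distributions, using scipy's jensenshannon function; the toy texts, vocabulary construction, and smoothing are assumptions.

```python
# A small sketch of one domain-similarity measure mentioned above: Jensen-Shannon
# divergence between the term distributions of a source and a target domain.
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def term_distribution(texts, vocabulary):
    counts = Counter(word for text in texts for word in text.lower().split())
    freq = np.array([counts[word] for word in vocabulary], dtype=float) + 1.0  # add-one smoothing
    return freq / freq.sum()

source_texts = ["the movie was great", "a boring film with a weak plot"]
target_texts = ["battery life is great", "the screen broke after a week"]

vocabulary = sorted({w for t in source_texts + target_texts for w in t.lower().split()})
p = term_distribution(source_texts, vocabulary)
q = term_distribution(target_texts, vocabulary)

# scipy returns the JS *distance*; squaring it gives the divergence.
print("JS divergence:", jensenshannon(p, q, base=2) ** 2)
```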

Another work used adversarial domain adaptation for detecting duplicate questions. This approach had three main components: an encoder, a similarity function, and a domain adaptation module. The encoder encoded the question and was optimized to fool the domain classifier into believing the question came from the target domain. The similarity function estimated the probability that a pair of questions were duplicates. The domain adaptation component was used to decrease the difference between the target and source domain distributions. This approach achieved an average improvement of around 5.6% over the best benchmark across different pairs of domains.
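The "fool the domain classifier" objective is commonly implemented with a gradient reversal layer. Below is a generic PyTorch sketch of that trick, not the paper's actual code; the class and function names are my own.

```python
# A generic PyTorch sketch of a gradient reversal layer, the usual trick behind
# "the encoder is optimized to fool the domain classifier". Not the paper's exact code.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient flowing back into the encoder, so that
        # minimizing the domain-classifier loss *maximizes* domain confusion upstream.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: domain_logits = domain_classifier(grad_reverse(encoder_output))
```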

B. Cross-lingual transfer learning: This involves adapting to a different language in the target domain. This approach is useful when we want to use a high-resource language to learn the corresponding task in a low-resource language. In one work, researchers proposed a model for POS tagging in a cross-lingual setting where the input and output languages have different input sizes. The model used two bidirectional LSTMs (BLSTMs), a common one and a private one. The common BLSTM shared its parameters across languages, whereas the private BLSTM had language-specific parameters. The outputs of the two modules were then combined to predict POS tags by optimizing a cross-entropy loss. Language-adversarial training was used to force the common BLSTM to be language-agnostic. This approach showed significant results for POS tagging on 14 languages without any linguistic knowledge about the relation between the source and target languages.
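The shared/private split can be sketched roughly as follows. This is my own reconstruction in PyTorch, not the paper's code, and omits the adversarial language discriminator; the class name, dimensions, and language codes are assumptions.

```python
# A rough sketch of the shared/private idea: a common BiLSTM shared across languages
# plus a private, language-specific BiLSTM, whose outputs are concatenated before tagging.
import torch
import torch.nn as nn

class SharedPrivateTagger(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_tags, languages):
        super().__init__()
        self.shared = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.private = nn.ModuleDict({
            lang: nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            for lang in languages
        })
        self.tagger = nn.Linear(4 * hidden_dim, num_tags)

    def forward(self, embeddings, lang):
        shared_out, _ = self.shared(embeddings)          # language-agnostic features
        private_out, _ = self.private[lang](embeddings)  # language-specific features
        return self.tagger(torch.cat([shared_out, private_out], dim=-1))

model = SharedPrivateTagger(embed_dim=64, hidden_dim=128, num_tags=17, languages=["en", "es"])
print(model(torch.randn(2, 12, 64), lang="es").shape)   # torch.Size([2, 12, 17])
```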

In another work, a new dataset was used to evaluate three different cross-lingual transfer methods on the tasks of user intent classification and time-slot detection. The dataset contained 57k annotated utterances in English, Thai, and Spanish, categorized into three domains: reminder, weather, and alarm. The three cross-lingual transfer methods were translating the training data, using cross-lingual pre-trained embeddings, and a novel method of using a multilingual machine translation encoder as contextual word representations. The latter two methods outperformed the translation method on the target language that had only a few hundred training examples, i.e., a low-resource target language.

Inductive Transfer Learning

Inductive Transfer Learning applies when the source and target tasks are different and labeled data is available in the target domain. It can be divided into two sub-categories:

A. Sequential Transfer Learning: This involves learning multiple tasks in a sequential fashion. It is further divided into the following sub-categories:

Sequential Fine-Tuning: Fine-tuning involves training the pre-trained model on the target task. A huge amount of work has been done in this category in the past few years. One recent work introduced a unified pre-trained language model, UNILM. It combines three different training objectives (unidirectional, bidirectional, and sequence-to-sequence) to pre-train a model in a unified way. The UNILM model achieved state-of-the-art results on tasks including generative question answering, abstractive summarization, and document-grounded dialog response generation.
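For the general recipe, here is a minimal sequential fine-tuning sketch using the Hugging Face transformers and datasets libraries. The distilbert-base-uncased checkpoint and the IMDB dataset are illustrative choices on my part, not the models or data used in the works discussed above.

```python
# A minimal sketch of sequential fine-tuning: load a pre-trained checkpoint, then
# continue training all of its weights on a labeled target task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A small slice of a sentiment dataset just to keep the sketch cheap to run.
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()   # all pre-trained weights are updated on the target task
```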

Another work studied the knowledge-retrieval performance of large language models. Researchers investigated open-domain question answering under the constraint that no external resources can be looked up to answer the questions. The investigation used a pre-trained T5 model, which has 11 billion parameters and can therefore store a large amount of knowledge that can be extracted for a specific task. T5 is also a text-to-text model, which makes it suitable for open-domain question answering. The task was mapped to T5 by using the question, together with a task-specific label, as the input and predicting the answer as the output. The results suggest that this approach outperformed models that explicitly look up answers in an external knowledge source.

Adapter Modules: Adapters are a compact and extensible transfer learning method for NLP. They provide parameter efficiency by adding only a few trainable parameters per task, and as new tasks are added the previous ones don't need to be revisited. In recent work, adapter modules were used to share parameters between different tasks while fine-tuning the BERT model. Projected attention layers (PALs) were used: low-dimensional multi-head attention layers trained in parallel with BERT's attention layers. The model was evaluated on the GLUE benchmark and obtained state-of-the-art results on text entailment while remaining parameter-efficient.
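For intuition, here is a minimal sketch of the basic bottleneck-adapter idea (not the PALs variant described above, and not any paper's exact code): a small down-project/up-project block with a residual connection, inserted into a frozen pre-trained network so that only these few parameters are trained per task. The dimensions are my own assumptions.

```python
# A minimal sketch of a bottleneck adapter: a tiny trainable block added inside a
# frozen pre-trained model, giving parameter-efficient per-task fine-tuning.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.activation = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the frozen model's behaviour as the starting point.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

adapter = Adapter(hidden_dim=768)
out = adapter(torch.randn(2, 10, 768))
trainable = sum(p.numel() for p in adapter.parameters())
print(out.shape, f"trainable params per adapter: {trainable:,}")
```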

Feature-Based: In this approach, the representations of a pre-trained model are fed to another model. One benefit is that the task-specific model can be reused for similar data; also, extracting the features once saves a lot of compute if the same data is used repeatedly. In one recent work, researchers used a semi-supervised approach for sequence labeling. A pre-trained neural language model, trained in an unsupervised way, was used. It was a bidirectional language model whose forward and backward hidden states were concatenated. The output was used to augment the token representations fed to a supervised sequence tagging model (TagLM), which was then trained in a supervised way to output the tag of each token. The datasets used were CoNLL 2003 NER and CoNLL 2000 chunking. The model achieved state-of-the-art results on both tasks compared to other forms of transfer learning.
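As a rough sketch of the feature-based recipe (not TagLM itself), the snippet below freezes a pre-trained encoder from the Hugging Face transformers library and feeds its contextual representations to a separate tagging head. The checkpoint name, the linear tagger, and the tag count are my illustrative choices.

```python
# A small sketch of the feature-based recipe: freeze a pre-trained encoder, extract
# contextual representations once, and feed them to a separate task model.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)
encoder.eval()                      # the encoder is used as a frozen feature extractor

sentence = "TagLM augments token representations with pre-trained LM features"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():               # no gradients flow into the pre-trained model
    features = encoder(**inputs).last_hidden_state   # (1, num_tokens, 768)

# Only this small tagging head would be trained on the labeled sequence-tagging data.
tagging_head = nn.Linear(features.size(-1), 9)        # e.g., 9 NER tags
print(tagging_head(features).shape)
```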

Zero-shot: This is the simplest approach: given a pre-trained model, we don't apply any training procedure to optimize or learn new parameters. In a recent study, researchers used zero-shot transfer for text classification. Each classification task was modeled as a text entailment problem, where the positive class meant entailment and the negative class meant no entailment. A pre-trained BERT model was then used in a zero-shot scenario to classify texts in different tasks like emotion detection, topic categorization, and situation frame detection. This approach achieved better accuracy on two of the three tasks compared to unsupervised baselines like Word2Vec.
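Here is a minimal sketch of entailment-based zero-shot classification using the transformers pipeline. The facebook/bart-large-mnli checkpoint and the example text are my illustrative choices, not what the study above used.

```python
# A minimal sketch of entailment-based zero-shot text classification with the
# Hugging Face transformers pipeline. The NLI checkpoint is an illustrative choice.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The power went out across the whole district after the storm.",
    candidate_labels=["weather", "sports", "politics"],
)
# Each candidate label is scored via a textual-entailment hypothesis built from it.
print(list(zip(result["labels"], [round(s, 3) for s in result["scores"]])))
```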

B. Multi-Task Learning: This involves learning multiple tasks at the same time. For instance, if we are given a pre-trained model and want to transfer the learning to multiple tasks, all tasks are learned in a parallel fashion.

Multi-task Fine-Tuning: In recent work, researchers used this approach to explore the effect of a unified text-to-text transfer transformer (T5). The architecture is similar to the original Transformer with an encoder-decoder network, but it uses fully-visible masking instead of causal masking, especially for inputs that require predictions based on a prefix, as in translation. The dataset used for training the models was created from the Common Crawl corpus and was around 750GB. The largest model has around 11 billion parameters, trained on this large dataset. Multi-task pre-training was used so that the models perform well on different tasks, with each task indicated by a prefix such as "Translate English to German". After fine-tuning, the model achieved state-of-the-art results on different tasks like text classification, summarization, and question answering.
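To illustrate the prefix mechanism, here is a minimal sketch using the publicly released t5-small checkpoint from Hugging Face transformers. The checkpoint size, prompts, and generation settings are illustrative choices, not the 11-billion-parameter setup discussed above.

```python
# A minimal sketch of T5's text-to-text interface: the task is selected by a prefix
# in the input string, and the same model handles different tasks.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning reuses knowledge from a source task to help a target task.",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```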

Conclusions and future scope

In this survey, we see that attention-based models are more popular than RNN-based and CNN-based language models. BERT appears to be the default architecture for language modeling; its bidirectional architecture makes it successful in many downstream tasks. Among the transfer learning techniques used in NLP, sequential fine-tuning seems to be the most popular approach, while multi-task fine-tuning has gained popularity in recent years as many studies found that training on multiple tasks at the same time yields better results. We also see that text classification datasets are more widely used than those for other NLP tasks, because fine-tuning models on such tasks is easier.

For future work, it is recommended to use bidirectional models like BERT for specific tasks like abstractive question answering, sentiment classification, and part-of-speech tagging, and models like GPT-2 and T5 for generative tasks like generative question answering, summarization, and text generation. In the future, adapter modules could also replace sequential fine-tuning, as they show comparable results to traditional fine-tuning while being faster and more compact thanks to parameter sharing. To conclude, I think extensive research should be done on reducing the size of these large language models so that they can be easily deployed on embedded devices and on the web.

References

Alyafeai, Z., AlShaibani, M. S., & Ahmad, I. (2020). A Survey on Transfer Learning in Natural Language Processing. arXiv:2007.04239. https://arxiv.org/pdf/2007.04239.pdf
