Transfer learning and NLP — an unforeseen amalgamation
In this post, we will see how deep learning can be used to solve traditional NLP problems quickly and easily. I will be using the fastai library. I recommend readers go through the image classification post before reading this one, as some of the terms I use are explained there.
A quick digest
NLP stands for Natural Language Processing: we take some text as input and do something with it. In particular, we will classify the input text into categories, using the IMDB dataset to classify movie reviews as positive or negative. Beyond sentiment analysis, text classification can be used in the following areas:
- Spam prevention
- Identifying fake news
- Finding a diagnosis from medical reports
The dataset we will be using has around 25,000 reviews, and for each review a label indicating whether the user liked the movie or not. That is a pretty small amount of data for teaching a model enough of the English language to detect whether a review is positive or negative. A neural network is a combination of linear matrix multiplications followed by non-linear activation functions, and we want to train ours on those 25,000 reviews and then use it to classify any review. The problem gets even harder when a review is sarcastic.
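To make that concrete, here is a minimal sketch in plain PyTorch (which fastai builds on) of what "linear matrix multiplications followed by a non-linear activation" looks like; the layer sizes are made up purely for illustration:
import torch
import torch.nn as nn

# A tiny neural net: linear layer -> non-linearity -> linear layer.
# The sizes (100 inputs, 50 hidden units, 2 output classes) are arbitrary.
net = nn.Sequential(
    nn.Linear(100, 50),   # matrix multiplication + bias
    nn.ReLU(),            # non-linear activation
    nn.Linear(50, 2),     # another matrix multiplication, one output per class
)

x = torch.randn(4, 100)   # a batch of 4 made-up feature vectors
print(net(x).shape)       # torch.Size([4, 2])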
So, we will use transfer learning for this purpose. It is a great trick that lets us apply deep learning to pretty much any text classification problem, and fastai has successfully implemented transfer learning for NLP problems.
Transfer learning in NLP
Just as we used a model pre-trained on ImageNet to classify our own dataset of images, here we will use a pre-trained language model.
❓ What is a language model
A language model has a specific meaning in NLP: it is a model that knows about English and helps in predicting the next possible word, which requires a good deal of word and world knowledge. Some examples of the kind of knowledge involved:
- “I want to know about machine ____”: parts, learning, manufacturing
- “Fastai has made everything ____”: easy, fast
Traditional NLP approach
Previous approaches in NLP were based on an n-gram strategy: counting how often particular pairs or triplets of words tend to appear next to each other. You can guess how limited that is. Thanks to neural nets and, of course, fastai, we can do much better.
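To make the n-gram idea concrete, here is a toy sketch of my own (not how any production system was built) that counts which word most often follows a given pair of words:
from collections import Counter, defaultdict

# Toy trigram model: count which word follows each pair of words.
corpus = "i want to know about machine learning i want to learn fastai".split()
following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

# "Predict" the next word after "i want" by picking the most frequent follower.
print(following[("i", "want")].most_common(1))   # [('to', 2)]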
Self-supervised learning
❓ What is Wikitext 103
Wikitext 103 is a dataset built from a subset of the largest articles on Wikipedia, containing over 100 million tokens, and fastai provides a language model pre-trained on it. The model knows how to read, understand, and predict the next word in Wikipedia-style English. That's a lot of information.
Why is that useful? Because at that point, I've got a model that probably knows how to complete sentences, so it knows quite a lot about English and quite a lot about how the real world works.
Since we now have a pre-trained Wikitext language model, we can use it to build a model that's good at predicting the next word of movie reviews. We will fine-tune this pre-trained model to write movie reviews in particular. For all of this pre-training and language-model fine-tuning, we don't need any labels at all; it is what the researcher Yann LeCun calls self-supervised learning. So our only work is to fine-tune the model on movie reviews. Once the model knows about the movie-review world, predicting the labels becomes a much easier task.
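Here is a tiny illustration (my own toy example, not fastai code) of why no labels are needed: for a language model, the target at every position is simply the next token of the text itself.
# The "labels" of a language model come for free from the text:
# at each position, the target is just the next token.
tokens = ["i", "want", "to", "know", "about", "machine", "learning"]
inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f"after '{x}' the model should predict '{y}'")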
I will import the text module from the fastai library for text classification.
from fastai.text import *
Fastai has an inbuilt dataset for IMDB reviews. We will start with a small sample to understand what the data looks like; later on, we will download the complete dataset. Let us untar the sample dataset.
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()
It’s just in a CSV file, so we can use pandas to read it, and we can take a little look.
df = pd.read_csv(path/'texts.csv')
df.head()
Now there are two ways to prepare the data. One uses the data-block API, and the other does not. Let's first try without the data-block API.
— 1st way
data_lm = TextDataBunch.from_csv(path, 'texts.csv')
You can save the data bunch, which means the pre-processing that has been done doesn't have to be done again; you can just load it back later.
data_lm.save()
You may play around with the training data bunch like below:
data_lm.train_ds[0][0], data_lm.train_ds[0][1]
Now, images can be fed straight into a model because pictures are just bunches of float numbers. But text is composed of words, and we can't apply mathematical functions to words directly. We first have to convert them to numbers. This is done in two different steps: tokenization and numericalization. A TextDataBunch does all of that behind the scenes for you.
❓ What is tokenization
Tokenization is the process of splitting text into a standard form of tokens, where each token roughly corresponds to a word. The better the tokenization, the easier it is for the model to learn. For example:
- "didn't" gets tokenized to — did, n't
- "you're" gets tokenized to — you, 're
- Rare or special words get replaced by tokens like xxunk, xxpad, etc. Anything starting with xx in fastai is a special token.
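If you want to see these rules applied to a raw string, fastai v1 exposes its Tokenizer directly. This is an optional sketch; the exact output shown in the comment is only indicative and may differ between versions.
from fastai.text import Tokenizer

# Apply fastai's default tokenization rules to a raw string.
tok = Tokenizer()
print(tok.process_all(["I didn't like it. You're WRONG!"]))
# Something like: [['xxmaj', 'i', 'did', "n't", 'like', 'it', '.', 'xxmaj', 'you', "'re", 'xxup', 'wrong', '!']]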
We can check the tokens as below:
data = load_data(path)
data.show_batch()
❓ What is numericalization
It is the process of assigning every token a unique number. The number corresponds to their position in the vocabulary.
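A toy sketch of the idea (my own illustration; fastai's real vocabulary is ordered by word frequency rather than alphabetically):
# Toy numericalization: give every unique token an id equal to its position in the vocab.
tokens = ["xxbos", "i", "liked", "this", "movie", "and", "i", "liked", "it"]
itos = sorted(set(tokens))                  # id -> token
stoi = {t: i for i, t in enumerate(itos)}   # token -> id
print([stoi[t] for t in tokens])            # the numericalized text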
We can check the vocabulary like below:
data.vocab.itos
(itos maps the numbers assigned to tokens back to strings: int-to-string)
And if we look at what's in our datasets, we'll see the tokenized text as the representation:
data.train_ds[0][0]
But the underlying data is all numbers:
data.train_ds[0][0].data[:10]
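As a quick sanity check, you can map those numbers back to tokens yourself using the vocab (a small sketch reusing the data object above):
# Convert the first ten ids of the first training example back into tokens.
ids = data.train_ds[0][0].data[:10]
print([data.vocab.itos[i] for i in ids])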
— 2nd way
Now, we will create the data bunch using fastai data-block API.
data = (TextList.from_csv(path, 'texts.csv', cols='text')
.split_from_df(col=2)
.label_from_df(cols=0)
.databunch())
- What kind of list are you creating (i.e., what's your independent variable)? In this case, my independent variable is the text.
- What is it coming from? A CSV.
- How do you want to split it into validation versus training? In this case, column number two was the is_valid flag.
- How do you want to label it? With positive or negative sentiment, for example. Column zero had that.
- Then turn that into a data bunch.
Language model
Here we will build our language model by fine-tuning the pre-trained Wikitext model. Before creating the language model, let us grab the complete IMDB dataset.
path = untar_data(URLs.IMDB)
path.ls()
(path/'train').ls()
The reviews are in a training and test set following an ImageNet-style folder structure. Like in computer vision, we'll use a model pre-trained on a bigger dataset (wikitext-103). That model has been trained to guess the next word, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.
We are going to use that ‘knowledge’ of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pre-trained model to our particular dataset. Because the English of the reviews left by people on IMDB isn’t the same as the English of Wikipedia, we’ll need to adjust the parameters of our model by a little bit. Plus there might be some words that would be extremely common in the reviews dataset but would be barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on. This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let’s create our data object with the data block API.
bs = 64
data_lm = (TextList
.from_folder(path)
.filter_by_folder(include=['train', 'test'])
.split_by_rand_pct(0.1)
.label_for_lm()
.databunch(bs=bs))
data_lm.save('data_lm')
- from_folder — We want to grab all of our data from the folder whose path we pass to the function.
- filter_by_folder — We want to keep only the data inside certain folders. Separate folders for train and test data are already provided.
- split_by_rand_pct — We randomly split off 10% of the data for validation. We can use the test set as well, because we are training a model to learn the language, so we do not need any labels.
- label_for_lm — We label the data for a language model, where the labels are simply the next words of the text. 🍶
data_lm = TextLMDataBunch.load(path, 'data_lm', bs=bs)
data_lm.show_batch()
Training
Instead of using cnn_learner, we will use language_model_learner. As usual, when we create a learner, you have to pass in two things:
- The data: so here’s our language model data
- What pre-trained model we want to use: here, the pre-trained model is the Wikitext 103 model.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.lr_find()
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
learn.save('fit_head')
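As an aside, right after learn.lr_find() (and before fitting) you can visualise its result with fastai's recorder to justify the choice of 1e-2; a minimal optional sketch:
# Plot loss vs. learning rate from lr_find and pick a rate where the loss is still falling steeply.
learn.recorder.plot()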
To complete the fine-tuning, we can then unfreeze and launch new training.
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))
learn.save('fine_tuned')
Prediction
TEXT = "I liked this movie because"
N_WORDS = 40
learn.predict(TEXT, N_WORDS, temperature=0.75)
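The temperature argument controls how adventurous the sampling is; a quick optional comparison using the same call as above, just with different temperatures:
# Lower temperature -> more conservative, repetitive text; higher -> more random text.
print(learn.predict(TEXT, N_WORDS, temperature=0.5))
print(learn.predict(TEXT, N_WORDS, temperature=1.0))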
We now have a language model that can predict the next words of a movie review. Next, we will use this language model to build our classifier.
At this point, we have a movie review model. So now we’re going to save that to load it into our classifier (i.e., to be a pre-trained model for the classifier). But I don’t want to keep the whole thing. I will save only the first part of the learner, which is also known as an encoder. 🍶
learn.save_encoder('fine_tuned_enc')
❓ What is inside save_encoder
The language model learner used above is an RNN model that takes a sentence as input and spits out a vector of activations representing the meaning and structure of the input sentence so far. On top of that representation sits another piece that takes this vector as input and outputs predictions about the next word. So the learner basically consists of two parts:
- Encoder — the part that takes the sentence as input and spits out a representation of the sentence so far
- Classifier — the part that takes that encoded representation and predicts the next word
Therefore, we store only the encoder part of the model, so that we can reuse those weights to create our own sentiment classifier model (we no longer care about the next-word bit), rather than reusing the whole language model defined above.
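If you are curious what the encoder actually is, you can print the learner's model. In fastai v1 the language model is a SequentialRNN whose first element is the AWD_LSTM encoder; the module layout described below is an assumption and may vary between versions.
# The language model is a two-part SequentialRNN: encoder followed by the decoder head.
print(learn.model)
print(learn.model[0])   # the encoder, i.e. the part that save_encoder('fine_tuned_enc') stored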
Classifier
Now we’re ready to create our classifier. Step one, as per usual, is to create a data bunch, and we’re going to do the same thing:
data_clas = (TextList
.from_folder(path, vocab=data_lm.vocab)
.split_by_folder(valid='test')
.label_from_folder(classes=['neg', 'pos'])
.filter_missing_y()
.databunch(bs=50))
data_clas.save('tmp_clas')
- from_folder — Grab the data from the folder at the given path. We want to use the same vocab as our language model.
- split_by_folder — We declare the test folder to be the validation dataset.
- label_from_folder — We use the folder names (neg, pos) as the labels.
- filter_missing_y — Removes samples with missing labels from both x and y.
data_clas = load_data(path, 'tmp_clas', bs=bs)
data_clas.show_batch()
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.freeze()
This time, rather than creating a language model learner, we're building a text classifier learner. But again, it's the same thing: pass in the data we want, and figure out how much regularization we need. If you're overfitting, you can increase this number (drop_mult); if you're underfitting, you can decrease it.
And most importantly, load in our pre-trained model. Remember, specifically it's the half of the model called the encoder that we want to load in.
learn.lr_find()
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))
learn.save('first')
I'm not going to say unfreeze. Instead, I'm going to say freeze_to(-2), which unfreezes just the last two layer groups rather than the whole thing. As per fastai, it really helps with text classification not to unfreeze the whole model at once, but to unfreeze one layer at a time.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
learn.save('second')
- unfreeze the last two layers
- train it a little bit more
- unfreeze the next layer again
- train it a little bit more
- unfreeze the whole thing
- train it a little bit more
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))
learn.save('third')
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
❓ Why are we dividing the learning rate by 2.6**4
As per discriminative learning rates, the difference between the bottom of the slice and the top of the slice is basically the difference between how quickly the lowest layers of the model learn versus the highest layers. As per the fastai docs, dividing the top of the slice by this magic number gives much better results.
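To see what that slice works out to numerically, here is a rough sketch. It assumes the learning rates are spread geometrically across the layer groups (with 5 groups, which is what I believe fastai v1 uses for this classifier); the numbers are purely illustrative.
import numpy as np

lr_max = 1e-2
lr_min = lr_max / 2.6**4            # ~2.2e-4
# Spread learning rates geometrically across the (assumed) 5 layer groups.
lrs = np.geomspace(lr_min, lr_max, num=5)
print(lrs)                          # each group trains roughly 2.6x faster than the one below it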
That's all. We have used transfer learning for NLP in detail. We also used some training hyperparameters like moms and drop_mult, which I will try to explain in another post. Feel free to explore the fastai library.