Transfer learning in NLP using fastai

…and why it works better than n-grams


Find the full Jupyter notebook here.

Traditional approaches

Previous approaches to sentiment analysis involved something called n-grams. We would first convert our text into tokens (our vocabulary) and then use those tokens to represent the sentences in our text as a sparse matrix.

For example, if our vocabulary were ["the", "it", "was", "when", "which", "edit", "introduction", "best", "video", "time"] and we saw a sentence like “it was the best time”, then this sentence would be represented as [1,1,1,0,0,0,0,1,0,1]. This would be done for every sentence in our data, and the final matrix would be used for training. However, this approach would misclassify things like “not good”, which is why we would also use combinations of words as tokens (“it was”, “was the”, and so on). Find the full Jupyter notebook here to learn more about this and similar approaches.
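As a minimal sketch of this bag-of-words / n-gram idea (using scikit-learn’s CountVectorizer, which is not what the article’s notebook necessarily uses; the sentences are just examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy sentences to vectorize
sentences = ["it was the best time", "it was not good when it ended"]

# binary=True gives the 0/1 representation described above;
# ngram_range=(1, 2) also creates two-word tokens like "it was", "was the"
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
matrix = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence

print(vectorizer.get_feature_names_out())  # the learned vocabulary (alphabetical order)
print(matrix.toarray())                    # each row marks which tokens appear in a sentence
```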


An important drawback

The drawback of representing text as tokens in this way is that, essentially, the model cannot understand English. The structure of the sentence is not really taken into consideration; only the frequency of the words is used. It does not know the difference between “I want to eat a hot __” and “It was a hot ___”. And because it cannot understand English, it cannot understand movie reviews and identify whether someone really liked a movie or not.

In order to overcome this drawback, we will once again resort to transfer learning. We will use a pre-trained model called a language model. What is it, and how do we train a model with it? Let’s find out.

The dataset

For this article, we will be using Cornell’s movie review dataset v2.0, which has 1000 positive and 1000 negative reviews. Using this data, we create our own data bunch as follows:
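A sketch of this step in fastai v1, assuming the dataset has been unpacked into a folder with pos and neg subfolders of .txt reviews (the path and batch size are just examples):

```python
from fastai.text import *

# Hypothetical location of the unpacked Cornell polarity dataset
path = Path('data/review_polarity/txt_sentoken')

# Data bunch for the language model: the folder labels are ignored,
# because the text itself is what the model learns to predict
data_lm = (TextList.from_folder(path)
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=48))

# Data bunch for the classifier: reuse the language model's vocab so the
# token numbering stays consistent, and take the labels from the folder names
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             .split_by_rand_pct(0.1)
             .label_from_folder()
             .databunch(bs=48))
```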

Creating a data bunch tokenizes the text, i.e. it creates a separate token for every word. Most of the tokens are words, but some of them are question marks, braces, and apostrophes.

We now find all the unique tokens and calculate their frequencies. This big list of tokens is called the vocab. Here are the first ten in order of frequency:
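You can inspect the vocab directly; a small sketch (assuming the data_lm data bunch from above):

```python
# itos lists tokens by decreasing frequency, with fastai's special tokens
# (xxunk, xxpad, xxbos, ...) placed at the front
data_lm.vocab.itos[:10]

# Total number of tokens kept in the vocab
len(data_lm.vocab.itos)
```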

We see a lot of junk words here starting with xx. Here’s the thing.

Every word in our vocab is going to require a separate row in a weight matrix in our neural net. So, to keep that weight matrix from getting too huge, we restrict the vocab to no more than (by default) 60,000 words. And if a word doesn’t appear more than two times, we don’t put it in the vocab either.
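If you want to set these caps explicitly, fastai v1 lets you pass them through the numericalization processor. A sketch with the values above spelled out (the exact default values may differ between fastai versions):

```python
# Tokenize, then numericalize with an explicit vocab size cap and minimum frequency
processor = [TokenizeProcessor(),
             NumericalizeProcessor(max_vocab=60000, min_freq=2)]

data_lm = (TextList.from_folder(path, processor=processor)
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=48))
```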

In this way, we keep the vocab to a reasonable size. When you see these xxunk, they’re actually unknown tokens. It just means this was something that was not a common enough word to appear in our vocab. There are some special tokens though.

For example, xxfld: if your documents have separate parts such as a title, summary, abstract, and body, each part gets its own field, and the fields are numbered (e.g. xxfld 2).

Let’s now look at one review and its corresponding representation as an array of token numbers.
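A quick way to see both views of the same example in fastai v1 (a sketch; the index 0 is arbitrary):

```python
# Grab one training example: x is a fastai Text object
x, y = data_lm.train_ds[0]

x.text[:300]   # the tokenized review, joined back into a string
x.data[:20]    # the same tokens as an array of vocab indices
```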

Now that we have our data in place, we can start thinking about modelling. Our approach will look like this:

As discussed earlier, instead of just using one bit for every word and then deciding whether a person liked a movie or not, we want our model to learn some English. For this, we are going to need a much bigger set of documents than just our review data.

Wikitext-103 is a subset of Wikipedia made up of most of its largest articles, with a little bit of preprocessing, and it is available for download. Jeremy Howard from fastai used this dataset to build a language model.

Language model

A language model is a model that predicts the next word in a sentence. To predict the next word in a sentence, you need to know quite a lot about the English language. By this we mean being able to complete sentences like the following:

I want to eat a hot __. (dog)
It was a hot __. (day)

So Jeremy built a neural net that predicts the next word in every significantly sized Wikipedia article. That is a lot of information, something like billions of tokens. With billions of words to predict, we make mistakes in those predictions, get gradients from them, update our weights, and keep improving until we get pretty good at predicting the next word in a Wikipedia article.

Why is this useful?

Because at that point we have a model that knows how to complete sentences like the ones above. So it knows quite a lot about English and quite a lot about how the world works; for example, it can tell who the president was in different years.

On top of this language model, we show it our own data (without the labels).

This way we train it from being good at predicting words in Wikipedia articles to being good at predicting words in movie reviews (our specific case).
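In fastai v1 this fine-tuning step looks roughly like the sketch below; the dropout multiplier, learning rates, and epoch counts are just illustrative choices:

```python
# AWD_LSTM comes with weights pre-trained on Wikitext-103 (pretrained=True is the default)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

# Fine-tune the new head first, then unfreeze and train the whole network
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3)
```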

After training for a few epochs, we get an accuracy of ~30%, which is pretty good for a language model: it means the model guesses the exact next word correctly about a third of the time.

Let’s try to make some predictions using this language model.
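A sketch of sampling text from the fine-tuned language model (the prompt, word count, and temperature are just example values):

```python
# Generate 40 words of text following the given prompt
learn.predict("I liked this movie because", n_words=40, temperature=0.75)
```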

And one more!

Notice that the output does not make much sense, nor is it grammatically correct, but it sounds vaguely like English. We can now save the encoder part of this model (the part that understands English) and use it as a pretrained model.
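Saving the encoder is a one-liner in fastai v1 (the name 'fine_tuned_enc' is arbitrary):

```python
# Keep only the encoder; the head that predicts the next word is discarded
learn.save_encoder('fine_tuned_enc')
```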

We use this encoder and the data bunch we created earlier (with the labels) to train our model. We’ve managed to achieve a 92% accuracy. Let’s take a look at some of the predictions.
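A sketch of the classifier stage, assuming the data_clas data bunch and the 'fine_tuned_enc' encoder from earlier; the gradual-unfreezing schedule and learning rates follow the usual ULMFiT-style recipe and are illustrative, not the article’s exact settings:

```python
# Build a classifier on the labelled data bunch and load the saved encoder
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('fine_tuned_enc')

# Train the head, then progressively unfreeze more of the network
learn_clas.fit_one_cycle(1, 2e-2)
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))

# Classify a new review (hypothetical example sentence)
learn_clas.predict("A beautifully acted film that I would happily watch again.")
```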

With a larger corpus, we could train it to understand all kinds of reviews even better. But this, right here, is still really good.

If you found this article useful, give it at least 50 claps :p

If you want to learn more about deep learning, check out my series of articles on the same topic.

~happy learning.

