Using FastAI’s ULMFiT to make a state-of-the-art multi-class text classifier

Published in Technonerds · 9 min read · Jul 26, 2019

Text classification is a classic ML problem that has been notoriously difficult to solve. However, recent developments in the field give us some hope: we might be able to classify text much better if we understand language, rather than just words. ULMFiT, by Jeremy Howard and Sebastian Ruder, gives us an incredibly powerful method to classify text using language modelling and transfer learning.

Overview

ULMFiT stands for Universal Language Model Fine-tuning for Text Classification. It is a transfer learning technique built on a language model that predicts the next word in a sentence and is pre-trained, without labels, on the WikiText-103 corpus. The model uses the AWD-LSTM architecture developed by Stephen Merity (Salesforce): multiple LSTM layers, with dropout applied to every layer (the secret sauce).

While the underlying concepts behind ULMFiT are complex and involve deeper understanding of machine learning models, the fast.ai wrapper that Howard developed makes NLP language modelling and text classification extremely easy.

ULMFiT is described in detail in Howard’s fast.ai MOOC, which you can watch here.

The Dataset

We will be using a Kaggle dataset that has 20,000 Stack Overflow question titles classified into 20 categories, as follows:

  1 wordpress
  2 oracle
  3 svn
  4 apache
  5 excel
  6 matlab
  7 visual-studio
  8 cocoa
  9 osx
  10 bash
  11 spring
  12 hibernate
  13 scala
  14 sharepoint
  15 ajax
  16 qt
  17 drupal
  18 linq
  19 haskell
  20 magento

20,000 items is a relatively small dataset, but ULMFiT is built to support text classification on training sets of this size. I have had good success (~60% prediction accuracy, and significantly higher if a prediction counts as correct when the true category is among the top two predicted) with a training set of just 1,000 items in 7 categories.
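To make the "top two predicted categories" idea concrete, here is a minimal sketch of top-k accuracy in plain Python. The function name and the probabilities are made up for illustration; they are not output from the model in this tutorial.

```python
def top_k_accuracy(probabilities, true_labels, k=2):
    """Fraction of items whose true label is among the k highest-scoring classes."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Indices of the k largest probabilities, best first.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(true_labels)

# Made-up predictions for 4 items over 3 classes (illustrative only).
example_probs = [
    [0.6, 0.3, 0.1],  # true class 0: correct at top 1
    [0.4, 0.5, 0.1],  # true class 0: correct only within the top 2
    [0.2, 0.3, 0.5],  # true class 2: correct at top 1
    [0.5, 0.1, 0.4],  # true class 1: missed even within the top 2
]
example_truth = [0, 0, 2, 1]

top1 = top_k_accuracy(example_probs, example_truth, k=1)
top2 = top_k_accuracy(example_probs, example_truth, k=2)
print(top1, top2)
```

Top-2 accuracy can never be lower than top-1 accuracy, which is why the second figure is the more forgiving one.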

The Code

I’ll be using a Google Colab notebook to make the classifier. Feel free to experiment with it — you can download it from my GitHub repository.

0. Setup and Import libraries

Open up a new Colab notebook, and start by selecting to run this code on a GPU (it’s free!). You can do this by going into the Runtime menu > Change runtime type > Hardware accelerator.

We will be using the fastai text library to classify our items.

from fastai.text import *
import pandas as pd
import numpy as np
from sklearn.feature_selection import chi2

# Optional: use this line if you want to remove Pandas'
# default truncation of long text in columns
# (on pandas 1.0 and later, pass None instead of -1)
pd.set_option('display.max_colwidth', -1)

1. Import the data

Since I am using Google Colab, I chose to upload my label_StackOverflow.txt and title_StackOverflow.txt files to my Drive, and mount my drive on Colab using these lines:

from google.colab import drive
drive.mount('/content/gdrive')

When you run this codeblock, you will be presented with an authentication link to let Colab access your drive.

Alternatively, you can choose to upload those files manually. Here’s a handy tutorial for it: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

Next, we will import the text files, load them up into Pandas dataframes and combine them into a single dataframe:

# Change the paths to point to where you stored your dataset.
text_path = 'gdrive/My Drive/Developer/Datasets/stackoverflow-dataset/title_StackOverflow.txt'
label_path = 'gdrive/My Drive/Developer/Datasets/stackoverflow-dataset/label_StackOverflow.txt'

# Each file has one item per line: a title in one, its numeric label in the other.
df_text = pd.read_csv(text_path, sep='\t', names=['text'], header=None)
df_label = pd.read_csv(label_path, sep='\t', names=['label'], header=None)

# Combine the two columns side by side into a single dataframe.
df = pd.concat([df_label, df_text], axis=1, sort=False)
print('Length of dataset: '+str(len(df.index)))
df.head()

[Optional] Update label column to show the label text itself, rather than the number:

This step is for convenience, and it should not affect your ML model at all. If you decide to run this block of code, FastAI predictions will show as the actual text, instead of just a number. I do not recommend you do this on a large dataset.

mapping = {
  1: 'wordpress',
  2: 'oracle',
  3: 'svn',
  4: 'apache',
  5: 'excel',
  6: 'matlab',
  7: 'visual-studio',
  8: 'cocoa',
  9: 'osx',
  10: 'bash',
  11: 'spring',
  12: 'hibernate',
  13: 'scala',
  14: 'sharepoint',
  15: 'ajax',
  16: 'qt',
  17: 'drupal',
  18: 'linq',
  19: 'haskell',
  20: 'magento'
}

df['label'] = df['label'].map(mapping)
df.head()

With this, you are ready to get into the meat of the code!

2. Create train & validation datasets and FastAI data bunch

Splitting our dataset into train and validation sets is an important part of setting up an ML model.

You have the ability to control what the split should be, by setting the number for test_size. In this case, by setting it to 0.3, we are doing a 70:30 training:validation split. If you have a smaller dataset, it may be helpful to increase the size of the training dataset.

from sklearn.model_selection import train_test_split

# stratify keeps the share of each label the same in both splits
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.3)
df_trn.shape, df_val.shape
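If the stratify argument is unfamiliar, the idea can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the stratified_split helper and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, val_idx) so each label has about the same share in both."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take test_size of *each* label, not of the dataset as a whole.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy, imbalanced labels (illustrative only).
toy_labels = ["bash"] * 10 + ["scala"] * 20
train_idx, val_idx = stratified_split(toy_labels, test_size=0.3)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```

Without stratification, a random 30% sample of a small dataset could by chance contain very few items of a rare category, which makes the validation accuracy for that category unreliable.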

Next, we will set up our data in the format that FastAI requires. FastAI provides simple functions to create Language Model and Classification “data bunches”.

Creating a data bunch automatically results in pre-processing of text, including vocabulary formation and tokenization.

TextLMDataBunch creates a data bunch for language modelling. In this, labels are completely ignored. Instead, data is processed so that the RNN can learn what word comes next given the words before it. Read the documentation here.
TextClasDataBunch sets up the data for classification. Labels play a key role here. We can also set the batch size for learning by changing the bs parameter. Read the documentation here.

# Language model data
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "")

# Classifier model data (reuses the language model's vocabulary)
data_clas = TextClasDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "", vocab=data_lm.train_ds.vocab, bs=32)
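The classifier data bunch is given the language model's vocabulary because both models must agree on which integer stands for which token. The following is a rough sketch of vocabulary formation and numericalization in plain Python. The helper names are hypothetical and the logic is much simpler than what FastAI does internally; the xxunk and xxpad entries are assumed here as the unknown and padding tokens.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, specials=("xxunk", "xxpad")):
    """Map tokens to integer ids, most frequent tokens first."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by descending frequency; break ties alphabetically so the order is stable.
    frequent = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + [tok for tok in frequent if tok not in specials]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token by its id; unseen tokens fall back to the xxunk id."""
    unk = stoi["xxunk"]
    return [stoi.get(tok, unk) for tok in tokens]

# Toy training texts (illustrative only).
train_tokens = [
    ["how", "to", "sort", "a", "list"],
    ["how", "to", "merge", "a", "branch"],
]
itos, stoi = build_vocab(train_tokens)
ids = numericalize(["how", "to", "rebase", "a", "branch"], stoi)
print(itos)
print(ids)
```

If the classifier built its own vocabulary, the ids would point at different rows of the embedding matrix learned by the language model, and the saved encoder would be of little use.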

You can print out a sample of the batch using this line:

data_clas.show_batch()

The xx___ tags represent aspects of language in a way that the computer can understand. The xxbos tag marks the beginning of a text (here, each question title). The xxmaj tag indicates that the first letter of the next word is capitalized.
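As a rough illustration of how such tags end up in the text, here is a simplified sketch in plain Python. It is not FastAI's tokenizer: the mark_tokens function is hypothetical, it splits on whitespace only, and the xxup tag for fully upper-case words is included as an assumption about FastAI's rules.

```python
def mark_tokens(text):
    """Lower-case a text while recording capitalization as separate tags."""
    tokens = ["xxbos"]  # marks the beginning of a text
    for word in text.split():
        if len(word) > 1 and word.isupper():
            # Whole word in capitals, e.g. "SVN" (assumed xxup tag).
            tokens += ["xxup", word.lower()]
        elif word[0].isupper() and word[1:].islower():
            # Only the first letter is a capital, e.g. "Eclipse".
            tokens += ["xxmaj", word.lower()]
        else:
            tokens.append(word.lower())
    return tokens

marked = mark_tokens("How to merge SVN branches in Eclipse")
print(marked)
```

Recording capitalization as a tag lets "How" and "how" share one vocabulary entry while keeping the information that a capital letter was there.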

With this in place, we are ready to create a language model and classify!

3. Create and Train the Language Model

Creating a language model with the aforementioned AWD-LSTM model is done using:

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

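data_lm is the language model data bunch
AWD_LSTM is the model architecture
drop_mult is a multiplier applied to all the dropout probabilities in the architecture; a lower value means less regularization.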
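A small sketch of what a dropout multiplier does: every dropout probability in the architecture is scaled by the same factor. The parameter names and base values below are assumptions chosen to resemble an AWD-LSTM configuration; treat them as illustrative rather than as FastAI's exact defaults.

```python
# Assumed base dropout probabilities for the different parts of the network.
base_dropouts = {
    "output_p": 0.1,   # dropout on the final output
    "hidden_p": 0.15,  # dropout between LSTM layers
    "input_p": 0.25,   # dropout on the embedded input
    "embed_p": 0.02,   # dropout of whole embedding rows
    "weight_p": 0.2,   # dropout on the recurrent weights
}

def scale_dropouts(dropouts, drop_mult):
    """Scale every dropout probability by the same multiplier."""
    return {name: round(p * drop_mult, 4) for name, p in dropouts.items()}

lm_dropouts = scale_dropouts(base_dropouts, drop_mult=0.3)
clas_dropouts = scale_dropouts(base_dropouts, drop_mult=0.5)
print(lm_dropouts)
print(clas_dropouts)
```

A single knob is convenient: lowering drop_mult when the model underfits, or raising it when the model overfits, adjusts all the dropout layers together.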

Next up, let’s find the optimal learning rate to train our language model on:

learn.lr_find()
learn.recorder.plot(suggestion=True)
min_grad_lr = learn.recorder.min_grad_lr

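lr_find() is a built-in fast.ai function that runs a short mock training pass, increasing the learning rate with every mini-batch and recording the loss. With suggestion=True, the plot marks the learning rate at which the loss falls most steeply (the minimum gradient), and that value is stored in learn.recorder.min_grad_lr.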
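The idea behind a learning rate finder can be sketched without any training at all. In the sketch below, synthetic_loss is a made-up stand-in for the loss a real model would report at each learning rate, and the helper names are hypothetical; FastAI's own implementation also smooths the recorded losses, which is omitted here.

```python
import math

def exponential_lrs(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Learning rates growing by a constant factor from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (num_it - 1))
    return [start_lr * ratio ** i for i in range(num_it)]

def synthetic_loss(lr):
    """Made-up loss curve: flat at first, then falling, then diverging."""
    x = math.log10(lr)
    falling = 2.0 / (1.0 + math.exp(-3.0 * (x + 3.0)))  # drop centred on lr = 1e-3
    diverging = math.exp(2.0 * (x + 0.5))               # blows up for large lr
    return 4.0 - falling + diverging

def steepest_descent_lr(lrs, losses):
    """Learning rate at which the loss drops fastest between consecutive steps."""
    slopes = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(slopes)), key=lambda i: slopes[i])
    return lrs[steepest]

lrs = exponential_lrs()
losses = [synthetic_loss(lr) for lr in lrs]
suggested_lr = steepest_descent_lr(lrs, losses)
print(f"suggested learning rate: {suggested_lr:.2e}")
```

The point of picking the steepest slope rather than the lowest loss is that the lowest loss usually sits right before the curve diverges, which is too aggressive a learning rate to train with.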

Now, let’s use this learning rate to train the language model:

learn.fit_one_cycle(2, min_grad_lr)

We can do a few more epochs after unfreezing all the layers. This process will train the whole neural network rather than just the last few layers.

# unfreezing weights and training the rest of the NN
learn.unfreeze()
learn.fit_one_cycle(2, 1e-3)
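The "one cycle" in fit_one_cycle refers to a learning rate schedule that warms up to the requested maximum and then decays to a much smaller value. Below is a minimal sketch of such a schedule. The cosine shape and the pct_start, div_factor and final_div values are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def cosine_anneal(start, end, pct):
    """Move smoothly from start (pct=0) to end (pct=1) along a half cosine."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lrs(lr_max, total_steps, pct_start=0.3, div_factor=25.0, final_div=1e4):
    """Per-step learning rates: warm up to lr_max, then decay far below the start."""
    warmup_steps = round(total_steps * pct_start)
    lr_start = lr_max / div_factor
    lr_end = lr_start / final_div
    decay_steps = max(1, total_steps - warmup_steps - 1)
    schedule = []
    for step in range(total_steps):
        if step < warmup_steps:
            schedule.append(cosine_anneal(lr_start, lr_max, step / warmup_steps))
        else:
            pct = (step - warmup_steps) / decay_steps
            schedule.append(cosine_anneal(lr_max, lr_end, pct))
    return schedule

schedule = one_cycle_lrs(lr_max=1e-3, total_steps=100)
peak_step = schedule.index(max(schedule))
print(schedule[0], peak_step, schedule[-1])
```

In this sketch the second argument passed to fit_one_cycle corresponds to lr_max, the peak of the cycle, not a constant learning rate.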

Our language model only achieved around 33% accuracy, but that is okay. This accuracy represents how well the model does at predicting the next word, given the words that come before it. And 33% means that 1 out of 3 times, the model accurately predicts the next word. Pretty impressive!

You can have some fun playing with the language model. Here, we ask the model to predict the next 10 words after “How do”:

learn.predict("How do", n_words=10)

Clearly, the sentence generated is not very meaningful, but it is grammatically accurate.

Finally, let’s save the language model encoder so that we can load it later in our classifier:

learn.save_encoder('ft_enc')

4. Using the Language Model to Train the Classifier

Creating and training the text classifier is very similar to training the language model.

Start by creating the text_classifier_learner with the data_clas DataBunch and the AWD_LSTM architecture. Then, you can load the language model encoder.

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')

Let’s again find the optimal learning rate to start with:

learn.lr_find()
learn.recorder.plot(suggestion=True)
min_grad_lr = learn.recorder.min_grad_lr

To train the classifier, we will use a technique called gradual unfreezing. We start by training only the last layer group, then work backwards, unfreezing and training the earlier groups. The learner function learn.freeze_to(-2) freezes everything except the last 2 layer groups, so those two are the ones that get trained.
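Gradual unfreezing can be pictured as moving a boundary backwards through a list of layer groups. The sketch below mimics that with plain dictionaries. The group names and the number of groups are hypothetical, and the freeze_to function here is a stand-in that only illustrates the indexing, not FastAI's implementation.

```python
def freeze_to(layer_groups, n):
    """Freeze the groups before index n and make the groups from n onward trainable."""
    for group in layer_groups[:n]:
        group["trainable"] = False
    for group in layer_groups[n:]:
        group["trainable"] = True
    return layer_groups

def trainable_names(layer_groups):
    return [g["name"] for g in layer_groups if g["trainable"]]

# Hypothetical layer groups, from the input side to the classifier head.
groups = [
    {"name": name, "trainable": False}
    for name in ["embedding", "rnn_1", "rnn_2", "rnn_3", "head"]
]

stages = {}
for n in (-1, -2, 0):  # last group only, last two groups, then everything
    stages[n] = trainable_names(freeze_to(groups, n))
    print(n, stages[n])
```

Training the head first and the earlier groups later protects the pretrained language knowledge from being overwritten by large, noisy gradients at the start of training.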

We will also use learn.recorder.plot_losses() to track our loss function over the epochs.

learn.fit_one_cycle(2, min_grad_lr)

learn.recorder.plot_losses()

learn.freeze_to(-2)
learn.fit_one_cycle(4, slice(5e-3, 2e-3), moms=(0.8,0.7))

learn.recorder.plot_losses()

Finally, let us unfreeze all layers and train the model at a low learning rate.

learn.unfreeze()
learn.fit_one_cycle(4, slice(2e-3/100, 2e-3), moms=(0.8,0.7))
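Passing a slice instead of a single number asks for different learning rates in different layer groups: small for the early, general-purpose layers and larger for the later, task-specific ones. The sketch below shows one way such a range can be spread out, assuming geometric (equal-ratio) spacing and 5 layer groups; both are assumptions for illustration, and spread_learning_rates is a hypothetical helper.

```python
def spread_learning_rates(lr_range, n_groups):
    """Geometrically space learning rates from lr_range.start to lr_range.stop."""
    start, stop = lr_range.start, lr_range.stop
    if n_groups == 1:
        return [stop]
    # Constant multiplier between consecutive groups.
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

group_lrs = spread_learning_rates(slice(2e-3 / 100, 2e-3), n_groups=5)
for i, lr in enumerate(group_lrs):
    print(f"layer group {i}: lr = {lr:.2e}")
```

With this spacing, the earliest group trains 100 times more slowly than the head, which matches the intent of training the model "at a low learning rate" once everything is unfrozen.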

We have a model with 85% accuracy. This is really good, given that we only spent about 15 min training on our data. With more epochs and better hyper-parameter tuning, it is possible to improve this score by 2–5%.

At this point, we have our text classification model!

5. Analyzing our results

You can plot the confusion matrix for our predictions on the validation set:

preds,y,losses = learn.get_preds(with_loss=True)
interp = ClassificationInterpretation(learn, preds, y, losses)
interp.plot_confusion_matrix()

The diagonal represents correct predictions, and bright blue marks boxes with high values compared to the rest. Because the bright boxes sit on the diagonal, our classifier classified most of the validation set correctly.

You can also use

interp.most_confused()

to find the categories that the classifier gets confused on the most.
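For intuition, both the confusion matrix and the "most confused" list boil down to counting (actual, predicted) pairs. Here is a minimal sketch in plain Python with made-up labels. The function names are hypothetical, and the most_confused function below is an illustrative stand-in rather than FastAI's code.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def most_confused(counts, min_val=1):
    """Off-diagonal pairs, most frequent mistakes first."""
    mistakes = [(a, p, n) for (a, p), n in counts.items() if a != p and n >= min_val]
    return sorted(mistakes, key=lambda item: (-item[2], item[0], item[1]))

# Made-up validation labels and predictions (illustrative only).
actual    = ["osx", "osx", "osx", "bash", "bash", "cocoa", "cocoa", "cocoa"]
predicted = ["osx", "bash", "bash", "bash", "osx", "osx", "cocoa", "osx"]

counts = confusion_counts(actual, predicted)
accuracy = sum(n for (a, p), n in counts.items() if a == p) / len(actual)
confused = most_confused(counts)
print(accuracy)
print(confused)
```

Pairs where a == p are the diagonal of the confusion matrix; everything else is a mistake, and sorting those by count shows which categories the model mixes up most often.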

6. Predictions!

We are finally ready to use our model to predict the category of any sentence:

learn.predict("homebrew not working")

Whoo Hoo! We have a working model to classify StackOverflow questions into multiple categories.

7. Export the model

We can export and use our model using this:

learn.export()

You will find the export.pkl file in your Colab Files tab.

We can use this exported model in our own Flask API. For the sake of keeping this article short, I will not be covering it in this tutorial, but I am in the process of writing another article about it which I will link as soon as it is published. Stay tuned!

Using the FastAI library is one of the easiest ways to train a state-of-the-art text classifier. While this tutorial is a good start, make sure you play around with the training hyperparameters to get the best out of your model. There is so much potential to uncover with FastAI, and I am excited to put down more tutorials on this topic.

Thanks for reading!

GitHub: https://github.com/aditya10/ULMFiT-fastai-text-classifier