Transfer Learning for Text Classification

Vikas Pandey
Published in Naukri Engineering
Jul 6, 2020


In recent years, there have been some breakthrough advances in Deep Learning for NLP. In this blog we will talk about one of those cutting-edge advances, which we recently used in one of our own applications: a text classification algorithm called ULMFiT, whose key contribution is a set of techniques for doing transfer learning for text classification really effectively.

ULMFiT stands for Universal Language Model Fine-tuning for Text Classification. Its main innovation is a recipe for fine-tuning an entire pre-trained architecture end to end, and it demonstrated the value of this approach on text classification tasks.

Transfer Learning

So, what is transfer learning and why is it important?

As per Wikipedia: “Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.”

Thus, with transfer learning we can reuse knowledge that we (or someone else) built while solving a different but related problem in order to solve our own problem better. Transfer learning has been very successful in Computer Vision for many years now: for example, a model trained on ImageNet classification, which has 1,000 classes, can be reused as the starting point for other image classification problems.

In NLP, some form of transfer learning has been common, but effective end-to-end fine-tuning of a whole architecture wasn't possible until now. For example, when we use pre-trained word vectors in our application we are using transfer learning, but it is a limited form of it.

In many problems the amount of labelled data is small and/or collecting labelled data is costly. Transfer learning helps us get better performance with relatively little labelled data, and for the same amount of labelled data it helps us reach state-of-the-art results.

For example, the ULMFiT authors showed that with their transfer learning approach they could achieve the same results using 100x less labelled data. Using the full data they achieved state-of-the-art performance, reducing the error by 18–24% relative to existing benchmarks on several datasets.

(Figure adapted from Pan and Yang, 2010.)

Application

We have implemented this algorithm in Universal Crawler to identify pages containing jobs. Universal Crawler is an automated generic crawler which, given the URL of a website, fetches all the pages from that website, identifies the pages that contain jobs, and then parses those pages to extract job content. The page classifier lies at the heart of Universal Crawler, since it helps us identify the relevant pages, which are usually less than 1% of all the pages fetched.

The requirements on this classifier are very stringent: we need high precision as well as high recall. If precision is low, many non-job pages will have to be manually filtered out at publishing time; if recall is low, we will miss job pages that could have been identified, parsed and published.

Implementation

First of all, load all the required dependencies:

from fastai import *
from fastai.text import *
from fastai.callbacks import *

Steps to be followed:

  1. Pre-Processing
  2. Build Databunch for Language Model
  3. Fine-tune Language Model (ULMFiT)
  4. Build Databunch for Classifier
  5. Train Text Classifier
  6. Visualize result

Pre-Processing

Dataset: Our dataset consists of web pages fetched by the crawler, each manually annotated as containing no jobs, one job or multiple jobs. The pre-processing steps normalize and clean the data as listed below (a small sketch of these steps follows the list):

  • Remove non-ASCII characters
  • Add spaces around special characters
  • Replace multiple consecutive spaces with a single space
  • Convert decimal numbers into a ‘numberval’ token to minimize the number of unique tokens in the train dataset
  • Store train and test data in train and test folders respectively; within the train and test folders, pages of each class are kept in their respective sub-folders
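The exact cleaning code is not part of this post; the following is a minimal sketch of these normalization steps using Python's re module (the function name and the ordering of the steps are illustrative, not the production implementation):

import re

def preprocess(text):
    # Drop non-ASCII characters
    text = text.encode('ascii', errors='ignore').decode('ascii')
    # Replace decimal numbers with a placeholder token to shrink the vocabulary
    text = re.sub(r'\d+(\.\d+)?', ' numberval ', text)
    # Add spaces around special characters so they become separate tokens
    text = re.sub(r'([^\w\s])', r' \1 ', text)
    # Collapse multiple consecutive spaces into a single space
    return re.sub(r'\s+', ' ', text).strip()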

Build Databunch

A Databunch is the final data object, produced after all the pre-processing steps, that is fed to the model. We train two models to build the text classifier: a language model and a classification model, and we build a separate Databunch for each. Below are the steps to create a Databunch for the language model.

  • TextList.from_folder() — Builds a list of text files; each document is a separate text file.
  • filter_by_folder() — Tells it where to look; here we make sure we include only the train and test folders.
  • split_by_folder() — Splits the data into training and validation sets based on the folder each file is in.
  • label_for_lm() — Defines the labels. A language model effectively has its own labels: the text itself is the label, and label_for_lm() handles that for us.
  • databunch() — Creates the data bunch, which we then save. Tokenization and numericalization take a few minutes.

path = Path('./directory containing train test folders')

# Create Databunch
data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'test'])
           .split_by_folder(valid='test')
           .label_for_lm()
           .databunch(bs=64))

# Save Databunch
data_lm.save('data_lm.pkl')

Fine-tune Language Model (ULMFiT)

ULMFiT achieves state-of-the-art results using novel techniques such as:

  • Discriminative fine-tuning: Different layers' weights are updated at different rates. Layers towards the end need larger updates, so they are trained with higher learning rates; layers towards the start need smaller updates, so they are trained with lower learning rates.
  • Slanted triangular learning rates: The learning rate increases linearly for an initial fraction of the iterations and then decreases linearly (a small sketch of this schedule follows the next paragraph).
  • Gradual unfreezing: We first tune only the last layer. Once done, we tune the last two layers, then the last three, and so on.

This method involves fine-tuning a language model (LM) pre-trained on the WikiText-103 dataset to the new dataset in such a way that it does not forget what it previously learned.
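For intuition, here is a small sketch of the slanted triangular schedule as defined in the ULMFiT paper (cut_frac, ratio and eta_max are the paper's hyperparameters; in practice fit_one_cycle handles the scheduling for us, so this function is purely illustrative):

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    # t: current iteration, T: total number of training iterations
    cut = int(T * cut_frac)  # iteration at which the learning rate peaks
    if t < cut:
        p = t / cut  # linear increase phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decrease phase
    return eta_max * (1 + p * (ratio - 1)) / ratio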

data_lm = load_data(path, 'data_lm.pkl', bs=64)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

learn.lr_find()
learn.recorder.plot(skip_end=15, suggestion=True)

This prints the suggested learning rates:

Min numerical gradient: 2.75E-02
Min loss divided by 10: 2.51E-02

The one-cycle policy (fit_one_cycle) with cyclical momentum allows the model to be trained at higher learning rates and converge faster; it also provides a form of regularization.

# Update the weights of the last layer group of the neural net
learn.fit_one_cycle(1, 2.75e-02, moms=(0.8,0.7))

learn.lr_find()
learn.recorder.plot()

# Unfreeze the last 2 layer groups and update their weights
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

# Unfreeze all layers and update their weights
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7), callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy')])

learn.save('lang_model')

# Fine-tuning of the language model is now complete.
# Save the language model encoder; the classifier must use the same encoder.
learn.save_encoder('lm_fine_tuned_enc')
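As an optional sanity check (not part of the original pipeline), the fine-tuned language model can be asked to generate text; if it produces job-page-like phrases, the fine-tuning has adapted it to our domain. The seed phrase below is just an example:

# predict the next 20 words after a seed phrase
print(learn.predict('apply for this position', n_words=20))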

Build Databunch for classifier

Now we're ready to create our classifier. Step one, as usual, is to create a data bunch, and we do essentially the same thing as for the language model.

  • TextList.from_folder(path, vocab=data_lm.vocab) — We must make sure the classifier uses exactly the same vocab as the language model. If word number 10 was ‘the’ in the language model, word number 10 must also be ‘the’ in the classifier; otherwise the pre-trained model would be meaningless. That is why we pass in the vocab from the language model, so this data bunch has exactly the same vocab. This is an important step.
  • split_by_folder() — As with the language model, we split into training and validation sets by folder, so the held-out pages stay untouched during training.
  • label_from_folder() — This time we do not label for a language model; we label each document with its class ([‘1’, ‘2’], taken from the folder names).
  • databunch() — Finally, create the data bunch.

data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             # grab all the text files in path, reusing the language model vocab
             .split_by_folder(valid='test')
             # split into train and validation sets by the train and test folders
             .label_from_folder(classes=['1', '2'])
             # label each document with the name of its folder
             .databunch(bs=64))

# Save Databunch
data_clas.save('data_clas.pkl')

Train Classifier

Training the classifier involves the same strategy as training the language model. We again follow the methods of Discriminative fine-tuning, Gradual unfreezing and Slanted triangular learning rates to learn a good model.

data_clas = load_data(path, 'data_clas.pkl', bs=32)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('lm_fine_tuned_enc')

learn.lr_find()
learn.recorder.plot()

# The training steps for the classifier are the same as for the language model
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

learn.freeze_to(-2)
learn.fit_one_cycle(2, slice(1e-3,1e-2), moms=(0.8,0.7), callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy')])

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7), callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy')])

With this, we have completed our text classification training steps.

Visualize result

from fastai.widgets import ClassConfusion

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
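Because the classifier has strict precision and recall requirements, it is useful to compute both directly from the confusion matrix. A small sketch, assuming the job class is the second class (index 1) in the matrix returned by ClassificationInterpretation:

cm = interp.confusion_matrix()  # rows = actual classes, columns = predicted classes
tp = cm[1, 1]  # job pages correctly predicted as job pages
fp = cm[0, 1]  # non-job pages predicted as job pages
fn = cm[1, 0]  # job pages predicted as non-job pages
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f'precision={precision:.3f}, recall={recall:.3f}')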

To use this model in production, first export it:

learn.export()
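learn.export() writes an export.pkl file to the learner's path. In production the model can then be loaded and queried roughly as follows (page_text stands in for the pre-processed content of a crawled page):

loaded = load_learner(path)  # loads export.pkl from the given directory
pred_class, pred_idx, probs = loaded.predict(page_text)
print(pred_class, probs[pred_idx])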

References

fastai library: https://github.com/fastai
fast.ai course, Lesson 3: https://course.fast.ai/videos/?lesson=3
Howard & Ruder, Universal Language Model Fine-tuning for Text Classification: https://arxiv.org/abs/1801.06146
