A tour of awesome features of spaCy (part 2/2)

Nuszk · Published in Eliiza-AI · May 30, 2019

In the first part of this overview of spaCy we went over the features of the large English pretrained model that spaCy comes with. In this part I would like to discuss using pretraining to transfer learned representations to downstream machine learning tasks. Full code for the post can be found on GitHub. If you are new to NLP, you might want to read this intro first.

The plan for this post is to train a TextCategorizer, first pretraining the model on the text from the training data. Pretraining gives us weights that we can use to initialise the pipeline component we want to train later. We will use the vectors-only model for pretraining, so run the following command in a terminal to download it.

python -m spacy download en_vectors_web_lg

Even though pretrained models do not come with a TextCategorizer component, spaCy has a native spot reserved for it with the relevant infrastructure already in place. I will train a TextCategorizer pipeline component using a dataset that I derived from the Medium tags data published on Kaggle. The dataset consists of 23,833 Medium story titles, each tagged 1 or 0 depending on whether the story is related to ML/AI or not. The classes are balanced. One third of the titles went into the training set and the rest into the test set. I used the training set titles (over 120,000 words in total) for pretraining.

Pretraining

The pretraining command was added recently with the release of spaCy 2.1 and allows the use of Language Modelling with Approximate Outputs (LMAO): the ‘token-to-vector’ (tok2vec) layer of pipeline components is pretrained with a language-modelling objective of predicting word vectors rather than words.

Following the spaCy 2.1 release notes I ran two pretraining jobs. The first produces weights that can be used with any model that does not use vectors, such as en_core_web_sm or a blank model; the second, run with --use-vectors, produces weights for use with en_core_web_md/lg.

python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-model
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-model-vecs --use-vectors

The texts.jsonl file has to be in newline-delimited JSON format, with {"text": "training text"} on each line. I used the jsonlines library for this, while spaCy’s website recommends srsly.
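For example, here is a minimal sketch of writing the file with srsly, assuming train_titles holds the training-set title strings (the variable name is mine, not from the original code):

import srsly

# train_titles is assumed to be a list of the training-set title strings
lines = ({"text": title} for title in train_titles)
srsly.write_jsonl("texts.jsonl", lines)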

I used the default parameters, which can be checked with the command below. These parameters align with the 'simple_cnn' architecture that is used for training the text categorizer after pretraining. The default number of iterations is 1000, but interim models are saved during pretraining, so below I report results both for the final models and for models after only 50 iterations.

python -m spacy pretrain -h

Training

Pretraining produces weights that can be used to initialise a model for training. There are a few steps to set up a training job. I started with the code from the spaCy TextCategorizer user guide and adapted it as needed. First we have to load the model we are going to train, create the pipeline component to train, add it to the pipeline, and add the categories to train for.
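A rough sketch of that setup with the spaCy 2.x API; the single "ML_AI" label is my own choice for the binary task and not necessarily what the original code uses:

import spacy

# load the model to train; a blank model would be spacy.blank("en") instead
nlp = spacy.load("en_core_web_lg")

# create the TextCategorizer with the architecture matching the pretraining defaults
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": False, "architecture": "simple_cnn"},
)
nlp.add_pipe(textcat, last=True)

# a single binary category whose score stands in for the 1/0 tag
textcat.add_label("ML_AI")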

My training and test sets come in the format below, but the data has to be a list of (text, category dictionary) pairs, so we do a bit of preprocessing.
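As an illustration, assuming the raw data is an iterable of (title, tag) pairs with tag being 1 or 0 (the names below are hypothetical):

def make_examples(rows):
    # rows: iterable of (title, tag) pairs, where tag is 1 for ML/AI and 0 otherwise
    return [(title, {"cats": {"ML_AI": float(tag)}}) for title, tag in rows]

train_data = make_examples(train_rows)
test_data = make_examples(test_rows)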

Now we are ready to train the TextCategorizer. We just disable the other pipeline components as we train.
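A condensed sketch of the training loop, adapted from the spaCy 2.x TextCategorizer user guide; the line loading the pretrained tok2vec weights follows the pattern shown around the spaCy 2.1 release, and the weights file name is only a placeholder:

import random
from spacy.util import minibatch, compounding

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # train only the textcat component
    optimizer = nlp.begin_training()

    # initialise the tok2vec layer with pretrained weights (placeholder file name)
    with open("pretrained-model/model999.bin", "rb") as file_:
        textcat.model.tok2vec.from_bytes(file_.read())

    for i in range(5):  # five training iterations, matching the evaluation below
        losses = {}
        random.shuffle(train_data)
        batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        print(i, losses["textcat"])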

I trained six models: three blank models and three en_core_web_lg models. In each group, one model was trained without pretraining, one with the corresponding pretrained weights after 50 iterations and one with the corresponding pretrained weights after 1000 iterations. I then evaluated the training loss and the accuracy, precision, recall and F1 scores on the test set after each of the five training iterations. Here I report only the loss, accuracy and F1 scores, but the full results as well as the full training code can be found on GitHub.

We notice the following:

  1. There is a small but consistent improvement with pretraining for both the blank model and en_core_web_lg model.
  2. Doing only 50 iterations of pretraining does not seem to have as much effect for the large model as for the blank model.
  3. Pretraining improves the blank model more than the large English model.
  4. Relative to the blank model without pretraining, switching to the larger model alone and adding pretraining alone give comparable improvements.
  5. The best model is the large English model with pretraining.
  6. The scores decrease over iterations, which suggests that the models are overfitting the training data; the training loss supports this as well.

Considering that our pretraining data was not very large, only about 120,000 words, and that it was given as sentence-sized chunks (titles, which might not even be full sentences) rather than the recommended paragraph-sized chunks, pretraining with 1000 iterations results in a reasonable improvement.

Conclusion

spaCy is a fast and flexible NLP library. We can add a new pipeline component, replace existing ones or fill in one of the native empty component spots. One such spot is reserved for a TextCategorizer, with the related infrastructure already in place. Even though the documentation is a little hard to navigate when it comes to training and pretraining, there are examples out there to learn from and follow. While each of the pretraining jobs took about 5.5 hours (~6,000 words per second) on a four-core CPU, LMAO is still feasible to experiment with. The pretraining, done with only about 120,000 words, resulted in a slight but consistent improvement in TextCategorizer performance.
