Explosion AI just released brand-new nightly builds of spaCy, their natural language processing toolkit. I have been a huge fan of this package for years, since it allows rapid development and makes it easy to build applications that deal with naturally written text.
There have been plenty of great articles showcasing spaCy’s API for the 2.0+ releases. The recent API changes break most of those tutorials under the newly released spaCy v3. I like the changes and want to show how simple it has become to train a text classifier with very few lines of code.
In the first step, we need to install some packages:
pip install spacy-nightly
pip install ml-datasets
python -m spacy download en_core_web_md
Ml-Datasets is a curated repository of datasets from Explosion AI that also comes with a simple way to load the data.
We will use this library to get data to train our classifier.
Let's build a classifier
The full code is also available in this GitHub repository:
SpacyV3 Text Categorizer Tutorial
We need to set up everything first:
import spacy

# tqdm is a great progress bar for Python
# tqdm.auto automatically selects a text-based progress bar
# for the console
# and HTML-based output in Jupyter notebooks
from tqdm.auto import tqdm

# DocBin is spaCy's new way to store Docs in a
# binary format for training later
from spacy.tokens import DocBin

# We want to classify movie reviews as positive or negative
from ml_datasets import imdb

# load movie reviews as (text, label) tuples
train_data, valid_data = imdb()

# load a medium-sized English language model in spaCy
nlp = spacy.load("en_core_web_md")
Then, we need to turn the texts and the labels into neat spaCy Doc objects.
def make_docs(data):
    """
    this will take a list of texts and labels
    and transform them into spaCy documents
    data: list(tuple(text, label))
    """
    docs = []
    # nlp.pipe([texts]) is way faster than running
    # nlp(text) for each text
    # as_tuples allows us to pass in a tuple,
    # the first one is treated as text
    # the second one will get returned as it is.
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        # we need to set the (text)cat(egory) for each document
        doc.cats["positive"] = label
        # put them into a nice list
        docs.append(doc)
    return docs
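The value stored in doc.cats should be a number between 0 and 1. If the loader hands back string labels instead, a small conversion helper can sit in front of the assignment. This is only a sketch: the label strings "pos" and "neg" are an assumption about the dataset, and label_to_score is a hypothetical helper name, so check a few examples of your data first.

```python
def label_to_score(label):
    """Turn a raw label into a float score for doc.cats["positive"].

    Assumption: labels are either strings such as "pos"/"neg"
    or truthy/falsy values such as 1/0 or True/False.
    """
    if isinstance(label, str):
        return 1.0 if label.strip().lower() in {"pos", "positive"} else 0.0
    return 1.0 if label else 0.0


print(label_to_score("pos"))
print(label_to_score("neg"))
print(label_to_score(1))
```

Inside the loop, the assignment would then read doc.cats["positive"] = label_to_score(label).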
Now we only need to transform our data and store it as a binary file on disk.
# we are so far only interested in the first 5000 reviews
# this will keep the training time short.
# In practice take as much data as you can get.
# you can always reduce it to make the script even faster.
num_texts = 5000

# first we need to transform all the training data
train_docs = make_docs(train_data[:num_texts])
# then we save it in a binary file to disk
# (the ./data folder must already exist)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")

# repeat for validation data
valid_docs = make_docs(valid_data[:num_texts])
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")
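Since we only keep the first num_texts examples, it is worth a quick look at whether both classes are still represented in that slice. A minimal sketch, using a made-up toy list in place of train_data and assuming the data is a list of (text, label) tuples; label_counts is a hypothetical helper:

```python
from collections import Counter


def label_counts(data, limit=None):
    """Count how often each label occurs in the first `limit` examples."""
    subset = data if limit is None else data[:limit]
    return Counter(label for _text, label in subset)


# toy stand-in for train_data, not real IMDB reviews
toy_data = [
    ("great film", "pos"),
    ("boring", "neg"),
    ("loved it", "pos"),
    ("awful", "neg"),
]
print(label_counts(toy_data, limit=3))
```

If one label dominates the slice, shuffling the data before slicing is a simple way out.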
Next, we need to create a configuration file that tells spaCy what it is supposed to learn from our data.
Explosion AI created a tool to quickly generate a base configuration file: https://nightly.spacy.io/usage/training
In our case, we choose “Textcat” under components, CPU-preferred under hardware, and efficiency under “Optimize-for”. spaCy usually provides sane defaults for each parameter. They won't be the best parameters for your problem, but they will work just fine for most data.
We save the result as base_config.cfg and change the paths for our training and validation data in its [paths] section:
train = "data/train.spacy"
dev = "data/valid.spacy"
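Before starting a training run, it can save some time to check that both files actually exist. The sketch below reads only the [paths] section with Python's configparser and reports entries that are unset or missing on disk. The helper names are hypothetical, and the config text is a shortened stand-in for the real base_config.cfg, which has many more sections.

```python
import configparser
import json
from pathlib import Path


def configured_paths(config_text):
    """Return the [paths] section of a spaCy-style config as a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # values are written JSON-style: "quoted" strings or null
    return {key: json.loads(value) for key, value in parser["paths"].items()}


def missing_files(paths, keys=("train", "dev")):
    """List the keys whose path is unset or does not exist on disk."""
    missing = []
    for key in keys:
        value = paths.get(key)
        if value is None or not Path(value).exists():
            missing.append(key)
    return missing


# shortened stand-in for base_config.cfg
example = """
[paths]
train = "data/train.spacy"
dev = "data/valid.spacy"
"""

paths = configured_paths(example)
print(paths)
print("missing:", missing_files(paths))
```

For a real run, the text would come from Path("base_config.cfg").read_text() instead of the inline example.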
In the next step, we need to turn our base configuration into a full configuration. spaCy will automatically fill all missing values with their defaults:
python -m spacy init fill-config ./base_config.cfg ./config.cfg
Finally, we can fire up the training in the CLI:
python -m spacy train config.cfg --output ./output
During training, spaCy regularly prints a line with the current loss and score. The loss tells us how big the mistakes of the classifier are, and the score tells us how often the binary classification is correct.
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.25       48.82    0.49
  2    5600          0.00          1.91       92.54    0.93
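To make the two numbers a bit more tangible, here is a toy calculation of a loss and a score on four made-up predictions. It only illustrates the idea: the mean squared error and the plain accuracy used here are stand-ins I picked for the example, and spaCy's own loss and scorer may be computed differently.

```python
def mean_squared_error(scores, labels):
    """Average squared distance between predicted scores and true labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def accuracy(scores, labels, threshold=0.5):
    """Share of examples where the thresholded score matches the label."""
    hits = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


# made-up predictions and labels (1 = positive, 0 = negative)
scores = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]

print("loss:", mean_squared_error(scores, labels))
print("accuracy:", accuracy(scores, labels))
```

The first two predictions land on the right side of 0.5 and the last two do not, so the accuracy is 0.5, while the loss also reflects how far off each score was.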
Running the classifier on our own input
The trained model is saved in the “output” folder. Once the script is done, we can load the “output/model-best” model and check the prediction for new inputs:
import spacy

# load the best model from training
nlp = spacy.load("output/model-best")

print("type 'quit' to exit")
# predict the sentiment until someone writes quit
while True:
    text = input("Please enter example input: ")
    if text == "quit":
        break
    doc = nlp(text)
    if doc.cats["positive"] > 0.5:
        print("the sentiment is positive")
    else:
        print("the sentiment is negative")
We did not use any pre-trained vectors for this text classifier, and we probably won't get representative scores for “how good” the review is. We only get a binary answer: if the score for the input text is greater than 0.5, the sentiment is considered positive.
If we enter a text that is different from the data we trained the classifier on, the output could make no sense:
Please enter example input: i hate mondays
the sentiment is positive
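The decision itself is just a threshold on the score in doc.cats. The sketch below wraps that rule in a function and adds an optional band around the threshold in which we answer "uncertain" instead of forcing a decision. The function name and the margin parameter are my own additions for illustration, not part of spaCy.

```python
def sentiment_label(cats, threshold=0.5, margin=0.0):
    """Map a cats dict such as {"positive": 0.93} to a readable label.

    Scores closer to the threshold than `margin` are reported
    as "uncertain" (hypothetical extension of the 0.5 rule).
    """
    score = cats["positive"]
    if abs(score - threshold) < margin:
        return "uncertain"
    return "positive" if score > threshold else "negative"


print(sentiment_label({"positive": 0.93}))
print(sentiment_label({"positive": 0.2}))
print(sentiment_label({"positive": 0.55}, margin=0.1))
```

With margin=0.0 the function behaves exactly like the if/else in the script above.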
Steps to improve the classifier from here:
1. Train on more data:
We only used 5000 texts, which is only a fifth of the whole corpus. We can change our script easily to get more examples. We could even try to get data from different resources or scrape rating websites ourselves.
2. Train for more steps:
Currently, our script stops either after 1600 training steps without finding a better “solution” on the validation data or after 20,000 steps in total. In our case, a step consists of a forward pass, which makes a prediction, and a backward pass, which corrects the neural network so that the error between the prediction and the label (the loss) gets smaller. We can increase the values of patience, max_steps, and max_epochs and see if the optimizer can find better weights for our network later in the training.
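The interplay of patience and max_steps can be sketched in a few lines. This is a simplified model of the stopping rule as described above, not spaCy's actual training loop: the function name is hypothetical, and I assume the validation score is checked every eval_frequency steps.

```python
def stopping_step(scores_by_eval, eval_frequency, patience, max_steps):
    """Return the step at which training would stop.

    scores_by_eval: validation scores, one per evaluation, in order.
    Training stops when max_steps is reached or when the best score
    has not improved for at least `patience` steps.
    """
    best_score = float("-inf")
    best_step = 0
    step = 0
    for score in scores_by_eval:
        step += eval_frequency
        if score > best_score:
            best_score, best_step = score, step
        if step >= max_steps:
            return step
        if step - best_step >= patience:
            return step
    return step


# made-up validation scores; the best one is reached at step 400
scores = [0.50, 0.80, 0.79, 0.78, 0.80, 0.77]
print(stopping_step(scores, eval_frequency=200, patience=600, max_steps=20000))
```

Here the score peaks at step 400 and never improves again, so with a patience of 600 the run ends at step 1000. A larger patience gives the optimizer more time to recover from such a plateau.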
3. Use pre-trained word vectors:
By default, the training in spaCy uses a Tok2Vec layer. It uses surface features of the word, such as its shape, to generate a vector on the fly. The advantage is that it can handle previously unseen words and come up with a numerical representation for them. The disadvantage is that such an embedding does not directly represent the meaning of the word.
Pre-trained word vectors are numerical representations for each token that are derived from large amounts of text and try to embed the meaning of a word in a high-dimensional space. This can help to group semantically similar words together.
4. Use a transformer model:
The Transformer is a “newer” architecture that incorporates the context of a word into its embedding. spaCy v3 now supports models like BERT, which can help to boost the performance even further.
5. Detect outliers in the input:
We trained our network on movie reviews. That does not mean that the model can tell you if a cooking recipe is good or bad. We might want to check whether we should make a prediction on the input at all, or instead report that the input is too different from the training data to make a meaningful prediction. Vincent D. Warmerdam has given some great talks on this topic, such as “How to Constrain Artificial Stupidity”.
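One very naive way to get a feeling for this is to count how many words of an input never occurred in the training texts and to refuse a prediction when that share is too high. To be clear, this is a heuristic I made up for illustration, not the approach from the talk: the function names, the whitespace tokenization, and the 0.5 cut-off are all assumptions.

```python
def unknown_token_ratio(text, known_vocab):
    """Share of whitespace-separated tokens not seen during training."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0  # treat empty input as completely unknown
    unknown = sum(1 for token in tokens if token not in known_vocab)
    return unknown / len(tokens)


def should_predict(text, known_vocab, max_unknown=0.5):
    """Only predict if enough of the input looks familiar."""
    return unknown_token_ratio(text, known_vocab) <= max_unknown


# toy stand-in for the training texts
training_texts = ["this movie was great", "the plot was boring"]
vocab = {token for text in training_texts for token in text.lower().split()}

print(should_predict("the movie was great", vocab))
print(should_predict("whisk eggs and sugar", vocab))
```

The first input consists only of known words, the second only of unknown ones, so only the first would be passed on to the classifier.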
I am also looking forward to the upcoming videos by Ines and Sofie, which will bring more insights into the way spaCy v3 can be used.