Using spaCy and Prodigy to train an Entity Recognition Model

Jared Delora-Ellefson · Published in Analytics Vidhya · Jun 15, 2020 · 9 min read

Note: This article assumes familiarity with NLP, Word Vectors, etc.

I love Natural Language Processing (NLP). I find humans fascinating and since humans are largely a jumble of language, I’m drawn to studying how they use it, seeing what sense I can pull from it.

When I was first introduced to NLP, I learned it via NLTK and scikit-learn, both of which left me unimpressed. At one point I remember asking, “wait a second, we’re just parsing strings and counting words? That’s it?” TfidfVectorizer from scikit-learn made it a little more interesting, but I wanted to do more. Enter spaCy and Prodigy.

spaCy, an open-source Python library, has become my go-to for NLP. Although older libraries like NLTK continue to hang on, what spaCy offers far exceeds their capabilities.

One of the real difficulties with NLP is that it’s really hard to know from the outset whether your experiment will be successful. With Prodigy and spaCy, if I have a good set of data formatted correctly and a good idea of what I want to do, I can find out whether an approach will work in about an hour: create around 500 context-driven Entity annotations, use those to train a model on more than 500 examples of text, and then run metrics to evaluate how well it performs. That’s pretty incredible, and it allows me to try out ideas that previously would have taken far too much time to justify.

Prodigy is a wrapper for spaCy, created by the developers of spaCy. Although it is not free, if you are serious about NLP it’s more than worth the cost, especially since the money goes directly toward the further development of spaCy.

To give you a taste of spaCy and Prodigy, I’ll provide a brief overview of a project I did last night in about an hour. A number of libraries and data packages are needed to follow along; these can all be downloaded from this Prodigy GitHub page, which outlines a tutorial that’s also available online. I recommend checking it out.

Figure 1: The spaCy NLP Pipeline (image from spacy.io)

The spaCy pipeline is composed of a number of components that can be used or deactivated. In our case we are going to be creating an Entity Recognition Model, so we will be training the ner component of the spaCy pipeline.
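
For instance, here’s a minimal sketch of loading a pretrained pipeline and switching off everything except ner. This assumes the en_core_web_sm model is installed, and the sample sentence is invented:

import spacy
# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']
# Temporarily disable every component except the entity recognizer
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    doc = nlp("FEMA staged the hurricane evacuation, one post claims.")
    for ent in doc.ents:
        print(ent.text, ent.label_)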

Typically with an NLP model, the better you can conceptualize what you are looking for, the better you can train an algorithm to find it. Problems that involve searching for concrete objects (e.g., food types or car types) tend to be the easiest to train for. A good rule of thumb: the better you can measure something, the better you can model it.

I am very interested in trying to detect hoaxes and disinformation in online social media. This is something that is very hard to measure or quantify; hoaxes and disinformation can be very nuanced. When you or I read something, we bring along all of our experiences to contextualize whatever it is we are reading. A computer isn’t able to conceptualize topics the way our brains can. We have to teach it to do this, and doing so is often very challenging, but these are the fun problems.

File Preparation

Figure 2: .jsonl Training Text File

spaCy requires all files be formatted as .jsonl. Fig. 2 shows an example of a .jsonl-formatted training file for spaCy. It is composed of one dictionary per line, and each dictionary must have a “text” key whose value is the training text. Other formatting options are available, but this is the bare minimum required.
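
If you’re assembling such a file yourself, a few lines of Python will do it. A minimal sketch — the titles here are invented placeholders, not lines from the real dataset:

import json
# Invented placeholder titles; the real file holds 3000 r/Conspiracy titles
titles = [
    "Camels are conspiring to control the world's water supply",
    "Leaked memo proves the flood warnings were a hoax",
]
with open("conspiracy_3000.jsonl", "w", encoding="utf-8") as f:
    for title in titles:
        f.write(json.dumps({"text": title}) + "\n")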

The above data is composed of 3000 Reddit post titles taken from the r/Conspiracy subreddit. When you train an NLP model, you want to teach the algorithm what the signal looks like. The better I understand what the signal looks like, the better I can teach an algorithm to look for it. Since I am trying to build a model to detect misinformation, I figured a great place to get a strong signal would be on a subreddit devoted to spreading misinformation. We’ll see that even in this environment, capturing the signal of misinformation is still really challenging.

We will be using the sense2vec word vector package. sense2vec is a twist on word2vec: instead of learning one vector per word, it learns vectors for words tagged with their part-of-speech or entity labels, so each distinct “sense” of a word gets its own representation and carries more semantic context.
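
Before annotating, it’s worth poking at the vectors directly to see what they encode. A quick sketch using the sense2vec library — keys pair a phrase with a sense tag such as |NOUN, and the hoax query is just an illustration:

from sense2vec import Sense2Vec
# Load the same vector package we hand to Prodigy below
s2v = Sense2Vec().from_disk("./s2v_reddit_2015_md")
query = "hoax|NOUN"  # keys look like "phrase|SENSE"
if query in s2v:
    for key, score in s2v.most_similar(query, n=5):
        print(key, score)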

The modeling process is as follows:

  • Create seed terms. We use these as a starter for creating our Entity annotations.
  • Gather our Entity annotations using Prodigy and save them to a .jsonl file.
  • Use our Entity annotations to train the ner portion of the spaCy pipeline. We train the model using the actual text we are analyzing, in this case the 3000 Reddit submission titles.
  • Pickle our model and evaluate it with a train/test split.
  • Run a Train-Curve test to see if the model will benefit from further training.
  • Decide whether we are satisfied with the model metrics and if continuing on the current path is justified.

Creating Entity Annotations

Figure 3: Prodigy sense2vec.teach Method Usage (image from prodi.gy)
$ prodigy sense2vec.teach hoax_terms ./s2v_reddit_2015_md --seeds "conspiracy, planes, hoax, sharks, ants, lies, rumor, floods, hurricane, evacuation, fake"

Prodigy has a number of very useful methods that make interacting with spaCy seamless. We will use the sense2vec.teach method, which follows the syntax of the terms.teach method. The dataset argument (here hoax_terms) names the database this method will create and store Entity annotations in. The vectors argument specifies which word vectors to use; in our case we pass the ./s2v_reddit_2015_md package, which contains sense2vec vectors trained on roughly one billion words of Reddit text from 2015. Later, when training, we will also use weights that spaCy pretrained to predict the en_vectors_web_lg vectors (which we will use); whenever such pretrained weights are used, the vectors they were trained on must be used as well.

The seeds argument gives Prodigy a place to start. It’s like saying to Prodigy, “I’m looking for something kinda like this…”, after which Prodigy focuses in on those words to search for the Entity examples we want to capture.

After executing the above command you can then use your web browser for annotating. While using the Prodigy browser tool we are in an active learning loop: Prodigy suggests a word and we tell it whether it matches the Entity we are looking for. When we accept a suggestion, the tool learns from it and offers words closer and closer to the Entities we want to identify. This makes collecting annotation examples very fast.

After annotating in the browser, Prodigy saves the Entity examples you’ve identified to the database you’ve specified. This is a great feature because it allows these datasets to be reused in the future.
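
If you later want to pull those annotations back out in Python, Prodigy exposes its database directly. A minimal sketch using Prodigy’s Python database API, assuming the default SQLite setup:

from prodigy.components.db import connect
db = connect()  # connects to Prodigy's default SQLite database
print(db.datasets)  # all saved datasets, e.g. ['hoax_terms']
examples = db.get_dataset("hoax_terms")  # the accepted/rejected term annotations
print(len(examples), "annotations collected")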

Creating a .jsonl File of our Entity Examples

Figure 4: Prodigy terms.to-patterns Method Usage
$ prodigy terms.to-patterns hoax_terms --label HOAX --spacy-model blank:en > ./hoax_patterns.jsonl

Now that the Entity examples are stored in our database, we can use the terms.to-patterns method to create a .jsonl-formatted file of our Entity annotations. Since we are creating a model from scratch, we pass the method a blank language model (blank:en). We now have a formatted file of Entity patterns we can use for ner training.
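
Each line of the patterns file pairs our HOAX label with a token pattern. A quick way to sanity-check it in Python — the pattern in the comment illustrates the format, not an actual line from my file:

import json
# Lines look roughly like: {"label": "HOAX", "pattern": [{"lower": "hoax"}]}
with open("hoax_patterns.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["label"], entry["pattern"])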

Training the ner Component of the spaCy Pipeline

Figure 5: Prodigy ner.manual Method Usage
$ prodigy ner.manual hoax_conspiracy blank:en ./conspiracy_3000.jsonl --label HOAX --patterns ./hoax_patterns.jsonl

We use the ner.manual method for annotating our training text. This requires us to specify a new database, a language model, the text we will be training from, the Entity label, and the patterns file built from the examples we collected. Since we are creating a new model, we again pass the method a blank language model.

After executing the above command you can once again use your web browser, this time to annotate full texts. Now, instead of giving us words to choose from, Prodigy pulls a line of text from our training file and highlights the entities it thinks are present, seeded by our patterns. If a suggestion is wrong we correct the mistake, and those corrections become training examples. If things are going well, the suggested entities should match what we want more and more often. After annotating in the browser, Prodigy saves everything you’ve done to the database you’ve specified.

Typically I try to annotate about 500 text examples before calculating model metrics. While working through this model I noticed that the predicted Entities were not improving; performance was pretty flat. At around 300 examples I decided to stop and run some quick metrics to see if continuing was worth my time.

Model Metrics

Figure 6: Prodigy train Method Usage
$ prodigy train ner hoax_conspiracy en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./temp_model --eval-split 0.3

The train method takes all the information we’ve collected and uses it to train the spaCy pipeline. It takes our previously created database, a vectors model, a pretrained-weights file, an output path, and a train/test split ratio. Since we’ve used sense2vec, we are required to use the accompanying en_vectors_web_lg vectors. We also pass the pretrained weights ./tok2vec_cd8_model289.bin, which spaCy trained to predict the en_vectors_web_lg vectors. The --eval-split value of 0.3 holds out 30% of the annotations for evaluation.
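
The --output directory holds an ordinary spaCy model, so once training finishes you can load it and try it on new text. A minimal sketch — the title is an invented example:

import spacy
nlp = spacy.load("./temp_model")  # the model Prodigy wrote to --output
doc = nlp("Officials admit the evacuation order was an elaborate hoax")  # invented title
for ent in doc.ents:
    print(ent.text, ent.label_)  # any spans the model tags as HOAX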

F-Score

Figure 7: F-Score Values for 10 Training Loops

By default Prodigy makes ten training passes, calculating the F-Score after each pass and adjusting the model. We can see from the above metrics that my intuition was correct: this model isn’t performing very well. The question I now have to answer is whether to continue training this model or to try something else.
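
For reference, the F-Score Prodigy reports is the harmonic mean of precision and recall. A tiny helper makes the relationship concrete — the numbers below are illustrative, not this model’s actual metrics:

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean: stays low if either precision or recall is low
    return 2 * precision * recall / (precision + recall)
print(round(f1_score(0.40, 0.25), 3))  # 0.308 -- illustrative numbers only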

Train-Curve Metrics

$ prodigy train-curve ner hoax_conspiracy en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --eval-split 0.3
Figure 8: Train-Curve Model Metrics

Prodigy provides a great method called train-curve designed specifically to answer that question: keep training, or try another approach? It trains four models with 25%, 50%, 75%, and 100% of our data, showing how the model improves as it is given more of it. We can see from the above metrics that accuracy increases in the final step, which typically means that training with more examples will improve the model’s accuracy.

Remember that we are looking for Entities that are text examples of misinformation. Even when pulling data from r/Conspiracy on Reddit, a place rife with misinformation, picking up the signal of misinformation is very difficult.

This model is good enough to keep training, and because we pickled it we can pick up right where we left off. Going forward I think I will use a different training corpus; r/Conspiracy has treated us fairly well, but there are other sources of misinformation I’d like to mine. This experiment required about an hour of my time, and now I can take what I’ve learned and adjust my training strategy accordingly. That’s what makes Prodigy so great: fast training and quick metrics for data scientists looking to create and refine NLP models.

Figure 9: Noise, Not the Signal

Another great benefit of NLP is the hilarious things you get to read. How else could I have discovered that camels are conspiring to control the world’s water supply?
