Building a custom Named Entity Recognition model using SpaCy

Eric Landstein
Feb 25, 2020 · 10 min read

Named Entity Recognition (NER) is the information extraction task of identifying and classifying mentions of locations, quantities, monetary values, organizations, people, and other named entities within a text. It is a core component of many natural language processing (NLP) applications.


In this post, I will review the technologies and steps involved in creating a custom Named Entity Recognition model using spaCy’s Named Entity Recognition library. I will annotate a custom data set containing submissions from the subreddit called HardwareSwap. Then, I will use the data to train my model to label entities in the submissions, such as product, price, location, zip code, product condition, and URL. Below, you can see an example output of my fully trained model with predictions for the target labels.

Fully trained model predictions

Recurrent Neural Networks:

Recurrent Neural Networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors (x1, x2, …, xn) and return another sequence that represents some information about every step in the input sequence. Although they can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards their most recent inputs. Long Short-Term Memory networks (LSTMs) were designed to combat this issue by incorporating a memory cell that captures long-range dependencies. Using several gates, LSTMs control the proportion of the input that is given to the memory cell, as well as the proportion of the previous state that should be forgotten. (Source:

For a given sequence containing n words (x1, x2, …, xn), each represented as a d-dimensional vector, an LSTM computes a representation vector (ht) of the left-hand context of the sentence at every word t. Naturally, generating a representation vector of the right-hand context should also yield useful information, and this can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the “forward LSTM” and the latter as the “backward LSTM.” These are two distinct networks with different parameters. A pair consisting of a forward LSTM and a backward LSTM is referred to as a bidirectional LSTM (Graves and Schmidhuber, 2005).
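To make the gates and the two reading directions concrete, here is a minimal sketch in plain Python. It is only an illustration under simplifying assumptions: the inputs and states are single numbers rather than d-dimensional vectors, it uses the common textbook LSTM formulation (which may differ from the variant in the cited source), and the parameter values and the `make_params` helper are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """Advance a scalar LSTM by one time step and return (h, c)."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g  # memory cell: keep part of the old, add part of the new
    h = o * math.tanh(c)    # hidden state passed on to the next step
    return h, c


def run_lstm(xs, p):
    """Read xs in order and return the hidden state after every step."""
    h, c, states = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        states.append(h)
    return states


def run_bilstm(xs, p_forward, p_backward):
    """Pair each position's left-context state with its right-context state."""
    forward = run_lstm(xs, p_forward)
    # The backward LSTM reads the reversed sequence; flip its output back
    # so that index t lines up with word t.
    backward = run_lstm(xs[::-1], p_backward)[::-1]
    return list(zip(forward, backward))


def make_params(value):
    """Hypothetical helper: give every weight and bias the same value."""
    return {a + b: value for a in ("w", "u", "b") for b in ("i", "f", "o", "g")}


# Two distinct parameter sets, one per direction (values are arbitrary).
P_FORWARD = make_params(0.5)
P_BACKWARD = make_params(-0.3)

states = run_bilstm([1.0, 2.0, 3.0], P_FORWARD, P_BACKWARD)
```

Each entry of `states` is a (forward, backward) pair, so the representation of word t reflects both what came before it and what comes after it.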

Embed, Encode, Attend, and Predict

“When people think about machine learning improvements, they usually think first of efficiency and accuracy, but the most important dimension is generality. If you want to write a program to flag abusive posts on a social media platform, you should be able to generalize the problem to ‘I need to take text and predict a class ID.’ It shouldn’t matter whether you’re flagging abusive posts or tagging emails that propose meetings; if two tasks take the same type of input and produce the same type of output, then we should be able to reuse the same model code and get different behavior by plugging in different data, like playing different games that use the same engine.”

“Embed, Encode, Attend, Predict” is Matthew Honnibal’s conceptual framework for deep learning in natural language processing. I highly recommend reading his full write-up.

Using the “embed, encode, attend, and predict” playbook, we can make accurate statistical predictions about named entities.


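Most neural network models begin by breaking up a text into individual words, phrases, or whole sentences: a process known as tokenizing. Each piece is then turned into a vector. For example, if we read text as a sequence of single-byte characters, there are 256 possible values, so each character can be represented as a 256-dimensional “one-hot” vector: the vector for “a” is all zeros except for a 1 at index 97, the vector for “b” has its 1 at index 98, the vector for “c” has its 1 at index 99, and so on. The “embed” step maps these long, sparse vectors to shorter, dense ones by looking them up in an embedding table.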
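As a rough illustration of the character example, the sketch below builds a one-hot vector and looks characters up in an embedding table. The table is filled with random numbers purely for demonstration (a real model learns these values), and `EMBED_DIM` is an arbitrary size I picked to keep things small.

```python
import random

VOCAB_SIZE = 256  # one slot per possible single-byte character value
EMBED_DIM = 4     # hypothetical embedding width, chosen only for readability


def one_hot(char):
    """Return a sparse vector with a single 1 at the character's code."""
    vector = [0] * VOCAB_SIZE
    vector[ord(char)] = 1  # assumes ord(char) < 256
    return vector


# A stand-in embedding table: one dense row per character value.
_rng = random.Random(0)
EMBEDDING_TABLE = [
    [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(text):
    """Map each character to its dense row in the embedding table."""
    return [EMBEDDING_TABLE[ord(char)] for char in text]


sparse = one_hot("a")    # 256 numbers, almost all zero
dense = embed("crane")   # 5 rows of EMBED_DIM numbers each
```

Looking up a row by index gives the same result as multiplying the one-hot vector by the table, which is why the lookup is the cheaper way to do it.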



One of the many problems in NLP is how to understand a word’s context. In some cases, a word can have completely different meanings depending on the surrounding words. For example, the word “crane” can be used in the sentence “That bird is a crane,” referencing an animal, or “They had to use a crane to lift the object,” referencing a piece of machinery. In these two sentences, the spelling of “crane” is the same, but its meaning is different.

To solve this problem, we can encode the sequence into a matrix in which each row’s vector represents a token (in word, phrase, or sentence form) in the context of the rest of the sentence.


The technology used for this purpose is a bidirectional RNN, which works in two passes. The first is a forward pass (reading the tokens from left to right), and the second is a backward pass (reading the tokens from right to left). Reading in both directions lets each token’s representation reflect the words on either side of it. To get the full vector for each token, the system combines the outputs of the forward pass and the backward pass, typically by concatenating them.

The important point here is that a full vector represents a token in the context of the phrase, sentence, or paragraph. This is a big issue that bidirectional RNNs have been able to solve.


The “attend” step takes the matrix that was produced in the “encode” step, which combines the vectors output by the forward pass and the backward pass, and shrinks it down into a single vector so that it can be passed to a standard feed-forward network. A feed-forward network passes information in one direction only, from input to output, unlike the bidirectional RNN used in the encode step.


When the matrix is reduced to a vector, some information is necessarily lost. This is why the context vector is so important: it tells the algorithm which information matters and which can be discarded. In a book, for example, a word on page 3 is unlikely to have much contextual impact on a word on page 7. The “attend” step tells the algorithm which information to let go of to avoid information overload.
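One common way to perform this reduction is a softmax-weighted sum, sketched below in plain Python. This is an illustration of the general idea rather than spaCy’s internal implementation, and the matrix and context vector are made-up numbers.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(sentence_matrix, context_vector):
    """Reduce a matrix (one row per token) to a single summary vector."""
    # Score each row by how well it lines up with the context vector.
    scores = [
        sum(r * c for r, c in zip(row, context_vector))
        for row in sentence_matrix
    ]
    weights = softmax(scores)
    width = len(sentence_matrix[0])
    # Weighted sum of the rows: low-weight rows contribute very little.
    summary = [
        sum(w * row[j] for w, row in zip(weights, sentence_matrix))
        for j in range(width)
    ]
    return summary, weights


# Hypothetical 3-token sentence with 2-dimensional token vectors.
matrix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = [2.0, 0.0]
summary, weights = attend(matrix, context)
```

The row that best matches the context vector receives the largest weight, so it dominates the summary, while the other rows are mostly "let go of."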


The output layer will be of a fixed size, depending on how many labels you are predicting. For example, if you are predicting whether a sentence is about cats or dogs, the output layer will have two labels: “Cat” and “Dog.” Once the text has been reduced to a single vector, the model will predict the target label.
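The cat-versus-dog example can be sketched as a single output layer followed by a softmax. The weights and biases below are hand-picked, hypothetical values (a trained model would learn them), and the two-number input stands in for the vector produced by the “attend” step.

```python
import math

LABELS = ["Cat", "Dog"]

# Hypothetical parameters: one row of weights and one bias per label.
WEIGHTS = [[2.0, -1.0], [-1.0, 2.0]]
BIASES = [0.0, 0.0]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(summary_vector):
    """Score every label and return the most probable one."""
    scores = [
        sum(w * x for w, x in zip(row, summary_vector)) + bias
        for row, bias in zip(WEIGHTS, BIASES)
    ]
    probabilities = softmax(scores)
    best = max(range(len(LABELS)), key=probabilities.__getitem__)
    return LABELS[best], probabilities


label, probabilities = predict([1.0, 0.0])
```

The output always has exactly one probability per label, no matter how long the original text was.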


As important as the statistical model is, it is equally important to have training data that cover the entities you are trying to label, and this is one of the reasons why Named Entity Recognition has been slow to catch on. To make the best use of a Named Entity Recognition model, make sure that the training data are relevant, up to date, and annotated with the entities you are trying to label.


To get the training data for this project, I have been using the Python Reddit API Wrapper (PRAW) to collect all submissions from the subreddit r/hardwareswap and store them in a SQL server. I reviewed these steps in a previous article. At the time of writing, I have about 16,000 submissions, which is plenty of data to start training with.
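For readers who have not seen that earlier article, the collection step could look roughly like the sketch below. It is not my original pipeline: it assumes SQLite as the database, a made-up table layout, and placeholder credentials that you would supply yourself. The function names are hypothetical; the PRAW calls (`praw.Reddit`, `subreddit(...).new(...)`) are part of its public API.

```python
import sqlite3
from collections import namedtuple


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "id TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )


def save_submissions(conn, submissions):
    """Store anything that exposes .id, .title and .selftext."""
    rows = [(s.id, s.title, s.selftext) for s in submissions]
    # INSERT OR IGNORE skips submissions that were already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO submissions VALUES (?, ?, ?)", rows
    )
    conn.commit()


def fetch_new_submissions(client_id, client_secret, user_agent, limit=100):
    """Pull the newest r/hardwareswap posts (needs Reddit API credentials)."""
    import praw  # third-party; imported here so the storage helpers work without it

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    return reddit.subreddit("hardwareswap").new(limit=limit)


# Quick offline check with made-up submissions instead of live Reddit data.
FakeSubmission = namedtuple("FakeSubmission", "id title selftext")
conn = sqlite3.connect(":memory:")
create_table(conn)
save_submissions(conn, [
    FakeSubmission("abc1", "[USA-NY] [H] GTX 1080 Ti [W] PayPal", "Asking $400."),
    FakeSubmission("abc2", "[USA-CA] [H] PayPal [W] 16GB DDR4", "Local or shipped."),
    FakeSubmission("abc1", "[USA-NY] [H] GTX 1080 Ti [W] PayPal", "Asking $400."),
])
stored = conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
```

With real credentials, `save_submissions(conn, fetch_new_submissions(...))` would store live posts the same way.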

The code for the steps below can be found on spaCy’s website and on my GitHub; the examples and use cases provided there are excellent.

For this project, I will start with a blank English NER model. The code below sets model to None, meaning that there is no existing model to load; output_dir indicates the location where the trained model will be saved; and n_iter indicates how many times the model will iterate over the training data.

In this example, because there is no model, a blank English model will be created.
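A minimal sketch of that setup is shown here. The three variable names come from the description above, but the folder name, the iteration count, and the `load_or_create` helper are placeholders of mine, not necessarily the values used in this project.

```python
from pathlib import Path

model = None                           # no existing model to start from
output_dir = Path("hardwareswap_ner")  # hypothetical folder for the trained model
n_iter = 100                           # hypothetical number of training passes


def load_or_create(model_name):
    """Load an existing pipeline, or create a blank English one."""
    import spacy  # imported here so the settings above can be read without spaCy

    if model_name is not None:
        return spacy.load(model_name)
    return spacy.blank("en")  # empty English pipeline, no NER component yet
```

Calling `load_or_create(model)` with `model = None` takes the second branch and returns the blank English pipeline.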

Since this is a supervised learning problem, the training data will need to be annotated manually. This requires you to go submission by submission and label each span of text that you want your model to identify as a specific entity.

First Annotated Text:

The annotating process is pretty straightforward. I set the first piece of training data to be equal to x (which can be seen just after the left-hand parenthesis). Then, I just need to identify the character indexes that each entity falls between and enter its label. For example, the PRODUCT category falls between indexes 8–28, as well as indexes 74–89, as marked off in the picture above.
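Counting character indexes by hand is easy to get wrong, so a small helper can compute them instead. The sketch below produces spaCy’s (text, {"entities": [(start, end, label)]}) training format. The sample text is invented for illustration (it is not one of my annotated submissions), the `find_span` and `annotate` helpers are hypothetical, and label names other than PRODUCT are assumed spellings.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


def annotate(text, labelled_phrases):
    """Build one training example in spaCy's entity-offset format."""
    entities = [find_span(text, phrase, label) for phrase, label in labelled_phrases]
    return (text, {"entities": entities})


# Made-up submission title in the style of r/hardwareswap.
sample_text = "[USA-NY] [H] GTX 1080 Ti [W] $400 PayPal"
x = annotate(sample_text, [
    ("NY", "LOCATION"),
    ("GTX 1080 Ti", "PRODUCT"),
    ("$400", "PRICE"),
])

TRAIN_DATA = [x]  # later examples (y, z, ...) are appended to this list
```

Slicing the text with each (start, end) pair should give back exactly the phrase that was labelled, which is a quick sanity check before training.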

After that, I am ready to train the model.

The code below will train on the data using the blank English NER model I created above:
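The sketch below shows what such a training loop can look like. One important assumption: it is written against the spaCy 3.x API (`Example.from_dict`, `nlp.initialize`), which was released after this post was written; earlier 2.x releases use a different training API, so treat this as an adaptation rather than the exact code I ran. The `collect_labels` and `train_ner` names are my own.

```python
import random


def collect_labels(train_data):
    """Gather every entity label that appears in the annotations."""
    return sorted({
        label
        for _text, annotations in train_data
        for _start, _end, label in annotations["entities"]
    })


def train_ner(train_data, n_iter=100, output_dir=None):
    """Train a blank English NER pipeline (assumes spaCy 3.x is installed)."""
    import spacy
    from spacy.training import Example

    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    for label in collect_labels(train_data):
        ner.add_label(label)

    # Pair each raw text with its gold-standard annotations.
    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    optimizer = nlp.initialize(get_examples=lambda: examples)

    for _ in range(n_iter):
        random.shuffle(examples)
        losses = {}
        # Dropout makes it harder for the model to memorise the tiny data set.
        nlp.update(examples, drop=0.5, sgd=optimizer, losses=losses)

    if output_dir is not None:
        nlp.to_disk(output_dir)
    return nlp
```

With only a handful of examples, all of them fit in a single update per iteration; larger data sets are usually split into minibatches.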

Testing the model:

The code below loads the model that has now been trained on the one annotated submission, x. Since I annotated only one submission, I can’t expect any significant results, but the model was able to identify the Location:
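Loading a saved model and listing what it found takes only a few lines. In this sketch, `entities_in` and `load_and_tag` are hypothetical helper names, and `model_dir` stands for whatever folder the model was saved to.

```python
def entities_in(nlp, text):
    """Run a pipeline over text and list (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def load_and_tag(model_dir, text):
    """Load a saved spaCy pipeline from disk and tag one text."""
    import spacy  # third-party; only needed when a real model is loaded

    nlp = spacy.load(model_dir)
    return entities_in(nlp, text)
```

Printing the result of `load_and_tag` for a submission shows at a glance which labels the model already recognizes and which ones it misses.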

To build on the model, I will annotate a second submission, y, and add it to the training data.

Submission y:

All training data, now including submissions x and y:

After I’ve trained the model a second time with both submissions, it identifies Location, Price, URL, and Username correctly. However, it still does not identify Product correctly:

In order to make the model’s predictions more accurate, I will need to give it more training data.

Annotating submission z:

Retraining the model with training data x, y, and z:

After I’ve trained the model a third time with submissions x, y, and z, it labels Location, Price, and URL correctly. For the most part, the Product labels are correct, but they also capture a few other words. Additionally, the Condition label does capture the condition, but it also captures the specs, which is incorrect.

The above tests were all run on the training data. To see how well the model is really performing at this point, it is necessary to run a test on some unseen data. To do this, I will run the model on the following text:


Considering the limited training data, it is impressive what this custom SpaCy NER model is capable of labeling correctly. I recommend that you train with at least a few hundred annotated texts before running a model.
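To put a number on how well the model handles unseen text, one option is to compare its predicted spans against hand-annotated ones and compute precision, recall, and F1. The sketch below uses exact matching (same start, end, and label), which is a strict criterion I am assuming here; the spans in the example are invented.

```python
def score_entities(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical spans for one unseen submission.
gold_spans = [(13, 24, "PRODUCT"), (29, 33, "PRICE")]
predicted_spans = [(13, 24, "PRODUCT"), (25, 33, "PRICE")]  # PRICE span starts too early

scores = score_entities(predicted_spans, gold_spans)
```

Under exact matching, a span that captures a few extra words counts as a miss, which makes this a useful check for labels that tend to run long.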

Annotating the data is a long and tedious process, and there are tools out there that will make the process more efficient. After spending a lot of time annotating the data from r/hardwareswap and improving my custom model, the model is able to make predictions that are very accurate. Below are a few examples:

For additional examples, and to follow my progress with this project, check out the GitHub repo.


The Startup


Written by Eric Landstein, Data Scientist
