Custom NER for Extracting Disease Entities

Learn how to extract disease names from unstructured text using spaCy’s NER.

sangeetha natarajan
Analytics Vidhya
Sep 6, 2021

As the volume of digital data keeps growing, extracting useful information from it remains a daunting task, especially when the data is unstructured text.

So, instead of going through every article or other text resource line by line, how cool would it be if we could extract meaningful entities and specific details with just a few lines of code?

This is where NER comes into play!

What is NER?

Named Entity Recognition is the process of extracting predefined entities like the name of a person, location, organization, time, date, etc. from unstructured text data.

NER on Tokyo Olympics Wikipedia article

Suppose we have the text below:

“Facebook is an American online and social networking service founded in 2004 by Mark Zuckerberg.”

Performing NER on this text extracts entities such as ORG, PERSON, and YEAR:

Facebook ORG
2004 YEAR
Mark Zuckerberg PERSON

The entities extracted can then be grouped together and used for a wide range of information extraction tasks.
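As a small illustration of that grouping step, the (text, label) pairs from the Facebook example can be collected by label with plain Python. The helper name group_by_label is made up for this sketch.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the Facebook example.
entities = [
    ("Facebook", "ORG"),
    ("2004", "YEAR"),
    ("Mark Zuckerberg", "PERSON"),
]


def group_by_label(pairs):
    """Collect entity texts under their label (hypothetical helper)."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


print(group_by_label(entities))
```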

In this article, I use spaCy to perform custom Named Entity Recognition and identify disease names in a text article.

So let's get started!

Custom NER using spaCy

spaCy is a free, open-source Python library for Natural Language Processing that supports a wide range of NLP tasks such as NER, POS tagging, dependency parsing, and word vectors.

Let's first install and import spaCy.
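A minimal sketch of this step, assuming a pip-based environment and the small English pipeline en_core_web_sm. The article does not name the model it uses, so that name is an assumption.

```python
# Run these once in a shell before importing (assumed setup):
#   pip install -U spacy
#   python -m spacy download en_core_web_sm
import spacy

# Printing the version is a quick way to confirm the install worked.
print(spacy.__version__)
```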

Load spaCy and check whether the pipeline has an NER component.
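One way to do this check, again assuming the en_core_web_sm pipeline. The helper ner_status is a made-up name, and the guard only skips loading when the model has not been downloaded.

```python
import spacy

MODEL = "en_core_web_sm"  # assumed model name, not given in the article


def ner_status(nlp):
    """Return the pipeline's component names and whether 'ner' is one of them."""
    return nlp.pipe_names, nlp.has_pipe("ner")


if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    print(ner_status(nlp))
else:
    print(f"{MODEL} is not installed; run: python -m spacy download {MODEL}")
```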

Now let's perform NER with spaCy’s default entity labels on a short text about the Tokyo Olympics.
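A sketch of what that call might look like. The sample sentence is a made-up stand-in for the article's Tokyo Olympics text, the model name is assumed, and extract_entities is a hypothetical helper. No output is shown because the predicted labels depend on the model version.

```python
import spacy

MODEL = "en_core_web_sm"  # assumed model name

# Made-up stand-in sentence, not the article's original text.
sample_text = "The 2020 Summer Olympics were held in Tokyo, Japan, in 2021."


def extract_entities(nlp, text):
    """Run the pipeline on text and return (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    for ent_text, label in extract_entities(nlp, sample_text):
        print(ent_text, label)
```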

The entities extracted are,

spaCy can also describe what each entity label means.

As we can see, spaCy does a pretty good job of extracting entities here. Let's try the same on a different text and see how it fares.
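For instance, spacy.explain looks up a short description for a label. The labels queried here are just a sample of spaCy's default entity labels.

```python
import spacy

# spacy.explain returns a short description of a label, or None if it is unknown.
for label in ("GPE", "ORG", "PERSON", "DATE"):
    print(label, "->", spacy.explain(label))
```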

The entire text data can be downloaded here.

NER on the above disease_text,

As we can see, disease names are wrongly classified as GPE or ORG. For example, Malaria is labeled GPE (a geopolitical entity), which doesn't make sense.

Let's see how we can improve the above model and make spaCy classify diseases by using a new label, DISEASE.

Updating NER with new label “DISEASE”

Define TRAIN_DATA with custom-defined entities,
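A minimal sketch of such a list, using two made-up sentences rather than the article's disease text. The offsets are character positions, and the loop at the end is only a sanity check added for this sketch.

```python
# Each item is (text, {"entities": [(start_char, end_char, label), ...]}).
TRAIN_DATA = [
    (
        "Malaria is spread by infected mosquitoes.",
        {"entities": [(0, 7, "DISEASE")]},
    ),
    (
        "Symptoms of dengue and cholera can overlap.",
        {"entities": [(12, 18, "DISEASE"), (23, 30, "DISEASE")]},
    ),
]

# Sanity check: print the text that each span covers.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(repr(text[start:end]), label)
```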

The custom-defined entities must be specified in the above format: each training example pairs a text with a dictionary whose ‘entities’ key holds the list of entities identified in that text. Each entity in the list is a tuple made of the entity's start and end character offsets (e.g., (654, 661) is the span of the word malaria in the text_disease data) and its label.

You can either specify the entities manually or use one of the NER annotation tools listed in the article linked here.

The span of a word can also be found with a few lines of code.
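One possible way to do this uses Python's re module. The helper find_spans and the sample sentence are made up for illustration, and the whole-word, case-insensitive matching is a design choice of this sketch rather than something the article specifies.

```python
import re


def find_spans(text, word):
    """Return (start, end) character offsets of every whole-word match of word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


sample = "Malaria is common here, and malaria prevention matters."
for start, end in find_spans(sample, "malaria"):
    print(start, end, sample[start:end])
```

Each (start, end) pair can be combined with a label to form one entry of the ‘entities’ list.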

Now let's add the new label to the NER component.
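A sketch of this step under the spaCy v3 API, where components are fetched with nlp.get_pipe and added by name with nlp.add_pipe. The helper add_label and the model name are assumptions of this example.

```python
import spacy

MODEL = "en_core_web_sm"  # assumed model name


def add_label(nlp, label="DISEASE"):
    """Add label to the pipeline's NER component, creating the component if missing."""
    ner = nlp.get_pipe("ner") if nlp.has_pipe("ner") else nlp.add_pipe("ner")
    ner.add_label(label)
    return ner


if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    ner = add_label(nlp)
    print("DISEASE" in ner.labels)
```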

Training the NER

Here we train the entities on the existing spaCy model instead of a blank spaCy model. When an existing model is used, all other pipeline components have to be disabled during training with nlp.disable_pipes so that only the NER component is updated and the others are left untouched.

Let's disable the pipeline components that should not change and begin training!

We use nlp.update to make a prediction for each training example. It then compares the prediction with the reference entities in the annotations. If the prediction is right, the weights stay as they are; if it is wrong, the weights are adjusted so that the model makes a better prediction next time.
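A hedged sketch of such a training loop, written against the spaCy v3 API. In v3, nlp.update takes Example objects, and nlp.select_pipes is the counterpart of nlp.disable_pipes. The toy TRAIN_DATA, the model name, the helper train_ner, and the iteration count, dropout, and batch size are all assumptions of this sketch, not the article's settings.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

MODEL = "en_core_web_sm"  # assumed model name

# Made-up toy examples: (text, {"entities": [(start_char, end_char, label)]}).
TRAIN_DATA = [
    ("Malaria is spread by mosquitoes.", {"entities": [(0, 7, "DISEASE")]}),
    ("He was treated for dengue last year.", {"entities": [(19, 25, "DISEASE")]}),
    ("Cholera outbreaks often follow floods.", {"entities": [(0, 7, "DISEASE")]}),
]


def train_ner(nlp, train_data, n_iter=20, drop=0.3, seed=0):
    """Update only the 'ner' component of nlp and return the last epoch's losses."""
    random.seed(seed)
    data = list(train_data)  # copy so the caller's list is not reordered
    other_pipes = [name for name in nlp.pipe_names if name != "ner"]
    losses = {}
    # Everything except 'ner' is disabled inside this block and restored afterwards.
    with nlp.select_pipes(disable=other_pipes):
        optimizer = nlp.resume_training()
        for _ in range(n_iter):
            random.shuffle(data)
            losses = {}
            for batch in minibatch(data, size=2):
                examples = [
                    Example.from_dict(nlp.make_doc(text), annotations)
                    for text, annotations in batch
                ]
                nlp.update(examples, sgd=optimizer, drop=drop, losses=losses)
    return losses


if spacy.util.is_package(MODEL):
    nlp = spacy.load(MODEL)
    nlp.get_pipe("ner").add_label("DISEASE")
    print(train_ner(nlp, TRAIN_DATA))
```

Three sentences are far too few for a useful model. The point is only to show where the label, the disabled components, and nlp.update fit together.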

Training is an iterative process in which the model’s predictions are compared against the reference annotations in order to estimate the gradient of the loss. The gradient of the loss is then used to calculate the gradient of the weights through backpropagation. The gradients indicate how the weight values should be changed so that the model’s predictions become more similar to the reference labels over time.

Source: https://spacy.io/usage/training

Testing the NER

Now let's test the NER on a new text related to diseases and see the results.

The entities extracted are,

Not bad!

As we can see, the model has correctly extracted most of the entities. Although I did not use Pneumonia or Chikungunya during training, it has done a good job of predicting them as well.

I hope this article helped you understand how to perform custom Named Entity Recognition using spaCy. Try it yourself on different datasets and let me know your comments and feedback below.

Source code can be found here.

Thank you for reading!
