Custom NER for Extracting Disease Entities
Learn how to extract disease names from unstructured text using spaCy's NER.
With ever-increasing digital data, extracting useful information from large volumes of text has become a daunting task, especially when the text is unstructured.
So, instead of going through every article or other text resource line by line, how convenient it would be if we could extract meaningful entities and specific details with just a few lines of code.
This is where NER comes into play!
What is NER?
Named Entity Recognition is the process of extracting predefined entities like the name of a person, location, organization, time, date, etc. from unstructured text data.
Suppose that we have the below text,
“Facebook is an American online and social networking service founded in 2004 by Mark Zuckerberg.”
Performing NER on the above text will extract entities like ORG, PERSON, and DATE, as below:
Facebook ORG
2004 DATE
Mark Zuckerberg PERSON
The entities extracted can then be grouped together and used for a wide range of information extraction tasks.
In this article, I have used spaCy to perform custom Named Entity Recognition and identify disease names in a text article.
So let's get started!
Custom NER using spaCy
SpaCy is a free, open-source library for Natural Language Processing in Python that can be used for a wide range of NLP tasks like NER, POS tagging, dependency parsing, word vectors, etc.
Let's first install and import spaCy.
# installing and importing necessary libraries
!pip install -U spacy
import spacy
# for NER visualization
from spacy import displacy
Load spaCy and check that its pipeline includes NER.
nlp = spacy.load('en_core_web_sm')
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Now let's perform NER using spaCy's default entities on a small text about the Tokyo Olympics.
#text with details of tokyo olympics
text_olympics = """ The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.
Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013.[2] Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games
"""#ner on tokyo olympics text
doc=nlp(text_olympics)
for ent in doc.ents:
print(ent.text,ent.label_)
The entities extracted are,
The 2020 Summer Olympics WORK_OF_ART
Japanese NORP
2020年夏季オリンピック CARDINAL
Hepburn PERSON
Nisen Nijū-nen PERSON
Kaki Orinpikku PERSON
the Games of the XXXII Olympiad ( EVENT
Dai Sanjūni-kai PERSON
Orinpikku Kyōgi Taikai PERSON
Tokyo GPE
2020 DATE
23 July to 8 August 2021 DATE
Tokyo GPE
Japan GPE
21 July DATE
Tokyo GPE
the 125th IOC Session FAC
Buenos Aires GPE
Argentina GPE
7 September 2013.[2 DATE
24 July to 9 August 2020 DATE
2021 DATE
March 2020 DATE
first ORDINAL
the Olympic Games EVENT
For a description of an entity label,
# description of entities
spacy.explain("EVENT")
'Named hurricanes, battles, wars, sports events, etc.'
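Since displacy was imported earlier, the extracted entities can also be visualized inline. A minimal sketch, using a blank pipeline with one hand-labelled entity so it runs without a downloaded model; on a real doc you would simply call `displacy.render(doc, style="ent")`:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline with one hand-labelled entity, so this demo needs no downloaded model
nlp_viz = spacy.blank("en")
doc = nlp_viz("Tokyo hosted the 2020 Summer Olympics.")
doc.ents = [Span(doc, 0, 1, label="GPE")]  # mark the token "Tokyo" as GPE by hand

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
print("GPE" in html)  # the markup highlights the entity with its label
```

In a notebook, `displacy.render` draws the highlighted entities directly below the cell.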
As we can see, spaCy has done a pretty good job of extracting entities. Let's try the same on a different piece of text and see how it looks.
text_disease = """Based on the statistics from WHO and the Centers for Disease Control and Prevention, here are the five most common infectious diseases.
According to current statistics, hepatitis B is the most common infectious disease in the world, affecting some 2 billion people -- that's more than one-quarter of the world's population. This disease, which is characterized by an inflammation of the liver that leads to jaundice, nausea, and fatigue, can lead to long-term complications such as cirrhosis of the liver or even liver cancer.
.
.
.
however, there were still 8.6 million new cases of TB reported last year, and roughly one-third of the world's population carries a latent form of TB,
meaning they've been infected but aren't ill and can't transmit the disease yet. """
The entire text data can be downloaded here.
Running NER on the above text_disease,
doc = nlp(text_disease)
for ent in doc.ents:
    print(ent.text, ent.label_)

WHO ORG
the Centers for Disease Control and Prevention ORG
five CARDINAL
some 2 billion CARDINAL
more than one-quarter CARDINAL
about 350 million CARDINAL
Malaria GPE
more than 500 million CARDINAL
annually DATE
between 1 million and 3 million QUANTITY
second ORDINAL
annual DATE
Malaria GPE
TB ORG
As we can see here, the entities relating to disease names are wrongly classified as GPE or ORG. For example, Malaria is classified as GPE, which doesn't make sense.
Let's see how we can improve the above model and make spaCy classify diseases by using a new label, DISEASE.
Updating NER with new label “DISEASE”
# Load pre-existing spacy model
import spacy
nlp=spacy.load('en_core_web_sm')
# Getting the pipeline component for ner
ner=nlp.get_pipe("ner")
ner
Define TRAIN_DATA with custom-defined entities,
TRAIN_DATA = [
    (text_disease, {"entities": [(654, 661, 'DISEASE'), (1890, 1896, 'DISEASE'),
                                 (2311, 2323, 'DISEASE'), (1382, 1391, 'DISEASE'),
                                 (406, 414, 'DISEASE'), (2539, 2543, 'DISEASE'),
                                 (168, 177, 'DISEASE'), (2325, 2327, 'DISEASE'),
                                 (518, 524, 'DISEASE')]})
]
The custom-defined entities should be specified in the above format: each training example pairs a text with a dictionary whose 'entities' key lists the entities identified in that text. Each entity tuple holds the character span of a word (e.g. (654, 661) is the span of the word "malaria" in the text_disease data) and its corresponding label.
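One pitfall worth checking: spaCy can only use a span whose character offsets line up with token boundaries. `Doc.char_span` returns None for misaligned offsets, which makes it a handy sanity check for hand-written spans. A minimal sketch on a short sample sentence, using a blank pipeline so it runs without a downloaded model:

```python
import spacy

# Blank pipeline: tokenizer only, no downloaded model required for this check
nlp_check = spacy.blank("en")
sample = "Malaria is a disease caused by a parasite."
doc = nlp_check.make_doc(sample)

# char_span returns None when (start, end) do not line up with token boundaries,
# so it is a quick sanity check for hand-written entity spans
span_ok = doc.char_span(0, 7, label="DISEASE")   # "Malaria" -- aligned
span_bad = doc.char_span(0, 4, label="DISEASE")  # cuts the token in half

print(span_ok)   # Malaria
print(span_bad)  # None
```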
You can either specify the entities manually or can go for any of the NER annotation tools listed in this article here.
To find the span of a word manually, the below lines of code can be used.
import re

string = text_disease
match = re.search("Malaria", string)
print('%d,%d' % (match.start(), match.end()))
654,661
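When a disease name appears more than once (TB, for example, occurs twice in the text), `re.finditer` can collect every occurrence in one pass. A minimal sketch on a short sample string:

```python
import re

sample = ("hepatitis B is the most common infectious disease; 8.6 million new "
          "cases of TB were reported, and many carry a latent form of TB.")

# finditer yields one match object per occurrence, each with character offsets
spans = [(m.start(), m.end(), "DISEASE") for m in re.finditer(r"TB", sample)]
print(spans)  # two (start, end, label) tuples, one per occurrence of "TB"
```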
Now let's add the new labels to NER,
# Add the new labels to ner
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
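To confirm the new label was registered, check `ner.labels`. A minimal sketch with a blank pipeline so it runs without a downloaded model; the same check works on the en_core_web_sm "ner" component we updated above:

```python
import spacy

# Sketch on a blank pipeline; the same check applies to a loaded model's "ner"
nlp_demo = spacy.blank("en")
ner_demo = nlp_demo.add_pipe("ner")
ner_demo.add_label("DISEASE")

print(ner_demo.labels)  # ('DISEASE',)
```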
Training the NER
Here we will be using the existing spaCy model for training the entities instead of a blank spaCy model. Since an existing model is being used, we have to disable all other pipeline components during training using nlp.disable_pipes so that only the NER gets trained and the others are ignored.
Disable pipeline components that need not be changed and let's begin training!
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

# import necessary libraries for training
import random
from spacy.util import minibatch, compounding
from spacy.training import Example

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*unaffected_pipes):
    sizes = compounding(1.0, 4.0, 1.001)
    # Training for 100 iterations
    for itn in range(100):
        # shuffle examples before training
        random.shuffle(TRAIN_DATA)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=sizes)
        # dictionary to store losses
        losses = {}
        for batch in batches:
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], drop=0.5, losses=losses)
        print("Losses", losses)
We have used nlp.update to make a prediction for each of the training examples. It then checks the prediction against the annotated entities. If the prediction is right it keeps the weights, and if the prediction is wrong it adjusts the weights so that it can make a better prediction next time.
Training is an iterative process in which the model’s predictions are compared against the reference annotations in order to estimate the gradient of the loss. The gradient of the loss is then used to calculate the gradient of the weights through backpropagation. The gradients indicate how the weight values should be changed so that the model’s predictions become more similar to the reference labels over time.
Testing the NER
Now let's test the NER on a new disease-related text and see the results.
# Testing the NER
doc = nlp("Hepatitis is a disease which causes inflammation of the liver and it can also cause jaundice. Tuberculosis is caused by a bacterium called Mycobacterium tuberculosis. AIDS is the late stage of HIV infection that occurs when the body's immune system is badly damaged because of the virus. Typhoid is a bacterial infection that can lead to a high fever, diarrhea, and vomiting. Cancer is a disease in which some of the body's cells grow uncontrollably and spread to other parts of the body. Chikungunya is a viral disease transmitted to humans by infected mosquitoes. Pneumonia is an infection that inflames the air sacs in one or both lungs. Malaria is a disease caused by a parasite.")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
The entities extracted are,
Entities [('Hepatitis', 'DISEASE'), ('Tuberculosis', 'DISEASE'), ('AIDS', 'DISEASE'), ('Cancer', 'DISEASE'), ('Chikungunya', 'DISEASE'), ('Pneumonia', 'DISEASE'), ('Malaria', 'DISEASE')]
Not bad!
As we can see, the model has correctly extracted most of the entities. Although Pneumonia and Chikungunya were not used during training, it has done a good job predicting them as well.
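Once the results look good, the updated pipeline can be saved with nlp.to_disk and reloaded later with spacy.load. A minimal sketch using a blank pipeline with a stand-in component so it runs standalone; in practice, call to_disk on the trained nlp object in exactly the same way:

```python
import tempfile
from pathlib import Path
import spacy

# Blank pipeline with a stand-in component so the round trip is visible;
# replace nlp_save with the trained nlp object in practice
nlp_save = spacy.blank("en")
nlp_save.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "custom_ner_model"
    nlp_save.to_disk(out_dir)         # serialize the whole pipeline
    nlp_loaded = spacy.load(out_dir)  # reload it later with spacy.load
    print(nlp_loaded.pipe_names)      # ['sentencizer']
```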
Hope my article helped you understand how to perform custom Named Entity Recognition using spaCy. Try it yourself on different datasets and let me know your comments and feedback below.
Source code can be found here.
Thank you for reading!