Clinical Named Entity Recognition Using spaCy
A Healthcare Domain-Specific NER Method with spaCy
As described in my previous article [1], Natural Language Processing (NLP) has become a hot topic in both AI research and applications because text (e.g., English sentences) is a major type of natural language data. There are many AI methods for text data in the healthcare domain, such as clinical text classification and clinical Named Entity Recognition (NER).
In [1] I used an open source clinical text dataset [2][3] to present some of the common machine learning and deep learning methods for clinical text classification.
In this article, I use the same dataset to demonstrate how to implement a healthcare domain-specific Named Entity Recognition (NER) method using spaCy [4].
The new NER method consists of the following steps:
- preprocessing dataset
- defining domain-specific entities and types
- generating annotated dataset
- creating and training NER model
- evaluating NER model
- extending NER model
- recognizing and visualizing named entities
1. Preprocessing Dataset
After downloading the mtsamples.csv file from Kaggle [3], the dataset can be loaded into memory as a Pandas DataFrame as follows:
import pandas as pd

raw_df = pd.read_csv('./medical-nlp/data/mtsamples.csv', index_col=0)
raw_df.head()
There are 40 unique categories of medical specialties in the dataset. As described in [1], the number of categories can be reduced from 40 to 9 by filtering the dataset in various ways.
The filtered dataset is further preprocessed in this article by converting the transcription column to lower case, which simplifies entity matching in the demonstration below.
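The filtering steps themselves are covered in [1] and are not repeated here; the minimal sketch below only illustrates the lower-casing step, assuming filtered_df is the already-filtered DataFrame (the variable name is mine, not from the original code):

# Minimal sketch (not the original code): lower-case the transcription column
# of the already-filtered DataFrame, here assumed to be named filtered_df.
filtered_df['transcription'] = filtered_df['transcription'].astype(str).str.lower()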
2. Defining Domain-Specific Entities and Types
A NER model in spaCy is a supervised deep learning model, so labeled entities are required for each document in the dataset for model training and testing.
Typically a text annotation tool such as Prodigy is used to annotate entities with types in documents. This article uses an alternative method that automatically generates an annotated text dataset. To this end, the entities (not their locations in documents) and their types need to be specified.
For the purpose of demonstration, it is assumed that we are interested in the following entities (diseases or medications) and types (medical specialties):
The entities for each entity type are stored as a Python set in the implementation. As an example, the following is the set for the surgery type:
surgery = set(['acute cholangitis', 'appendectomy',
               'appendicitis', 'cholecystectomy',
               'laser capsulotomy', 'surgisis xenograft',
               'sclerotomies', 'tonsillectomy'
               ])
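The remaining three entity sets are defined in the same way. The terms below are illustrative placeholders of my own choosing, not the article's actual lists:

# Illustrative placeholders only; the article's actual term lists are not reproduced here.
internalMedicine = set(['hypertension', 'diabetes mellitus',
                        'congestive heart failure'])
medication = set(['metformin', 'lisinopril', 'albuterol'])
obstetricsGynecology = set(['cesarean section', 'hysterectomy',
                            'preeclampsia'])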
3. Generating Annotated Dataset
As described in the previous section, in order to train and test a spaCy NER model, each of the text documents that contain entities needs to be labeled with text annotations in the following template:
('document text', {'entities': [(start, end, type), …, (start, end, type)]})
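For example, a short (hypothetical) document containing a single surgery entity would be annotated as:

# Hypothetical example: 'appendectomy' spans characters 25 to 37 of the text.
('the patient underwent an appendectomy without complications.',
 {'entities': [(25, 37, 'surgery')]})

Here 25 and 37 are the start and end character offsets of the entity 'appendectomy'.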
Once the entities and types are identified (see the previous section), an annotated text dataset can be automatically generated as follows:
- create a spaCy entity ruler model using the identified entities and types
- use the entity ruler model to find the location and type of each entity from a given document
- use the identified location and type and the template above to label the given document
A spaCy entity ruler model can be created in three steps (see the __init__ method in the class RulerModel below):
- create an empty model for a given language (e.g., English)
- add an entity_ruler pipeline component into the model
- create entity rules and add them into the entity_ruler pipeline component
import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

class RulerModel():
    def __init__(self, surgery, internalMedicine, medication, obstetricsGynecology):
        self.ruler_model = spacy.blank('en')
        self.entity_ruler = self.ruler_model.add_pipe('entity_ruler')

        total_patterns = []
        patterns = self.create_patterns(surgery, 'surgery')
        total_patterns.extend(patterns)
        patterns = self.create_patterns(internalMedicine, 'internalMedicine')
        total_patterns.extend(patterns)
        patterns = self.create_patterns(medication, 'medication')
        total_patterns.extend(patterns)
        patterns = self.create_patterns(obstetricsGynecology, 'obstetricsGynecology')
        total_patterns.extend(patterns)

        self.add_patterns_into_ruler(total_patterns)
        self.save_ruler_model()

    def create_patterns(self, entity_type_set, entity_type):
        patterns = []
        for item in entity_type_set:
            pattern = {'label': entity_type, 'pattern': item}
            patterns.append(pattern)
        return patterns

    def add_patterns_into_ruler(self, total_patterns):
        self.entity_ruler.add_patterns(total_patterns)
Once the entity ruler model has been created, it is saved to disk for later use by the save_ruler_model() method called at the end of __init__:

    def save_ruler_model(self):
        self.ruler_model.to_disk('./model/ruler_model')
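A rough usage sketch, assuming the four entity sets from Section 2 are defined (the variable names ruler_wrapper and ruler_model are mine):

# Hypothetical usage: build the ruler model from the four entity sets and
# keep a handle to the underlying spaCy pipeline for the next step.
ruler_wrapper = RulerModel(surgery, internalMedicine,
                           medication, obstetricsGynecology)
ruler_model = ruler_wrapper.ruler_model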
The method assign_labels_to_documents() in the class GenerateDataset below uses the above entity ruler model to find the location and type of each entity and uses them to generate the annotated dataset:
class GenerateDataset(object):
    def __init__(self, ruler_model):
        self.ruler_model = ruler_model

    def find_entitytypes(self, text):
        ents = []
        doc = self.ruler_model(str(text))
        for ent in doc.ents:
            ents.append((ent.start_char, ent.end_char, ent.label_))
        return ents

    def assign_labels_to_documents(self, df):
        dataset = []
        text_list = df['transcription'].values.tolist()
        for text in text_list:
            ents = self.find_entitytypes(text)
            if len(ents) > 0:
                dataset.append((text, {'entities': ents}))
            else:
                continue
        return dataset
Once an annotated text dataset has been generated, it can be divided into subsets for model training, validation, and testing.
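The article does not show the split code; one simple option is scikit-learn's train_test_split, assuming dataset is the list returned by assign_labels_to_documents() (the ratios below are my own choice):

from sklearn.model_selection import train_test_split

# Illustrative split: 80% training, 10% validation, 10% testing.
train_data, holdout = train_test_split(dataset, test_size=0.2, random_state=42)
valid_data, test_data = train_test_split(holdout, test_size=0.5, random_state=42)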
4. Creating and Training NER Model
Similar to the entity ruler model, a spaCy NER model can be created in two steps (see the __init__ method of the class NERModel below):
- create an empty model for a given language (e.g., English)
- add a ner pipeline component into the model
Once a NER model has been created, the method fit() of the class NERModel can then be called to train the model using the model training dataset.
The model training follows the typical deep learning model training process:
- set the number of epochs (default to 10 in this article)
- shuffle the training data and divide it into mini-batches for each epoch
- set model hyper-parameters such as the dropout rate
- get the spaCy optimizer and use it to compute the loss and update the weights via back-propagation
- show the model training progress (using the tqdm library in this article)
import spacy
from spacy.util import minibatch
from spacy.scorer import Scorer
from spacy.training import Example
from tqdm import tqdm
import random

class NERModel():
def __init__(self, iterations=10):
self.n_iter = iterations
self.ner_model = spacy.blank('en')
self.ner = self.ner_model.add_pipe('ner', last=True)
def fit(self, train_data):
for text, annotations in train_data:
for ent_tuple in annotations.get('entities'):
self.ner.add_label(ent_tuple[2])
other_pipes = [pipe for pipe in self.ner_model.pipe_names
if pipe != 'ner']
self.loss_history = []
train_examples = []
for text, annotations in train_data:
train_examples.append(Example.from_dict(
self.ner_model.make_doc(text), annotations))
with self.ner_model.disable_pipes(*other_pipes):
optimizer = self.ner_model.begin_training()
for iteration in range(self.n_iter):
print(f'---- NER model training iteration {iteration + 1} / {self.n_iter} ... ----')
random.shuffle(train_examples)
train_losses = {}
batches = minibatch(train_examples,
size=spacy.util.compounding(4.0, 32.0, 1.001))
batches_list = [(idx, batch) for idx, batch in
enumerate(batches)]
for idx, batch in tqdm(batches_list):
self.ner_model.update(
batch,
drop=0.5,
losses=train_losses,
sgd=optimizer,
)
self.loss_history.append(train_losses)
print(train_losses)
    def accuracy_score(self, test_data):
        examples = []
        scorer = Scorer()
        for text, annotations in test_data:
            pred_doc = self.ner_model(text)
            try:
                example = Example.from_dict(pred_doc, annotations)
            except Exception:
                print(f'Error: failed to process document:\n{text}'
                      f'\n\nannotations: {annotations}')
                continue
            examples.append(example)
        accuracy = scorer.score(examples)
        return accuracy
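A rough end-to-end training sketch, assuming train_data is the training subset produced in Section 3:

# Create the NER model and train it for the default 10 iterations.
ner_model = NERModel(iterations=10)
ner_model.fit(train_data)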
The following is a screenshot of the spaCy NER model training output:
The loss history has been saved in the fit() method of the NERModel class during model training.
from matplotlib import pyplot as plt

loss_history = [loss['ner'] for loss in ner_model.loss_history]
plt.title("Model Training Loss History")
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.plot(loss_history)
The above code plots the following loss history diagram using the saved loss history data.
Once a NER model has been trained, it can be saved into file for later usage as follows:
ner_model.ner_model.to_disk('./model/ner_model')
5. Evaluating NER Model
Once a spaCy NER model has been trained, the accuracy_score() method of the NERModel class can be called with the model testing dataset to get the model accuracy results.
The accuracy result is a Python dictionary that contains the precision, recall, and F1 score for tokens, entities, and each entity type. Such a dictionary can be formatted as a Pandas DataFrame.
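The original formatting code is not reproduced here; the sketch below shows one possible way to tabulate the scores, using the keys returned by spaCy's Scorer:

import pandas as pd

# Score the model on the test set (the keys below come from spaCy's Scorer).
accuracy = ner_model.accuracy_score(test_data)
print(accuracy['ents_p'], accuracy['ents_r'], accuracy['ents_f'])  # overall entity scores
per_type_df = pd.DataFrame(accuracy['ents_per_type']).transpose()  # one row per entity type
per_type_df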
6. Extending NER Model
As described before, in this article an entity ruler model is created and used to automatically generate the annotated text dataset.
The following code shows how to extend the trained NER model by combining the same entity ruler model and the NER model into one pipeline. This is useful because an entity recognized by the entity ruler model can be missed by the NER model due to limitations of the training data and/or training time.
from spacy.language import Language

def extend_model(surgery, internalMedicine,
                 medication, obstetricsGynecology):
    ruler_model = spacy.load('./model/ruler_model')
    base_ner_model = spacy.load('./model/ner_model')

    @Language.component("my_entity_ruler")
    def ruler_component(doc):
        doc = ruler_model(doc)
        return doc

    for entity_type_set in [surgery, internalMedicine,
                            medication, obstetricsGynecology]:
        for item in entity_type_set:
            base_ner_model.vocab.strings.add(item)

    if 'my_entity_ruler' in base_ner_model.pipe_names:
        base_ner_model.remove_pipe('my_entity_ruler')
    base_ner_model.add_pipe("my_entity_ruler", before='ner')

    return base_ner_model
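For example, the extended model can be built from the same four entity sets and then used in place of the plain NER model:

# Combine the saved entity ruler and NER models into one extended pipeline.
extended_ner_model = extend_model(surgery, internalMedicine,
                                  medication, obstetricsGynecology)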
7. Recognizing and Visualizing Named Entities
Once a spaCy NER model has been trained and/or extended, we can use it to recognize named entities in a given text document as follows:
doc = extended_ner_model(text)
The recognized entities can be accessed via doc.ents.
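For example, each recognized entity can be printed together with its character span and type:

# Print each recognized entity with its character span and type.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)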
The function below displays the recognized named entities with color encoding in a Jupyter notebook:
from spacy import displacy

def display_doc(doc):
    colors = {"surgery": "pink",
              "internalMedicine": "orange",
              "medication": "lightblue",
              "obstetricsGynecology": "lightgreen",
              }
    options = {"ents": ["surgery",
                        "internalMedicine",
                        "medication",
                        "obstetricsGynecology",
                        ],
               "colors": colors
               }
    displacy.render(doc, style='ent',
                    options=options, jupyter=True)

display_doc(doc)
The following shows the recognized entities in two sample documents:
8. Conclusion
In this article, I used the same dataset [2][3] as in [1] to show how to implement a healthcare domain-specific Named Entity Recognition (NER) method using spaCy [4]. In this method, a set of medical entities and types was first identified; a spaCy entity ruler model was then created and used to automatically generate an annotated text dataset for model training and testing; next, a spaCy NER model was created and trained; and finally, the same entity ruler model was used to extend the capability of the trained NER model. The results demonstrate that this NER method is effective in recognizing clinical domain-specific named entities.
References
[1] Y. Huang, Common Machine Learning and Deep Learning Methods for Clinical Text Classification
[2] MTSamples: https://www.mtsamples.com
[3] T. Boyle, https://www.kaggle.com/tboyle10/medicaltranscriptions
[4] spaCy: https://spacy.io/
ACKNOWLEDGEMENT: I would like to thank MTSamples and Kaggle for the dataset.