Named Entity Recognition for Healthcare with SparkNLP NerDL
Data Preparation and Model Evaluation
Out-of-the-box or pre-trained named entity recognition (NER) models are available in most natural language processing (NLP) libraries, and are usually used to identify and extract proper names of people, organizations, and brands from a document. In healthcare, however, named entity recognition models are essential for identifying and extracting entities like diseases, tests, treatments, and test results. These “entities” can be analyzed further to aid in important work like identifying clinical trial participants or predicting disease progression.
If you want to learn more about named entity recognition methodologies and research, check out this blog post. Here I will focus on the Conditional Random Field and deep learning methods, both of which are available in the SparkNLP library. The SparkNLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This novel neural network architecture automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering.
SparkNLP NerDL achieves cutting-edge scores on many benchmark healthcare datasets, including a micro-average F1 score of 0.87 on the BC2GM dataset. You need the licensed SparkNLP Clinical embeddings to get those cutting-edge scores on healthcare data, but GloVe embeddings still perform well. I’ll show you how to train and evaluate NerCRF and NerDL models on the BC5CDR-Chem benchmark dataset using GloVe embeddings.
Why is Data Preparation Important?
It’s simple to use a pre-trained named entity recognition model, but usually in healthcare or clinical NLP you need to train your own model to get the best results. If you want to train your own model, preparing training data is one of the most important steps you will need to take. This tutorial will show you how to prepare your healthcare training data and train your own NER model using Python and SparkNLP.
Preparing the Training Data
To train a NerDL or NerCRF model, you will need to put your tokens and entity labels into a space-separated format called CoNLL. A CoNLL file puts each token of a sentence on a different line, and separates each sentence with an empty line. In the following Python example I will annotate one sentence and save it in CoNLL format.
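A minimal sketch of one annotated sentence, using “An apple a day keeps the doctor away” with “An apple a day” tagged as a toy Treatment entity (the POS tags are my own choices):

```python
# One example sentence, annotated token by token.
# "An apple a day" is tagged as a (toy) Treatment entity.
tokens = ["An", "apple", "a", "day", "keeps", "the", "doctor", "away", "."]
pos_tags = ["DT", "NN", "DT", "NN", "VBZ", "DT", "NN", "RB", "."]
labels = ["B-Treatment", "I-Treatment", "I-Treatment", "I-Treatment",
          "O", "O", "O", "O", "O"]
```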
Notice the entity labels above. When an entity has more than one word, the label for the first word should begin with “B-” and the label for the following words should begin with “I-”. Now let’s save the tokens, parts-of-speech, and entity labels in CoNLL format.
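One way to write the sentence out in CoNLL 2003 layout: token, POS tag, chunk tag (the POS tag is reused here as a placeholder), and entity label, with a `-DOCSTART-` header and an empty line after each sentence. The file name is my choice, and the annotations are repeated so the snippet runs on its own:

```python
# Tokens, POS tags, and entity labels for the example sentence.
tokens = ["An", "apple", "a", "day", "keeps", "the", "doctor", "away", "."]
pos_tags = ["DT", "NN", "DT", "NN", "VBZ", "DT", "NN", "RB", "."]
labels = ["B-Treatment", "I-Treatment", "I-Treatment", "I-Treatment",
          "O", "O", "O", "O", "O"]

# CoNLL layout: token, POS, chunk tag (POS reused), entity label.
conll_lines = ["-DOCSTART- -X- -X- O", ""]
for token, pos, label in zip(tokens, pos_tags, labels):
    conll_lines.append(f"{token} {pos} {pos} {label}")
conll_lines.append("")  # an empty line ends the sentence

with open("treatment_example.conll", "w") as f:
    f.write("\n".join(conll_lines))

print("\n".join(conll_lines))
```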
Check out the printed CoNLL above. “An” is the first word in “An apple a day”, so it is labelled “B-Treatment”, while “apple”, “a”, and “day” are all labelled “I-Treatment”. The words that are not part of a “Treatment” entity are labelled with a capital “O”.
Here’s another example of a sentence annotated in CoNLL format. The entity is “blood pressure”.
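A sketch of what that sentence could look like in CoNLL format; the surrounding words and POS tags are hypothetical, but “blood pressure” is labelled as one Test entity:

```python
# A hypothetical sentence in CoNLL format; "blood pressure" is one Test entity.
conll_example = """Her PRP PRP O
blood NN NN B-Test
pressure NN NN I-Test
was VBD VBD O
normal JJ JJ O
. . . O"""
print(conll_example)
```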
As you can see above, ‘blood’ is the first word in the entity, so it is labelled “B-Test”, while ‘pressure’ is the second word in the entity so it is labelled “I-Test”. We do this so the model can tell that “blood pressure” forms one entity instead of the two separate entities “blood” and “pressure”.
How to Convert a Pandas Dataframe to CoNLL Format
In the next example I’ll read from a Pandas dataframe and write a CoNLL file for NerDL. I’ll use the sentence ID (sent_id) column to determine if I need to leave an empty line before a new sentence. Here are the first 5 rows of the dataframe:
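Since the original table isn’t reproducible here, this is an illustrative stand-in with the same shape; only the sent_id column name comes from the article, and the token/label column names and the NCBI-style disease annotation are assumptions:

```python
import pandas as pd

# Illustrative stand-in for the NCBI dataframe; column names other than
# sent_id are assumptions.
ncbi = pd.DataFrame({
    "sent_id": [1, 1, 1, 1, 1],
    "token": ["A", "common", "human", "skin", "tumour"],
    "label": ["O", "O", "O", "B-Disease", "I-Disease"],
})
print(ncbi.head())
```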
For NerDL the part-of-speech column is not used, but a CoNLL file must still contain one. Add a part-of-speech column with ‘NN’ or some other placeholder as the only value. If you already have a part-of-speech column, you can skip this step.
My Pandas dataframe is called ‘ncbi’ and I’ve added a part-of-speech column which I’ve called ‘pos’. Now write a CoNLL file using the columns of the Pandas dataframe as input.
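A sketch of that conversion, with a tiny stand-in for the `ncbi` dataframe so it runs on its own (swap in your real dataframe; the output file name is my choice):

```python
import pandas as pd

# Stand-in for the `ncbi` dataframe with its placeholder `pos` column.
ncbi = pd.DataFrame({
    "sent_id": [1, 1, 2, 2],
    "token": ["Blood", "pressure", "No", "fever"],
    "pos": ["NN", "NN", "NN", "NN"],
    "label": ["B-Test", "I-Test", "O", "O"],
})

with open("ncbi.conll", "w") as f:
    f.write("-DOCSTART- -X- -X- O\n\n")
    prev_sent_id = None
    for row in ncbi.itertuples(index=False):
        # A new sent_id means a new sentence, so write the empty separator line.
        if prev_sent_id is not None and row.sent_id != prev_sent_id:
            f.write("\n")
        f.write(f"{row.token} {row.pos} {row.pos} {row.label}\n")
        prev_sent_id = row.sent_id
```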
If you look at the first 25 lines of the final CoNLL file, you’ll see that empty lines signal the beginning of a new sentence.
Now let’s see SparkNLP’s cutting-edge results! We’ll train NerCRF and NerDL models on the BC5CDR-Chem benchmark dataset.
Training and Evaluating NerCRF
NerCRF is a named entity recognition model in the SparkNLP library based on Conditional Random Fields. It requires part-of-speech tags for model training. To train a model with NerCRF, first import SparkNLP and start your Spark session, then load the CoNLL files.
I will add GloVe embeddings to the dataset before NER training, but if you want better results on your healthcare projects, use the SparkNLP Clinical embeddings. First, set up your pipeline and fit your model to the training dataset. The fitting process can take some time.
Next, run your test dataset through the fitted ner_model to add word embeddings and make your predictions.
You can see all of your input and output columns in the final “predictions” dataframe. I’ll focus on the ‘ner’ column, which contains the predictions, and the ‘label’ column, which contains the ground truth. You can use sklearn.metrics classification_report to check the accuracy of the predictions using these two columns.
Training and Evaluating NerDL
NerDL is a deep learning named entity recognition model in the SparkNLP library which does not require training data to contain parts-of-speech. For a more detailed overview of training a model using NerDL, you can check out this post.
We’ve already loaded the BC5CDR-Chem train and test datasets. Now I can show you how to add GloVe embeddings and save the test data as a parquet file before NerDL model training.
Next, set up the rest of the pipeline by adding the location of the test data parquet file and the folder where your TensorFlow graphs are located. Using “.setEvaluationLogExtended(True)” will output a more detailed model evaluation log. If you get an incompatible TF graph error when you run the training, use NerDL_Graph.ipynb located here to create a graph with the parameters given in the error message. If you’re having trouble with this part of NerDL model training, you should read this post.
Even though the word_embeddings pipe is in a previous cell, it is still part of the pipeline. In the next cell I’ll fit the model to the training set. This could take some time.
You can find the final log in ~/annotator_logs:
For each training epoch your extended log will print two sets of metrics, one for the validation dataset and one for the test dataset (the metrics for the validation data appear first). For each dataset there’s a table showing true positives (tp), false positives (fp), false negatives (fn), precision, recall, and F1 scores for each entity (except ‘O’). Beneath this table you’ll find the macro-average and micro-average precision, recall, and F1 scores for the dataset. So if you’re looking for the micro-average F1 score for the test data, you’ll find it on the last line of the log for each epoch.
Quick recap — if you’ve read this article, you should know how to prepare CoNLL files for training SparkNLP NerDL and NerCRF models using GloVe embeddings. You should also know how to train and evaluate these models using evaluation logs and sklearn metrics.
Overall, our NerDL and NerCRF models didn’t do too badly on the BC5CDR-Chem benchmark dataset enriched with GloVe embeddings. In the 11th epoch the NerDL model’s macro-average F1 score on the test set was 0.86, and after 9 epochs the NerCRF had a macro-average F1 score of 0.88 on the test set. However, using Clinical embeddings instead of GloVe will bring your NerDL micro-average F1 score from 0.887 up to 0.915, much closer to the best published score for this dataset.
You can find all the code for this tutorial here.