Named Entity Recognition for Clinical Text

Use pandas to reformat the 2011 i2b2 dataset in order to train a deep learning natural language processing model

Nicole Janeway Bills
Atlas Research
7 min read · Sep 15, 2020

--


The datasets from the i2b2 challenge series can be used to support supervised learning in the clinical domain. This resource, currently maintained by Harvard University’s Department of Biomedical Informatics, helps to fill the gap for data scientists who want to tackle challenges related to text data in the healthcare field but don’t have access to de-identified hospital records.

i2b2 stands for Informatics for Integrating Biology & the Bedside, the organization that sponsored a series of text-focused machine learning challenges from 2006–2018. Just to confuse everyone, Harvard now refers to this collection of datasets as n2c2, which stands for National NLP Clinical Challenges. Access requires registration and submission of a data use agreement.

Background on the 2011 i2b2 Challenge

In 2011, i2b2 sponsored a joint challenge with the U.S. Department of Veterans Affairs (VA) on a natural language processing (NLP) task. NLP refers to the field of data science that brings us insights through sentiment analysis, topic modeling, and named entity recognition (NER).

The dataset provided through this initiative would be useful to modern researchers interested in understanding where pronouns, tests, and problems appear in clinical notes.

NER refers to the task of picking out relevant entities from text.

Participants in the 2011 i2b2/VA challenge were provided data from Partners HealthCare, Beth Israel Deaconess Medical Center (MIMIC Database), University of Pittsburgh, and the Mayo Clinic to train a coreference resolution algorithm. The data from Partners HealthCare and MIMIC remains available to today’s researchers as they work through their own solutions to entity extraction or coreference.

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for many higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.

The 2011 i2b2 dataset is composed of clinical notes that have been de-identified (i.e., all protected health information (PHI) has been removed). The clinical notes are provided as .txt files, along with named entities in .txt.con files (“con” stands for “concepts”). The .txt.con files are structured in the following format: c=“entity” offset||t=“type”.

i2b2 format
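To make the annotation format concrete, here's a minimal sketch of pulling apart one line with a regular expression. The sample line is invented, but it follows the c=/t= pattern above, where each offset is a line:token position in the clinical note:

```python
import re

# A made-up annotation in the i2b2 .con style:
# c="entity text" start_line:start_token end_line:end_token||t="type"
line = 'c="atrial fibrillation" 12:3 12:4||t="problem"'

pattern = re.compile(
    r'c="(?P<entity>.*?)" (?P<start>\d+:\d+) (?P<end>\d+:\d+)\|\|t="(?P<type>.*?)"'
)
match = pattern.match(line)

entity = match.group("entity")                               # 'atrial fibrillation'
start_line, start_token = map(int, match.group("start").split(":"))
end_line, end_token = map(int, match.group("end").split(":"))
concept_type = match.group("type")                           # 'problem'
```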

This is a major pain if your chosen model expects a different format for the training data. For instance, CoNLL is another common structure for text entities. The schema gets its name from the Conference on Computational Natural Language Learning. In this presentation, each token (here, an individual word or punctuation mark) sits on its own line, so the entire tokenized document appears in the first column. The entity tag sits in the last column.

CoNLL format
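For comparison, here's the same invented sentence rendered as a CoNLL-style table, built as a small pandas DataFrame. The column names are my own placeholders, not necessarily the original's:

```python
import pandas as pd

# One token per row; the NER tag sits in the last column (BIO scheme)
conll_df = pd.DataFrame(
    {
        "token":   ["Patient", "has", "atrial", "fibrillation", "."],
        "POS_tag": ["NN", "VBZ", "JJ", "NN", "."],
        "chunk":   ["B-NP", "B-VP", "B-NP", "I-NP", "O"],
        "NER_tag": ["O", "O", "B-problem", "I-problem", "O"],
    }
)
```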

A custom parser is required to transform the data from i2b2's entity-only, offset-based annotation format into CoNLL’s all-token, table-based format. Luckily, we’ve written one for you.

Intro to Parsing 2011 (or 2010) i2b2 to CoNLL format

The steps outlined below are derived with gratitude from the 2009 i2b2 parser created by Maximilian Hofer. Each year, i2b2 changed the data format slightly to suit the challenge at hand, so the parsing approach changes slightly as well. I’ve modified the original code so the user won’t need to change any file names in the 2011 dataset. I’ve also attempted to add abundant comments so that the novice programmer can use this exercise to learn about pandas.

Although our parser was developed specifically for the 2011 dataset, the 2010 data shares the same format (though with slightly different file and folder naming conventions), so the parser should extend to wrangling that year’s data as well.

The Parser

As you’d expect, the first step is to bring in the relevant import statements. In the cell below we import some NLP standby packages. As we’ll see, glob is a handy library for collecting files whose paths match a pattern.
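The import cell might look something like this (the exact package set is my best guess at the original's):

```python
import glob          # pathname-based file matching
import pandas as pd  # dataframes for all the wrangling below
import nltk          # tokenization, POS tagging, and chunking
```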

I recommend updating pandas’ display options to remove the limit on how many characters appear within a single cell and to show up to 300 rows. This makes the dataframes much easier to read. Note that to avoid crashing your Jupyter Notebook, you should keep some upper limit (e.g., 300) on the number of rows pandas will render at one time.
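Those two settings look like this:

```python
import pandas as pd

# Show full cell contents (no character truncation)...
pd.set_option("display.max_colwidth", None)
# ...but cap displayed rows so a large dataframe can't freeze the notebook
pd.set_option("display.max_rows", 300)
```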

Next, specify where you’ve stored the i2b2 files.
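Something along these lines, with glob collecting the two sets of files. The directory layout below is hypothetical; point the paths at wherever you unpacked the download:

```python
import glob
import os

# Hypothetical layout: annotations under concepts/, notes under docs/
data_dir = "i2b2_2011"
concept_files = sorted(glob.glob(os.path.join(data_dir, "concepts", "*.txt.con")))
document_files = sorted(glob.glob(os.path.join(data_dir, "docs", "*.txt")))
```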

Now we’re going to loop through all the files — both the annotations “a” and the entire documents “e.” Reminder that annotations are stored in i2b2 format: c=“entity” offset||t=“type”. The documents are clinical notes from Partners HealthCare and Beth Israel Deaconess Medical Center.

Ideally, all the a_ids will match e_ids, and you’ll end up with the same count of files that you started with.
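A condensed sketch of that matching step, using invented filenames in place of the real glob results:

```python
import os

# Invented filenames standing in for the real glob output
concept_files = ["concepts/record-1.txt.con", "concepts/record-2.txt.con"]
document_files = ["docs/record-1.txt", "docs/record-2.txt"]

# Strip directories and extensions to get comparable document ids
a_ids = {os.path.basename(f).replace(".txt.con", "") for f in concept_files}
e_ids = {os.path.basename(f).replace(".txt", "") for f in document_files}

# Every annotation file should pair with exactly one document
assert a_ids == e_ids
assert len(a_ids) == len(concept_files)
```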

At this point, we can run simple summary statistics on the corpus of annotations to see how many entities fall into each type (i.e., “pronoun,” “problem,” or “test.”) The last entry in each i2b2 row contains the tag for the entity, so we first grab that string with indexing. The second part of this snippet turns the count of entities associated with each type tag into a dictionary.
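Sketched with a handful of invented annotation lines, the tally might look like this:

```python
# Invented annotation lines in i2b2 style
annotation_lines = [
    'c="atrial fibrillation" 12:3 12:4||t="problem"',
    'c="echocardiogram" 15:1 15:1||t="test"',
    'c="she" 16:0 16:0||t="pronoun"',
    'c="chest pain" 20:2 20:3||t="problem"',
]

# The type tag is the last quoted string on each line
types = [line.split('t="')[-1].rstrip('"') for line in annotation_lines]

# Tally entities per type into a dictionary
type_counts = {t: types.count(t) for t in set(types)}
```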

Back to building our parser, let’s import the text of the files.
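The read step boils down to a loop like the one below — here writing one tiny stand-in document to a temporary directory so the sketch runs on its own:

```python
import os
import tempfile

# Stand-in for the real corpus: write one tiny document, then read it back
corpus = {}
with tempfile.TemporaryDirectory() as docs_dir:
    path = os.path.join(docs_dir, "record-1.txt")
    with open(path, "w") as f:
        f.write("Patient has atrial fibrillation .\n")
    for fname in sorted(os.listdir(docs_dir)):
        doc_id = fname.replace(".txt", "")
        with open(os.path.join(docs_dir, fname)) as f:
            corpus[doc_id] = f.read()
```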

Next, set up empty dataframes that will ultimately house the text and annotations.
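For instance (the column schemas here are my own guess — one row per annotated token, one row per document token):

```python
import pandas as pd

# Hypothetical schemas for the two frames we'll fill in below
annotations_df = pd.DataFrame(
    columns=["doc_id", "row", "offset", "token", "NER_tag"]
)
entries_df = pd.DataFrame(columns=["doc_id", "row", "offset", "token"])
```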

In this longish cell block below, we fill in the empty dataframes we just created with info from the corpus of annotations.

Although you may not always need them, I’ve left in the code for BIO tags. “B-” represents “beginning” and marks the first token of each entity. If an entity spans more than one token, subsequent tokens are prepended with “I-” for “inside.” Everything else is considered “O” for “outside.” This tagging scheme is also referred to as I-O-B. And you would think the NLP community would be better at naming things…

For each document in the corpus of annotations (i.e., the .txt.con files from the “concepts” folder), this code iterates through each row, splitting and assigning to the correct column of annotations_df.
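Putting the pieces together, a condensed sketch of that loop. The annotation lines are invented, the column layout is mine, and entities spanning multiple lines of the note would need extra handling beyond this sketch:

```python
import re
import pandas as pd

annotation_lines = [
    'c="atrial fibrillation" 12:3 12:4||t="problem"',
    'c="echocardiogram" 15:1 15:1||t="test"',
]
pattern = re.compile(
    r'c="(?P<entity>.*?)" (?P<sl>\d+):(?P<st>\d+) (?P<el>\d+):(?P<et>\d+)\|\|t="(?P<type>.*?)"'
)

rows = []
for line in annotation_lines:
    m = pattern.match(line)
    tokens = m.group("entity").split()
    for i, token in enumerate(tokens):
        rows.append(
            {
                "doc_id": "record-1",
                "row": int(m.group("sl")),         # line number in the note
                "offset": int(m.group("st")) + i,  # token position within the line
                "token": token,
                # BIO: "B-" opens an entity, "I-" continues it
                "NER_tag": ("B-" if i == 0 else "I-") + m.group("type"),
            }
        )

annotations_df = pd.DataFrame(rows)
```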

Now, let’s do the same with the entries from the “docs” folder. You’ll notice we’re adding “-DOCSTART-” and “-EMPTYLINE-” tags to preserve segmentation of the text.

At the end of this block, we’ll check how many total entities are in the clinical notes by counting the number that have the “B-” tag.
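A self-contained sketch of both ideas, with an invented document and an invented tag list standing in for the real corpus:

```python
import pandas as pd

documents = {"record-1": "Patient has atrial fibrillation .\n\nFollow up in two weeks .\n"}

rows = []
for doc_id, text in documents.items():
    # Mark the start of each document so segmentation survives the merge
    rows.append({"doc_id": doc_id, "row": 0, "offset": 0, "token": "-DOCSTART-"})
    for row_num, line in enumerate(text.split("\n"), start=1):
        tokens = line.split() or ["-EMPTYLINE-"]  # keep blank lines as explicit markers
        for offset, token in enumerate(tokens):
            rows.append({"doc_id": doc_id, "row": row_num, "offset": offset, "token": token})

entries_df = pd.DataFrame(rows)

# Entity count = number of "B-" tags (invented tag list for illustration)
ner_tags = ["B-problem", "I-problem", "B-test", "O", "O"]
n_entities = sum(tag.startswith("B-") for tag in ner_tags)
```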

In the cell below, we’ll do some quick data cleaning to ensure all columns are composed of the correct data type. Then we can merge the two corpora (clinical notes text and annotations) and check for NaNs in the result.

The only NaNs should be in the NER_tag column. If that’s the case, let’s go ahead and replace them with “O” for “outside,” referring to tokens that are not named entities. Then we can check how many entities we have compared to the total number of tokens.
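The cleaning, merge, and fill steps can be sketched as follows on two tiny invented frames; the join keys mirror the doc_id/row/offset columns used above:

```python
import pandas as pd

entries_df = pd.DataFrame(
    {"doc_id": ["record-1"] * 5, "row": [12] * 5, "offset": [0, 1, 2, 3, 4],
     "token": ["Patient", "has", "atrial", "fibrillation", "."]}
)
annotations_df = pd.DataFrame(
    {"doc_id": ["record-1"] * 2, "row": [12, 12], "offset": [2, 3],
     "NER_tag": ["B-problem", "I-problem"]}
)

# Make the join keys the same dtype on both sides before merging
for df in (entries_df, annotations_df):
    df["row"] = df["row"].astype(int)
    df["offset"] = df["offset"].astype(int)

# Left-join so every token survives; unannotated tokens get NaN tags
merged = entries_df.merge(annotations_df, on=["doc_id", "row", "offset"], how="left")
assert merged["NER_tag"].isna().sum() == 3  # only non-entity tokens are NaN

# Tokens outside any entity get the "O" tag
merged["NER_tag"] = merged["NER_tag"].fillna("O")
```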

Now let’s get ready to add a part of speech tag to the text. Once again, we’ll sanity check the totals.

The following code block uses NLTK’s chunking and parsing functions to fill out the syntactic chunk column of the CoNLL output. We’ll chunk noun phrases first, then verb phrases.
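A minimal sketch of that idea with NLTK's RegexpParser, fed pre-tagged tokens and using a deliberately simple grammar (the real notebook's grammar may differ):

```python
import nltk

# Pre-tagged tokens, so no tagger model download is needed for this sketch
tagged = [("Patient", "NN"), ("has", "VBZ"), ("atrial", "JJ"),
          ("fibrillation", "NN"), (".", ".")]

# Noun phrases first, then verb phrases
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  VP: {<VB.*>+}
"""
tree = nltk.RegexpParser(grammar).parse(tagged)

# Flatten the chunk tree into per-token B-/I-/O chunk tags
chunk_tags = []
for subtree in tree:
    if isinstance(subtree, nltk.Tree):
        label = subtree.label()
        chunk_tags += ["B-" + label] + ["I-" + label] * (len(subtree) - 1)
    else:
        chunk_tags.append("O")
```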

At this point, I thought I was finished. Not quite! My model actually expected input as sentences separated by blank lines. If you also need to add blanks, follow these remaining steps:

First, assemble a list containing the index of the last token in each sentence.
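Treating sentence-final punctuation as the boundary (a simplification), that list can be built straight from the token column:

```python
import pandas as pd

tokens = ["Patient", "has", "atrial", "fibrillation", ".",
          "Follow", "up", "in", "two", "weeks", "."]
df = pd.DataFrame({"token": tokens})

# Index of the last token in each sentence
sentence_ends = df.index[df["token"].isin([".", "!", "?"])].tolist()
```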

Next, set up a blank df composed of blank rows. The index of each row corresponds to each index in the list we created in the last step, just with an offset of 0.5. After concatenating the two dataframes and sorting the index, each sentence will now be followed by a blank row. Finally, reset the index and fill NaNs with “”.
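The 0.5-offset trick in miniature — blank rows indexed halfway between sentences slot in after each sentence once the index is sorted:

```python
import pandas as pd

df = pd.DataFrame({"token": ["Patient", "has", "arrived", ".", "Thanks", "."]})
sentence_ends = [3, 5]  # last-token indices from the previous step

# Blank rows indexed between sentences, so sorting slots them in place
blanks = pd.DataFrame(index=[i + 0.5 for i in sentence_ends], columns=df.columns)
df = pd.concat([df, blanks]).sort_index().reset_index(drop=True)

# NaNs in the blank rows become empty strings
df = df.fillna("")
```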

The final step in the process is to export your data! If you’d like to train-test split at this point, I’ve included the code below.
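One way to sketch the 70/15/15 split and export on an invented frame; splitting by row position as below is a simplification (a sentence-aware split would avoid cutting a sentence in half):

```python
import pandas as pd

df = pd.DataFrame({"token": [f"tok{i}" for i in range(100)], "NER_tag": ["O"] * 100})

# 70% train / 15% validation / 15% test, split by position
n = len(df)
train = df.iloc[: int(n * 0.70)]
valid = df.iloc[int(n * 0.70): int(n * 0.85)]
test = df.iloc[int(n * 0.85):]

# Space-separated CoNLL-style export, one token per line
for name, split in [("train", train), ("valid", valid), ("test", test)]:
    split.to_csv(f"{name}.txt", sep=" ", header=False, index=False)
```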

Phew — and now, really, we’re done. i2b2 format has been parsed into CoNLL format and exported as a series of txt files.

First, we set up corpora for annotations and entries. The annotations corpus contains row and offset information for the named entities. The entries corpus comprises the full text. We added rows marking empty lines (-EMPTYLINE-) and the start of each new document (-DOCSTART-).

Second, we joined the annotations and entries dataframes on document id along with the row and offset columns. We then tagged the resulting dataframe with part-of-speech and syntactic chunk tags.

Finally, we updated the parser to add blank rows to mark the end of sentences before splitting into 70% training, 15% validation, and 15% testing data. The output of the script is a set of .txt files: train.txt, valid.txt, and test.txt.

pandas goes to med school

Hopefully, you learned something from all that text wrangling. The i2b2 datasets really are a wonderful resource for the NLP community working in the healthcare field. Now you have one more tool for training your machine learning models to promote a down-the-line positive impact on patient care.

Nicole Janeway Bills
Founder of datastrategypros.com where we help busy professionals ace the Certified Data Management Professional (CDMP) exam and other data-related exams.