
MedCAT | Introduction — Analyzing Electronic Health Records

An introduction to using MedCAT to organise, structure and analyse electronic health records (EHRs). As an example, MedCAT was used in a recent study on ACE inhibitors and COVID-19.

Zeljko
7 min readApr 4, 2020

--

EHRs are a treasure trove of medical information; there is an unbelievable amount of knowledge contained in them. To mention a few data points usually available for each patient: (1) Diseases with symptoms; (2) Medications, often coupled with dosing instructions, possible side-effects and patient feedback; (3) Treatments, sometimes with patient feedback; (4) Risks, assumptions and opinions from clinicians; (5) Patient testimonies; (6) Various measurements and lab results.

When working with EHRs, we first need to structure and organise them. The information they contain is usually available in an unstructured format (free text). For doctors, this is relatively manageable: they can read the documents and extract the information they need. But if we want to use the data for statistical analysis or machine learning, the lack of structure makes it challenging.

Given a structured EHR, some possible research use-cases are: (1) Mortality prediction; (2) Disease risk calculation; (3) Clinical coding; (4) Temporal modelling of diseases or patients; (5) Disease/medication interaction; (6) Detection of adverse drug reactions.

I am, of course, not the first to recognise that the data available in EHRs can greatly benefit both patients and medical institutions. Recently we have seen large moves from Google Health (their data aggregation tool), Amazon (Comprehend Medical) and many others. While it is nice to see that big companies understand the value of EHRs, it will take a long time before they can get full access to hospital data (because of privacy concerns). And even when they do get access, it is questionable whether it will enable research projects or help patients.

My goal here is to show that with a tool like MedCAT, we can structure EHRs of large hospitals in a matter of hours on a personal laptop, without the need for large servers and infrastructures. With this, we can potentially enable/do research projects that could improve healthcare.

The rest of this post is organised as follows. First, we are going to define a small project that will be used to showcase MedCAT. Second, we are going to look at the dataset we are using. And third, we will check out the environment setup.

Please note: (1) Everything is done in Python, so basic knowledge of it is necessary; (2) I will not share any medical datasets but will show where and how to access/get them; (3) Some knowledge of statistics and machine learning will help, but is not necessary; (4) We are focusing on EHRs here; the same approach can be used on other types of documents, but currently the tool is mainly tested on biomedical documents.

Overview of the MedCAT Tutorial

Each of the posts in this tutorial is meant to be a stand-alone post, while still building on top of the same story. Feel free to jump to the post that interests you; it should still be easy to follow and understand (the links will be updated as the tutorials get published).

  1. Introduction — This post
  2. Dataset Analysis and Preparation — This part is more generally concerned with setting up an ML project, analysing datasets, pre-processing text and everything else necessary before we start with modelling.
  3. Extracting Diseases from Electronic Health Records — A deep dive into the MedCAT library in Python and Named Entity Recognition and Linking of medical concepts. Useful if you are only interested in how to use MedCAT.
  4. Supervised Training and the Full MedCAT Pipeline — Exploration of the more advanced parts of the MedCAT library and how to build a complete pipeline for NER+L & MetaAnnotations.
  5. Analysing the Results — An example of what is possible once we have extracted the entities of interest from Electronic Health Records.
  6. Other Tools and Functions of the MedCAT library (ICD10 codes, SNOMED CT) & What to expect in the Future

Introduction — Project definition

Let’s look at an example. Assume we got access to the database of a large hospital; each patient in that hospital has an EHR which contains a lot of free text (an example document is shown below; you can find more at mtsamples). Apart from that, each EHR also includes a couple of structured fields like age, gender and race. Now assume that our project is to show the relation between diseases and age (which can be used to calculate age-related disease risk scores). To do this, we need to know the age of each patient, plus the diseases that appear in their EHR. Extracting the age is easy, as that is a structured field; the problem is the diseases. They are mentioned only in the free text and usually nowhere else. Before we can continue, we need to extract the diseases from each EHR and save them in a structured database.

Please note: (1) There is much more to the disease extraction problem, but we’ll expand as we go. (2) Diseases are just an example, we can do the same with medications, symptoms, procedures or anything else.

A fake example of an EHR, note that even though this is free text, it is significantly more structured than real EHRs — which are a disaster to read/understand.
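To make the target concrete, here is a minimal sketch of the structured output we are after: one row per patient-disease pair, joined with the structured age field. The patient records below are made up for illustration; they are not MedCAT output.

```python
# Hypothetical illustration: ages come from structured fields, while
# the disease lists are what we still need to extract from free text.
patients = {
    "P1": {"age": 64, "diseases": ["breast cancer", "diabetes mellitus"]},
    "P2": {"age": 52, "diseases": ["hypertension"]},
}

# Flatten into (patient_id, age, disease) rows, ready for analysis
# of the disease-age relation.
rows = [
    (pid, info["age"], disease)
    for pid, info in patients.items()
    for disease in info["diseases"]
]

for row in rows:
    print(row)
```

Once every EHR is reduced to rows like these, computing age-related disease statistics becomes a straightforward group-by.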

A formal definition of the disease extraction problem

What we want to achieve is known in Natural Language Processing (NLP) as Named Entity Recognition and Linking (NER+L). NER means detecting an entity in the text (e.g. a medical term, the second step in Figure 1), while L means linking the recognised entity to a concept in a biomedical database (e.g. UMLS, the third step in Figure 1).

The linking part is essential because it allows us to standardise and organise the detected entities, as multiple recognised entities can link to the same medical concept in a biomedical database. For example, in an EHR we can have:

  • The patient was diagnosed with malignant neoplasm of breast
  • Previous medical history includes breast cancer
  • Reason for admission: breast CA

Each one of the concepts in bold is the same disease, just written differently. If we do not standardise the detected entities, it becomes difficult to calculate statistics such as how many patients have breast cancer.
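The standardisation step can be sketched as a lookup from surface forms to a concept identifier. The tiny vocabulary below is hand-made for illustration (in practice MedCAT resolves these mappings against UMLS); C0006142 is the UMLS concept for malignant neoplasm of breast.

```python
# Toy surface-form -> concept (CUI) lookup. All three spelling
# variants of the disease map to the same UMLS concept.
surface_to_cui = {
    "malignant neoplasm of breast": "C0006142",
    "breast cancer": "C0006142",
    "breast ca": "C0006142",
}

mentions = [
    "malignant neoplasm of breast",
    "breast cancer",
    "breast CA",
]

# Normalising every mention lets us count patients per concept
# instead of per spelling variant.
cuis = [surface_to_cui[m.lower()] for m in mentions]
print(cuis)  # all three variants resolve to the same concept
```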

Furthermore, if we link an entity to a biomedical database we have access to all the structured fields in that database (Figure 2).

A small portion of a biomedical database (UMLS).

Once we have detected and linked entities to our biomedical database, we can, for example, filter entities based on the Semantic Type field or find all entities that link to the biomedical concept with ID C0006142. An overview of all semantic types available in UMLS can be found here (note that one of the semantic types is Diseases, which is exactly what we need).
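With linking in place, filtering by a structured field becomes a simple lookup. A sketch with a hand-made two-concept snippet standing in for the full UMLS table (the CUIs are real UMLS identifiers, but the dictionary itself is illustrative):

```python
# Tiny stand-in for a biomedical database: CUI -> structured fields.
concept_db = {
    "C0006142": {"name": "Malignant neoplasm of breast",
                 "semantic_type": "Neoplastic Process"},
    "C0020538": {"name": "Hypertensive disease",
                 "semantic_type": "Disease or Syndrome"},
}

# Entities detected and linked in a document (represented as CUIs).
detected = ["C0006142", "C0020538", "C0006142"]

# Keep only entities whose semantic type marks them as diseases.
disease_types = {"Disease or Syndrome", "Neoplastic Process"}
diseases = [cui for cui in detected
            if concept_db[cui]["semantic_type"] in disease_types]
print(diseases)
```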

Please note: For the biomedical database we’ll be using UMLS, as it is the largest one available, with more than 4.2M medical concepts. There are many other biomedical databases, but UMLS fits our needs perfectly here, as we want to extract all possible diseases.

The dataset — MIMIC-III

MIMIC-III is an openly available dataset developed by the MIT Lab for Computational Physiology. It includes clinical notes, demographics, vital signs, laboratory tests and more.

I’ve chosen this dataset as it is one of the few openly available datasets that contain EHRs. The dataset cannot be downloaded directly; one needs to submit a request first, which is usually approved within a couple of days.

Environment setup (if running locally)

We’ll be using Python as the primary programming language; some plots later on will be done using R or with the help of JavaScript, but that is mainly to make them fancier.

I will use Python 3.7 for everything; most likely any 3.5+ version will work fine.

If you are following this tutorial on your local machine, it is recommended to start a new Python virtual environment using:

python3 -m venv medcat
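The command above only creates the environment; it still needs to be activated before anything is installed into it. On Linux or macOS (assuming a bash/zsh shell) the two steps together look like:

```shell
# Create the virtual environment and activate it
# (on Windows, run medcat\Scripts\activate instead of source).
python3 -m venv medcat
source medcat/bin/activate
```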

Once that is done, you can clone the MedCAT repository and go into the tutorial directory. From there, run (don’t forget to activate the medcat environment):

pip install -r requirements.txt

As MedCAT is built on top of spaCy/scispaCy, you will also need to download the language model using:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz

Google Colab

All the code is also available on Google Colab; you can find links to the notebooks in the tutorial section of the repository. This is the easiest way to follow this tutorial, as everything is already configured and premade. But please note that the Colab notebooks will not use real data (MIMIC-III), only publicly available datasets and generated dummy data.


Zeljko

Research Fellow in Health Informatics at King’s College London | Twitter: https://twitter.com/zeljkokr