Predictions using the MIMIC-III Database (Part 1)

Pallab Paul
Intel Student Ambassadors
6 min read · Feb 4, 2019

MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). More information about the MIMIC-III database, as well as instructions for requesting access, can be found on Physionet’s website (https://mimic.physionet.org/).

From the 26 tables provided in the database, including ADMISSIONS, CALLOUT, CAREGIVERS, PRESCRIPTIONS, SERVICES and TRANSFERS, users can gather a wealth of information about each patient and use it for a myriad of machine learning and deep learning tasks. I will be focusing on how to use this information to improve predictions of patient mortality and hospital readmission.

For the first few weeks of this project I focused on gathering the dataset, understanding what it contains and visualizing it with tools including pgAdmin4, Tableau and Jupyter Notebook. I also learned about and calculated various severity scores for each patient, which I discuss later in the article.

Once granted access to the MIMIC-III database, it is suggested that you load the data into an RDBMS (relational database management system). Physionet provides tutorials on building the database in a local instance of the PostgreSQL RDBMS, which I followed. After connecting to the PostgreSQL database, I was able to easily run SQL queries and connect the database to many helpful tools such as pgAdmin4, which provides a GUI (graphical user interface) for the database.
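As a quick sanity check after the build, a query like the following can be run (a minimal sketch, assuming the default mimiciii schema created by the Physionet build scripts):

```sql
-- Count the patients and hospital admissions loaded into the local database.
SET search_path TO mimiciii;

SELECT COUNT(DISTINCT subject_id) AS num_patients,
       COUNT(*)                   AS num_admissions
FROM admissions;
```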

How the MIMIC-III database looks on pgAdmin4

Through this GUI I was able to visualize the tables properly and understand the data and their relationships better.

I then started learning about ICU severity scores and how they relate to a patient’s mortality risk. ICU scoring systems were introduced almost 30 years ago with the goal of using physiologic data available at ICU admission to predict individual patient outcomes. Although these predictions have little utility for managing individual patients, they provide a mechanism to assess ICU performance by comparing actual outcomes in a given population to the outcomes observed in the reference population used to develop the prediction algorithms. Some of the most popular severity scores include OASIS, SAPS, SOFA and SAPS II. The GitHub repository https://github.com/MIT-LCP/mimic-code/tree/master/concepts/severityscores contains SQL scripts that compute these severity scores from the information provided in the database. Below is a description of each score, along with a graph showing how many of the ~60,000 patients received each value. These graphs were created in Tableau after connecting Tableau to my PostgreSQL database.
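Each script in that repository builds a materialized view, one per score, keyed by ICU stay. The data behind histograms like the ones below can then be pulled with a simple aggregation; here is a sketch for SOFA, assuming the sofa view and column names produced by those scripts:

```sql
-- Distribution of SOFA scores across ICU stays, assuming the "sofa"
-- materialized view produced by the mimic-code severity score scripts.
SELECT sofa     AS sofa_score,
       COUNT(*) AS num_stays
FROM sofa
GROUP BY sofa
ORDER BY sofa;
```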

SOFA

The Sepsis-related Organ Failure Assessment score was first developed by a consensus meeting of the ESICM in October 1994, though it eventually became known as the Sequential Organ Failure Assessment (SOFA) score as it was applied outside of septic populations (Vincent, 1996). The purpose of the score was to provide the clinical community with an objective measure of the severity of organ dysfunction in a patient. It is stressed that the score is not meant as a direct predictor of mortality but rather a measure of morbidity, or the level of the diseased state, in a patient. The score is evaluated across 6 organ systems: pulmonary, renal, hepatic, cardiovascular, haematologic and neurologic. Each system is scored from 0–4, so the total ranges from 0–24, with 0 being the least severe condition and 24 the most severe; the highest scores are associated with a greater than 90% chance of mortality. For example, a patient scoring 3 on the cardiovascular system, 2 on the renal system and 1 on each of the remaining four systems would have a total SOFA score of 9.

Y-axis is the number of patients, X-axis is the SOFA score of the patient

SAPS

The Simplified Acute Physiology Score (SAPS) was intended as a simplification of the Acute Physiology Score (APS), reducing the number of physiological parameters required from the original 34 to 13 plus age (LeGall, 1984). The variables chosen were present for 90% of patients in the initial survey used to develop the APS (Knaus, 1981). The higher the SAPS score, the more severe the patient’s condition.

Y-axis is the number of patients, X-axis is the SAPS score of the patient

SAPS II

The Simplified Acute Physiology Score II (SAPS II), published in 1993 (Le Gall, 1993), aimed to rectify two issues with SAPS. First, the variable selection in SAPS was done by clinical judgement, whereas SAPS II utilised univariate feature selection to filter out features uncorrelated with hospital mortality. Second, there was no model for calculating a probability of mortality from SAPS. SAPS II ranges from 0–163, with a score of 0 corresponding to a 0% predicted probability of mortality and 163 to a 100% predicted probability of mortality.

Y-axis is the number of patients, X-axis is the SAPS II score of the patient
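To illustrate that second point, the logistic model published alongside SAPS II converts the score into a predicted probability of hospital mortality. Below is a sketch, assuming the sapsii materialized view created by the mimic-code severity score script mentioned earlier (which I believe also stores this value in a sapsii_prob column):

```sql
-- Predicted hospital mortality from the SAPS II score, using the logistic model
-- reported in Le Gall (1993):
--   logit = -7.7631 + 0.0737 * score + 0.9971 * ln(score + 1)
SELECT icustay_id,
       sapsii,
       1 / (1 + EXP(-(-7.7631 + 0.0737 * sapsii + 0.9971 * LN(sapsii + 1))))
         AS predicted_mortality
FROM sapsii
LIMIT 10;
```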

OASIS

The Oxford Acute Severity of Illness Score (OASIS) is a parsimonious severity score developed using a hybrid genetic algorithm and particle swarm optimization approach, which allowed direct optimization of a severity score in a clinically relevant form with simultaneous multivariate feature selection (Johnson, 2013). OASIS was designed to have an extremely low burden for data collection and quality control, requiring only 10 features and no laboratory measurements, diagnoses or comorbidity information.

Y-axis is the number of patients, X-axis is the OASIS score of the patient

After examining these results I connected my PostgreSQL database to a local Jupyter Notebook, where I could execute SQL queries directly from the notebook. I used these queries to determine the cohort I wanted to use for my future tests. This cohort includes only adults (patients older than 15 years of age at the time of ICU admission) and only each patient’s first admission, to avoid confusion with readmissions. To aid my cohort selection, I used some of the sample queries provided on Physionet: https://mimic.physionet.org/tutorials/intro-to-mimic-iii/.

We start by collecting each patient’s date of birth and their admission dates:
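A query along the lines of the Physionet tutorial, joining the PATIENTS and ADMISSIONS tables, does this:

```sql
-- Date of birth and every hospital admission time for each patient.
SELECT p.subject_id,
       p.dob,
       a.hadm_id,
       a.admittime
FROM patients p
INNER JOIN admissions a
  ON p.subject_id = a.subject_id
ORDER BY p.subject_id, a.admittime;
```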

Next, we figure out the patient’s first admission date so that we do not have multiple records of the same patient and so that we only have one age per patient:
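One way to do this, following the tutorial, is a window function that attaches the earliest admission time to each patient’s rows:

```sql
-- Earliest hospital admission time per patient, attached to every admission row.
SELECT p.subject_id,
       p.dob,
       a.hadm_id,
       a.admittime,
       MIN(a.admittime) OVER (PARTITION BY p.subject_id) AS first_admittime
FROM admissions a
INNER JOIN patients p
  ON p.subject_id = a.subject_id
ORDER BY p.subject_id, a.admittime;
```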

We then find the age of each patient by taking the difference between their date of birth and the date of their first admission, and group the patients into three age categories: neonatal (< 15 years of age), adult (15–89 years of age) and > 89 years of age:
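A sketch of the age calculation, adapted from the tutorial query (patients older than 89 have their dates of birth shifted during de-identification, which is why they form their own category):

```sql
-- Age at first admission, bucketed into the three groups described above.
WITH first_admission AS (
  SELECT p.subject_id,
         p.dob,
         p.gender,
         MIN(a.admittime) AS first_admittime,
         MIN(ROUND((CAST(a.admittime AS date) - CAST(p.dob AS date)) / 365.242, 2))
           AS first_admit_age
  FROM patients p
  INNER JOIN admissions a
    ON p.subject_id = a.subject_id
  GROUP BY p.subject_id, p.dob, p.gender
)
SELECT subject_id,
       gender,
       first_admittime,
       first_admit_age,
       CASE
         -- shifted dates of birth make these ages appear as roughly 300
         WHEN first_admit_age > 89 THEN '>89'
         WHEN first_admit_age >= 15 THEN 'adult'
         ELSE 'neonate'
       END AS age_group
FROM first_admission
ORDER BY subject_id;
```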

Finally, we can use this information to categorize the patients that we want in our cohort for further tests:
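A grouping query along these lines (a sketch that reuses the age calculation above) produces the breakdown shown below:

```sql
-- Number of patients in each age group, split by gender.
WITH first_admission AS (
  SELECT p.subject_id,
         p.gender,
         MIN(ROUND((CAST(a.admittime AS date) - CAST(p.dob AS date)) / 365.242, 2))
           AS first_admit_age
  FROM patients p
  INNER JOIN admissions a
    ON p.subject_id = a.subject_id
  GROUP BY p.subject_id, p.gender
)
SELECT CASE
         WHEN first_admit_age > 89 THEN '>89'
         WHEN first_admit_age >= 15 THEN 'adult'
         ELSE 'neonate'
       END AS age_group,
       gender,
       COUNT(*) AS num_patients
FROM first_admission
GROUP BY 1, gender
ORDER BY 1, gender;
```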

Final cohort groupings; I will be using the adult male and female patients:
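Since I will be working with the adult patients only, the cohort itself can then be pulled as a list of subject_ids (again a sketch reusing the same age calculation):

```sql
-- Subject IDs making up the adult cohort (age 15-89 at first admission).
WITH first_admission AS (
  SELECT p.subject_id,
         p.gender,
         MIN(a.admittime) AS first_admittime,
         MIN(ROUND((CAST(a.admittime AS date) - CAST(p.dob AS date)) / 365.242, 2))
           AS first_admit_age
  FROM patients p
  INNER JOIN admissions a
    ON p.subject_id = a.subject_id
  GROUP BY p.subject_id, p.gender
)
SELECT subject_id, gender, first_admittime, first_admit_age
FROM first_admission
WHERE first_admit_age BETWEEN 15 AND 89
ORDER BY subject_id;
```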

The Tableau workbook and the Jupyter Notebook as well as any additional materials I use for this project can be found on my GitHub repository for this project here: https://github.com/PallabPaul/mimic-mortality-predictions.

Now that I have examined the database properly, determined severity scores for all of the patients and selected a cohort for further testing, my next step will be to draw a correlation between these severity scores and mortality predictions. I will also be emulating the deep learning experiments with the Super Learner algorithms discussed in a research study published in the “Journal of Biomedical Informatics”, found here: https://arxiv.org/pdf/1710.08531.pdf.
