
Using machine learning to predict intensive care unit patient survival

Image credit: https://www.widsconference.org/datathon.html

Earlier this year we participated in the WiDS Datathon 2020, where we had to solve a business problem with social impact. In this blog post I’ll share our findings.

The problem

Image credit: https://medicalxpress.com/news/2020-01-hospital-critical-resuscitation-patients-chances.html

The first 24 hours in an intensive care unit (ICU) are critical for any patient. Based on the patient’s condition, the hospital must provide advanced critical care tailored to their individual needs to improve their chances of survival. Running an ICU involves many challenges, including the demand for enough specialized intensivists (board-certified physicians who provide care for critically ill patients), technologies, and material supplies. These challenges directly affect patient survival.

With large ICU datasets now available, we can uncover key patterns in patients’ health records and use them to guide medical care when it matters most. Such patterns can also help doctors and nurses build a personal profile for each patient and identify the top risk factors associated with patient survival.

Our goals

We had two major goals. First, to identify the top risk factors associated with a high mortality rate, which would help the medical system understand the severity of a patient’s condition and take quick action. Second, to predict mortality, which can help the medical facility create a better action plan to improve patient survival. We believe these could serve as an efficient long-term solution for hospitals.

Data

We downloaded the dataset from the Women in Data Science Datathon 2020 competition website on Kaggle. The training dataset contains more than 97,000 hospital ICU records from patients, covers a one-year timeframe, and includes 186 attributes. The binary target variable is ‘hospital_death’, which indicates whether the patient died while hospitalized.
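Loading the data is a one-liner with pandas; a small sketch (the file name is whatever the training CSV from the competition page is called, assumed here):

import pandas as pd

# Load the training data downloaded from Kaggle (file name assumed)
df = pd.read_csv('training_v2.csv')
print(df.shape)  # expect roughly 97,000 rows and 186 columns
print(df['hospital_death'].value_counts(normalize=True))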

We started with exploratory data analysis, followed by data preparation. We then went through a few iterations of feature selection and model building. Finally, we compared model performance and generated the predictions. We worked in Jupyter notebooks using Anaconda and Python libraries, in particular the scikit-learn machine learning library. The following diagram illustrates the process we followed.

Exploratory Data Analysis

We noticed that the target classes are imbalanced: only about 8.6% of the observations have hospital_death = 1. That means that if we always predicted 0, we would achieve an accuracy of 91.4%. That would be easy but useless.

Distribution of target

That’s why classification accuracy is generally not used to evaluate models on imbalanced classes. The quality of models in the Kaggle competition was measured using the Area Under the ROC (Receiver Operating Characteristic) Curve (AUC), a popular measure of model quality that captures the trade-off between the true positive rate (TPR) and the false positive rate (FPR). It measures how well the model separates the two target classes; a higher AUC indicates a more reliable model.
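For intuition, AUC can be computed directly from predicted probabilities with scikit-learn; here is a toy example with made-up labels and scores (not from our data):

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.1, 0.3, 0.8, 0.2, 0.35, 0.4, 0.05, 0.9]

# Prints ~0.93: 14 of the 15 positive/negative pairs are ranked correctly
print(roc_auc_score(y_true, y_score))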

There are a few ways to handle imbalanced classes, such as resampling or the Synthetic Minority Over-sampling Technique (SMOTE). We took the resampling approach and up-sampled the minority class. Up-sampling randomly resamples observations from the minority class with replacement to reinforce its signal. We used scikit-learn’s resample utility to perform the up-sampling.
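A minimal sketch of the up-sampling step, assuming the training data lives in a pandas DataFrame df with the target column hospital_death (the seed value is illustrative):

import pandas as pd
from sklearn.utils import resample

# Separate the two classes
majority = df[df['hospital_death'] == 0]
minority = df[df['hospital_death'] == 1]

# Randomly resample the minority class with replacement
# until it matches the majority class in size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

# Recombine into a balanced training set
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced['hospital_death'].value_counts())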

Data preparation

We performed data preparation as follows:

Our data preparation steps

For the categorical features, we dropped constant-value features and features where every value is unique, such as identifiers. We merged categories that looked similar based on their names. For instance, the feature ‘apache_2_bodysystem’ has two categories, ‘Undefined Diagnoses’ and ‘Undefined diagnoses’, which appeared to be the same, so we merged them. We also encoded the categorical features as one-hot numeric arrays (after label encoding the category strings). For the continuous features, we removed those with more than 75% missing values. For the remaining features, we imputed missing values using scikit-learn’s SimpleImputer with strategy='mean', which replaces missing values with the column mean.
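A condensed sketch of these steps, assuming the raw training data is in a pandas DataFrame df (the column name and the 75% threshold come from the text above; pd.get_dummies stands in for the encoder we used):

import pandas as pd
from sklearn.impute import SimpleImputer

# Merge near-duplicate categories that differ only in capitalization
df['apache_2_bodysystem'] = df['apache_2_bodysystem'].replace(
    {'Undefined diagnoses': 'Undefined Diagnoses'})

# One-hot encode the categorical features
df = pd.get_dummies(df, columns=['apache_2_bodysystem'])

# Drop continuous features with more than 75% missing values
numeric_cols = df.select_dtypes(include='number').columns
missing_ratio = df[numeric_cols].isna().mean()
keep_cols = missing_ratio[missing_ratio <= 0.75].index

# Impute the remaining missing values with the column mean
imputer = SimpleImputer(strategy='mean')
df[keep_cols] = imputer.fit_transform(df[keep_cols])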

Feature Selection

Since the dataset has many features, our primary focus was feature selection. Initially we could not even run all the models with all 186 features due to limited computing resources. We used a Random Forest model to rank the features by importance and selected the top 30 for modeling.
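A sketch of the selection step, assuming X (a DataFrame of prepared features) and y (the target) are available; the variable names are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest on all features to estimate their importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Rank features by importance and keep the top 30 for modeling
top_idx = np.argsort(rf.feature_importances_)[::-1][:30]
top_features = X.columns[top_idx]
X = X[top_features]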

Modeling

We partitioned the data set into a 70:30 training/test split.

from sklearn.model_selection import train_test_split

rand_st = 42  # fixed random seed used throughout (value assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rand_st)

We applied various modeling techniques, starting with Logistic Regression.

import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Cross-validation metrics: accuracy and area under the ROC curve
scorers = {'Accuracy': 'accuracy', 'roc_auc': 'roc_auc'}

start_ts = time.time()
clf = LogisticRegression(random_state=rand_st)
scores = cross_validate(clf, X_train, y_train, scoring=scorers, cv=5)
scores_Acc = scores['test_Accuracy']
print("Logistic Regression Acc: %0.5f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
scores_AUC = scores['test_roc_auc']
print("Logistic Regression AUC: %0.5f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
print("CV Runtime:", time.time() - start_ts)

The Logistic Regression model produced the following results:

Logistic Regression Acc: 0.92601 (+/- 0.00)
Logistic Regression AUC: 0.86425 (+/- 0.00)
CV Runtime: 38.70279788970947

Next, we tried the Random Forest classifier.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

numTrees = 100  # number of trees in the ensemble (value assumed)
start_ts = time.time()
clf = RandomForestClassifier(criterion='entropy', max_depth=None, n_estimators=numTrees, min_samples_split=3, random_state=rand_st)
scores = cross_validate(clf, X_train, y_train, scoring=scorers, cv=5)
scores_Acc = scores['test_Accuracy']
print("Random Forest Acc: %0.5f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
scores_AUC = scores['test_roc_auc']
print("Random Forest AUC: %0.5f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
print("CV Runtime:", time.time() - start_ts)

The Random Forest classifier produced the following results:

Random Forest Acc: 0.93008 (+/- 0.00)
Random Forest AUC: 0.88803 (+/- 0.00)
CV Runtime: 51.02054500579834

The Random Forest model’s AUC was better than the Logistic Regression’s. We also tried Ada Boosting, Gradient Boosting, Support Vector Machine, and neural network models. In each case, the goal was to find the best model based on AUC and computation time. Computation time is a critical factor in any modeling effort, since there is a trade-off between model accuracy and training time.

The Gradient Boosting experiment was as follows:

start_ts = time.time()
clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=numTrees, min_samples_split=3, max_depth=3, random_state=rand_st)
scores = cross_validate(clf, X_train, y_train, scoring=scorers, cv=5)
scores_Acc = scores['test_Accuracy']
print("Gradient Boosting Acc: %0.5f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
scores_AUC = scores['test_roc_auc']
print("Gradient Boosting AUC: %0.5f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
print("CV Runtime:", time.time() - start_ts)

It produced these results:

Gradient Boosting Acc: 0.93037 (+/- 0.00)
Gradient Boosting AUC: 0.89271 (+/- 0.01)
CV Runtime: 29.075139045715332

The following table summarizes the model comparison using 5-fold cross-validation.

Model performance comparison: accuracy, AUC, and runtime
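The comparison itself can be scripted as a loop over classifiers; a sketch reusing scorers, rand_st, and numTrees from the snippets above (SVM and the neural network are omitted here for brevity):

import time
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

models = {
    'Logistic Regression': LogisticRegression(random_state=rand_st),
    'Random Forest': RandomForestClassifier(n_estimators=numTrees, random_state=rand_st),
    'Ada Boosting': AdaBoostClassifier(n_estimators=numTrees, random_state=rand_st),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=numTrees, random_state=rand_st),
}

# Cross-validate each model and report accuracy, AUC, and runtime
for name, clf in models.items():
    start_ts = time.time()
    scores = cross_validate(clf, X_train, y_train, scoring=scorers, cv=5)
    print("%s: Acc %.5f, AUC %.5f, runtime %.1fs" % (
        name, scores['test_Accuracy'].mean(),
        scores['test_roc_auc'].mean(), time.time() - start_ts))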

We found that the Gradient Boosting model performed best on both accuracy and AUC. This may be because Gradient Boosting combines an ensemble of weak learners into a strong learner, which outperformed the other models in this case. We also found that the SVM model took far longer than the others to finish.

Finally, we created a stacking model. Stacking is an ensemble technique that combines other models, called base models, by training a meta-model on their predictions. We used Random Forest, Ada Boosting, and Gradient Boosting as base models and another Random Forest as the meta-model. We tried the stacked model to get a higher score on Kaggle.

Stacking Model
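scikit-learn’s StackingClassifier (available since version 0.22) expresses this pattern directly; a sketch with our base models and a Random Forest meta-model, not necessarily the exact configuration we ran:

from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)

# Base models whose cross-validated predictions feed the meta-model
base_models = [
    ('rf', RandomForestClassifier(n_estimators=numTrees, random_state=rand_st)),
    ('ada', AdaBoostClassifier(n_estimators=numTrees, random_state=rand_st)),
    ('gb', GradientBoostingClassifier(n_estimators=numTrees, random_state=rand_st)),
]

# A second Random Forest learns how to combine the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=RandomForestClassifier(random_state=rand_st),
                           cv=5)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))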

Feature Importance

The following figure shows the top features of the final model.

Feature Importance

The top 10 features include two probabilistic mortality scores, several lab results (notably systolic blood pressure, lactate level, and heart rate), and the patient’s age. Based on the correlation analysis, we found that these lab results correlate negatively with the target. Both probabilistic mortality scores correlate positively with patient death, and, as we expected, so does the patient’s age.
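A sketch of that correlation check, assuming df holds the prepared data and top_features holds the selected features from the feature-selection step:

# Pearson correlation of each top feature with the target;
# positive values indicate higher risk, negative values a protective association
corr = df[list(top_features) + ['hospital_death']].corr()['hospital_death']
print(corr.drop('hospital_death').sort_values())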

Summary

This framework attempts to address two potential functions: (1) identifying the top factors associated with a high mortality rate and (2) predicting mortality. It uses state-of-the-art machine learning techniques to learn from the ICU data. We think this framework will help medical facilities create a sustainable care system that addresses the needs of an ICU, such as the availability of doctors, nurses, and medical resources, and thereby improves patient survival.

How our results can be implemented

With some enhancements, this approach could serve as a real-time, data-driven decision-making aid for hospital management. A hospital could use it to build a personalized care plan for each patient. As new data is gathered, the framework can be retrained to stay current.

Future work

There are many applications of ICU data analysis. We were particularly interested in predicting patients’ survival and identifying the top factors behind a patient’s ICU visit. Future work could focus on association analysis of the diseases, or of diseases and symptoms, found in ICU data. Another direction is to find an optimal scheduling model for doctors and nurses based on ICU patient demand. It would also be valuable to derive new features with the help of a medical domain expert and rerun the study.

Takeaways

The first 24 hours in the ICU deal with the extremes of life and death. We believe our predictive framework can be greatly beneficial, helping the medical system provide the highest quality of care to patients.

WiDS competitions allow women to work on great data science challenges with social impact. I have learned a lot while working on this problem and am looking forward to the next WiDS datathon. I want to thank my amazing mentor, colleague, and datathon partner Svetlana Levitan for motivating me to write my first blog post.

Thank you for reading. Please let me know if you have any questions.

Shatabdi Choudhury
Center for Open Source Data and AI Technologies
