The global COVID-19 pandemic that began in early 2020 has had a considerable impact on healthcare systems around the world. With a massive influx of COVID-19 cases, hospital workers are struggling to stretch limited healthcare resources across every patient. One of the most important of these resources, the intensive care unit (ICU), can be vital to the most severely affected patients, and must therefore be allocated efficiently so that those patients can be saved before capacity runs out.
Recognizing the importance of this issue, Hospital Sírio-Libanês, in São Paulo and Brasília, published data on its past COVID-19 patients and asked data scientists to come up with models that can predict a patient’s need for ICU services. In this report, I’m going to analyze this data with classification models in Python. I will start with a simple exploratory data analysis, followed by building and comparing five different models. After selecting the best-performing model, I will run a randomized search with cross-validation to fine-tune it. Lastly, I will evaluate the model on patients who have been admitted to the hospital for different periods of time.
The analysis in this project is based on the following data:
COVID-19 - Clinical Data to assess diagnosis
Sírio-Libanês data for AI and Analytics by Data Intelligence Team
Exploratory Data Analysis
Before diving into real prediction, I did a simple EDA to make sense of the data.
1. Introduction to Column Variables
There are 231 columns in the data, with PATIENT_VISIT_IDENTIFIER used to identify each individual patient. The first three columns contain patient demographic information. Next, there are nine columns on the patient’s previous grouped diseases. Then there are 210 columns of blood results and another 6 columns of vital signs.
The last two columns, as shown above, are essential to my analysis. WINDOW records how many hours have passed since the patient was admitted to the hospital. Each patient is recorded five times, once per WINDOW period (0–2 hours, 2–4 hours, 4–6 hours, 6–12 hours, and above 12 hours). ICU records whether the patient was admitted to the ICU during each window, with “1” meaning admitted and “0” meaning not admitted. One of my main goals in this project is to predict a patient’s need for ICU service as early as possible, in other words, using data from as few window periods as possible.
2. Explore Missing Value
As shown above, the data contain many missing values, especially in the blood results (57.35% of the data are missing). According to the website where I found this data, missing values may be filled with the patient’s next or previous entry, which I did as follows.
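A minimal sketch of this kind of per-patient fill is given here, since the original code is not reproduced in this text. It assumes the data is in a pandas DataFrame; the helper name `fill_missing_per_patient` and the toy column `BLOOD_X` are hypothetical, and taking the next entry first and the previous entry second is an assumption about the order.

```python
import numpy as np
import pandas as pd


def fill_missing_per_patient(df, id_col="PATIENT_VISIT_IDENTIFIER"):
    """Fill gaps using each patient's own rows only (next entry, then previous)."""
    value_cols = [c for c in df.columns if c != id_col]
    filled = df.copy()
    # bfill copies the next recorded value upward; ffill then covers trailing gaps.
    # Grouping by patient keeps values from leaking between patients.
    filled[value_cols] = df.groupby(id_col)[value_cols].transform(
        lambda s: s.bfill().ffill()
    )
    return filled


# Toy data: two patients, three windows each (BLOOD_X is a made-up column).
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "BLOOD_X": [np.nan, 0.5, np.nan, np.nan, np.nan, -0.2],
})
filled_toy = fill_missing_per_patient(toy)
print(filled_toy)
```

Grouping before filling matters: a plain `bfill()` on the whole frame would copy values from one patient's rows into another's.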
3. Add Column ICU_result
According to the website, it is better not to use rows recorded after the patient has already been sent to the ICU, because “it is unknown the order of the event” (the patient may have been admitted to the ICU before the other measurements were taken). I therefore grouped each patient’s entries together, added a new variable ICU_result to them, and dropped the rows where the patient had already been sent to the ICU.
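One way this step could look is sketched below. It assumes that ICU_result means "the patient was admitted to the ICU in at least one window", which the text implies but does not state outright; the helper name `add_icu_result` is hypothetical.

```python
import pandas as pd


def add_icu_result(df, id_col="PATIENT_VISIT_IDENTIFIER"):
    """Label every row with the patient's eventual ICU outcome, then drop ICU rows."""
    out = df.copy()
    # Assumption: ICU_result is 1 if the patient is in the ICU in any window.
    out["ICU_result"] = out.groupby(id_col)["ICU"].transform("max")
    # Keep only rows recorded before ICU admission.
    return out[out["ICU"] == 0].reset_index(drop=True)


# Toy data: patient 0 enters the ICU in the third window, patient 1 never does.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0] * 5 + [1] * 5,
    "ICU": [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
})
labelled = add_icu_result(toy)
print(labelled)
```

After this step the target is ICU_result rather than ICU, because every remaining row has ICU equal to 0.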
As the diagram below shows, the proportions of patients who need and don’t need ICU care are approximately the same.
4. Make Sure Variables are Float / Int to Facilitate Model-Building
After checking the DataFrame’s info, I found that only AGE_PERCENTIL and WINDOW are not int / float. Since WINDOW is a categorical variable, I will simply convert it into dummy variables later during model building. AGE_PERCENTIL is converted into int with the following operation.
I checked the info again and all variables are now float / int, which means I can move on to prediction with classification models.
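The original operation is not reproduced in this text, so the following is only one possible conversion. It assumes AGE_PERCENTIL holds strings such as "10th", "60th" and "Above 90th"; mapping "Above 90th" to 100 is an arbitrary choice made here to keep it distinct from "90th", and the helper name `age_percentile_to_int` is hypothetical.

```python
import pandas as pd


def age_percentile_to_int(series):
    """Turn labels like '60th' or 'Above 90th' into integers."""
    text = series.astype(str)
    # Pull out the numeric part of each label.
    digits = text.str.extract(r"(\d+)", expand=False).astype(int)
    # Assumption: push 'Above 90th' one step up so it differs from '90th'.
    above = text.str.contains("Above", case=False).astype(int)
    return digits + 10 * above


ages = pd.Series(["10th", "60th", "90th", "Above 90th"])
ages_int = age_percentile_to_int(ages)
print(ages_int.tolist())
```

A label encoder would also produce integers, but extracting the number keeps the natural ordering of the age bands.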
1. Get Dummy Variables and Split the Data for Training and Testing
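A short sketch of this step on toy data, assuming scikit-learn's `train_test_split`. The WINDOW labels, the column `FEATURE_X`, the 25% test size, the stratification and the random seed are all illustrative assumptions, not values taken from the original analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "WINDOW": ["0-2", "2-4", "4-6", "6-12", "0-2", "2-4", "4-6", "6-12"],
    "FEATURE_X": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "ICU_result": [0, 0, 0, 0, 1, 1, 1, 1],
})

# One-hot encode the categorical WINDOW column; numeric columns pass through.
X = pd.get_dummies(toy.drop(columns=["ICU_result"]), columns=["WINDOW"])
y = toy["ICU_result"]

# stratify=y keeps the ICU / non-ICU ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X.columns.tolist())
```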
2. Baseline Classification
I ran a baseline classification for reference; its accuracy is 67.46%.
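The text does not say how the baseline was built. A common choice, assumed here, is to always predict the majority class, which scikit-learn provides as `DummyClassifier(strategy="most_frequent")`. The toy labels below are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: 7 patients without ICU need, 3 with. Features are ignored.
X = np.zeros((10, 1))
y = np.array([0] * 7 + [1] * 3)

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

With this kind of baseline the accuracy equals the share of the majority class, which is the number any real model has to beat.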
3. Decision Tree
Next, I fit a decision tree model.
The accuracy of the decision tree is 94.54%, and the recall score is 86.86%. (I included the recall score because I wanted the model to produce as few false negatives as possible, that is, cases where it predicts that a patient doesn’t need ICU care when in fact he/she does.)
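To make the link between recall and false negatives concrete, here is a small sketch on synthetic data (not the hospital data). It fits a decision tree with default settings, which is an assumption, and checks that recall is TP / (TP + FN).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the patient features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"accuracy={accuracy:.3f} recall={recall:.3f} false negatives={fn}")
```

Every false negative lowers recall, so a model with high accuracy but low recall would still miss patients who need ICU care.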
4. Bagging Classifier
The bagging classifier has an accuracy of 96.20% and a recall score of 89.78%.
5. Random Forest Model
The random forest has an accuracy of 96.67% and a recall score of 90.51%.
6. AdaBoost Classification
The AdaBoost classification model has an accuracy of 93.82% and a recall score of 85.40%.
7. Voting Classification
The voting classification model has an accuracy of 96.67% and a recall score of 89.78%.
8. Comparing Models
After a simple study of each model, I collected the accuracy and recall of each. Random forest is the best-performing model: it has the highest recall score (90.51%) and ties with the voting classifier for the highest accuracy (96.67%), so I decided to carry on my analysis with random forest.
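A comparison like this can be collected in one loop. The sketch below uses synthetic data and default hyperparameters. Building the voting classifier from the other four models with hard voting is an assumption, since the text does not say which estimators it combined.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

base_models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}
# Assumption: the voting model takes a majority vote over the four models above.
voting = VotingClassifier(
    estimators=[(name, clone(model)) for name, model in base_models.items()],
    voting="hard",
)
models = {**base_models, "voting": voting}

rows = []
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    })

# One row per model, best recall first.
results = pd.DataFrame(rows).set_index("model").sort_values("recall", ascending=False)
print(results)
```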
To improve my random forest model, I ran a randomized search with cross-validation.
The search returns the set of parameters that performed best for my random forest model, as below.
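A sketch of such a search with scikit-learn's `RandomizedSearchCV` on synthetic data. The parameter ranges, the number of iterations, the 3-fold split and the choice of recall as the scoring metric are all assumptions made for illustration; only the name `optimalModel` comes from the text.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical search space: distributions are sampled, lists are picked from.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": randint(2, 11),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    cv=3,              # 3-fold cross-validation for each combination
    scoring="recall",  # assumption: optimise for few false negatives
    random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters.
optimalModel = search.best_estimator_
print(search.best_params_)
```

Unlike an exhaustive grid search, this samples only `n_iter` combinations, which keeps the cost fixed no matter how large the search space is.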
I evaluated the improved model and obtained the following results:
Both the accuracy and the recall score of the random forest model improved, with accuracy reaching 97.15% and recall reaching 91.97%. Next, I’m going to apply the model to different window periods to see whether it still performs well when the patient has been in the hospital for a shorter period of time.
Making Predictions with Different Window Periods
To make predictions on different window periods, I first extracted and processed the data for each window period.
I made four new DataFrames containing data of the first, first two, first three, and first four window periods.
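The four cumulative subsets could be built along these lines. The WINDOW labels in `WINDOW_ORDER` are assumed, so they should be checked against the actual values in the data; the helper name `first_n_windows` is hypothetical.

```python
import pandas as pd

# Assumed labels for the five window periods, in chronological order.
WINDOW_ORDER = ["0-2", "2-4", "4-6", "6-12", "ABOVE_12"]


def first_n_windows(df, n):
    """Return the rows that belong to the first n window periods."""
    return df[df["WINDOW"].isin(WINDOW_ORDER[:n])].copy()


# Toy data: two patients with one row per window.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0] * 5 + [1] * 5,
    "WINDOW": WINDOW_ORDER * 2,
})

# Keys 1..4 map to the first, first two, first three and first four windows.
subsets = {n: first_n_windows(toy, n) for n in range(1, 5)}
for n, subset in subsets.items():
    print(n, sorted(subset["WINDOW"].unique()))
```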
Next, I made a function using the optimalModel that I just obtained to make predictions on the four new DataFrames.
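The function itself is not reproduced in this text. The sketch below is one plausible shape for it, under the assumption that the model is refit on each subset, since the dummy columns differ between subsets. The name `evaluate_on_subset`, the toy data and the small random forest standing in for optimalModel are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_on_subset(model, df, target="ICU_result", seed=0):
    """Refit a copy of `model` on one window subset and score it on held-out rows."""
    X = pd.get_dummies(df.drop(columns=[target]))
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # clone() copies the hyperparameters but not the fitted state.
    fitted = clone(model).fit(X_train, y_train)
    y_pred = fitted.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }


# Toy subset: synthetic features plus a WINDOW column with two periods.
features, labels = make_classification(n_samples=200, n_features=6, random_state=0)
toy = pd.DataFrame(features, columns=[f"FEATURE_{i}" for i in range(6)])
toy["WINDOW"] = np.resize(["0-2", "2-4"], len(toy))
toy["ICU_result"] = labels

stand_in_model = RandomForestClassifier(n_estimators=25, random_state=0)
scores = evaluate_on_subset(stand_in_model, toy)
print(scores)
```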
Below are the results for the four data sets covering different window periods.
I summarized the accuracy score of the four predictions in the diagram below:
From the results above, even though the model doesn’t perform well on the first window period (0–2 hours), accuracy surges to 94.09% once the second window period is included (0–4 hours) and keeps increasing as the window period expands. Therefore, this random forest model can be very helpful in identifying patients with potential ICU needs, with above 90% accuracy, within the first four hours after they are admitted to the hospital.
- Further model simplification could improve the random forest model and prevent overfitting.
- Prediction on the 0–2 hour window period is still not accurate enough. To improve accuracy and recall, more data could be used to train the model.
- The number of false negative predictions could be reduced further with a better model, so that people in need of ICU services are not overlooked and proper care can be extended to them.