COVID-19: Predicting Death Rate Using Classification

Jaswanth Badvelu
Published in Analytics Vidhya
7 min read · May 28, 2020

The total number of deaths due to COVID-19 in the USA crossed the 100K mark on 26th May. With the daily increase in COVID-19 cases and limited hospital capacity, it is extremely difficult for healthcare workers to monitor everyone who tests positive for COVID-19, and since the fatality rate is less than 10% in most countries, only a minority of those patients are at serious risk. An article published by CNBC states that a significant number of states lack enough ICU beds in their hospitals to deal with the projected wave of COVID-19 cases. However, most countries are updating their COVID-19 data daily, and with advances in machine learning this historical data can be used to predict which people are at risk due to COVID-19.

If we can accurately predict, for every person infected with COVID-19, whether the patient is likely to die, then high-risk patients can be given more priority and monitored continuously to prevent them from dying.

Data Visualization

Because downloading data from the Statistics Canada website is cumbersome, only data for 2000 COVID-19 cases were collected. Some interesting questions about how COVID-19 affects different age groups and genders were answered in my previous blog post. In addition to the variables used there, Statistics Canada has since released updated data with Region and Asymptomatic Status. Let us explore these variables before jumping into classification to see whether any new insights can be found.

In which region are more people being hospitalized?

Fig 1. People Hospitalization Status Region Wise

The Atlantic region includes New Brunswick, Nova Scotia, Prince Edward Island, and Newfoundland and Labrador. The Ontario region includes Ontario and Nunavut, and the Prairies region includes Alberta, Saskatchewan, Manitoba, and the Northwest Territories. British Columbia and Yukon are grouped as BC. The majority of the cases are from the Ontario region, where almost 35% of the people infected with COVID-19 were hospitalized, compared with 25% in British Columbia. In both the Atlantic and the Prairies regions, fewer than 5% of infected people were hospitalized.

Fig 2. People Died Region Wise

The Ontario region had the highest fatality rate, with 17.5% of infected people dying from COVID-19, and the Atlantic region had the lowest, with only 7.45%. The fatality rate in the Prairies region is 9.9%. Surprisingly, even with a large share of people being hospitalized in British Columbia, its fatality rate is 11%, which compares well with Ontario.
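A region-wise fatality rate like the ones quoted above is simply deaths divided by cases within each region. The following is a minimal Python sketch of that calculation, not the code used for the figures. The `region` and `died` field names are assumptions, and the records are made-up toy values rather than the Statistics Canada data.

```python
from collections import defaultdict

# Hypothetical toy records for illustration only; NOT the Statistics Canada data.
cases = [
    {"region": "Ontario", "died": True},
    {"region": "Ontario", "died": False},
    {"region": "Ontario", "died": False},
    {"region": "Ontario", "died": False},
    {"region": "BC", "died": False},
    {"region": "BC", "died": False},
    {"region": "Atlantic", "died": False},
    {"region": "Atlantic", "died": True},
]

def fatality_rate_by_region(records):
    """Return {region: percentage of cases in that region that died}."""
    totals = defaultdict(int)
    deaths = defaultdict(int)
    for rec in records:
        totals[rec["region"]] += 1
        if rec["died"]:
            deaths[rec["region"]] += 1
    return {region: 100.0 * deaths[region] / totals[region] for region in totals}

rates = fatality_rate_by_region(cases)
for region, rate in sorted(rates.items()):
    print(f"{region}: {rate:.1f}% of cases died")
```

The same grouping, applied to a hospitalization flag instead of `died`, would give the hospitalization percentages discussed for Fig 1.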

Are the infected people showing any symptoms?

Fig 3. Different Age Group People Symptom Status

Of all the people who tested positive for COVID-19, almost 50% in both genders did not show any symptoms even after being infected.

It is also interesting to see how many people were hospitalized some days later despite not showing any symptoms.

Fig 4. People Hospitalization Status without showing any symptoms

Among the patients who tested positive for COVID-19 without any symptoms, only people aged 80+ years required medical attention, with some even requiring the ICU. The majority of people aged below 60 years recovered without needing medical attention. Now let us see how many people died without showing any symptoms.

Fig 5. People Death Status without Showing Symptoms

All the people aged below 50 years who did not show any symptoms recovered from COVID-19. Even among people aged 50+ years, the majority recovered after getting medical attention.

Classification

In any machine learning model that predicts one variable, it is crucial to identify the right input variables before training, since it is often difficult to obtain every variable for every patient. The right variables can be identified using a correlation matrix: the variables that are highly correlated with the target variable should be considered.

Fig 6. Correlation Matrix Plot

The correlation matrix shows that the Death variable is correlated with the Age group, Hospital Status, and Asymptomatic variables, which means these variables are related to the outcome. These three variables are therefore used to train the models to predict whether the patient will die.
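To make the idea of a correlation matrix concrete, here is a minimal Python sketch that computes pairwise Pearson correlations between numerically encoded columns. It is an illustration under assumptions, not the code behind Fig 6: the column names, the ordinal encoding of the categories, and the toy values are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column has no defined correlation; report 0 here
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return a nested dict so that matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data with categories encoded as integers (an assumption).
columns = {
    "age_group":       [1, 2, 3, 4, 5, 6],   # e.g. 1 = youngest band, 6 = oldest
    "hospital_status": [0, 0, 1, 1, 2, 2],   # e.g. 0 = none, 1 = hospitalized, 2 = ICU
    "death":           [0, 0, 0, 1, 1, 1],   # 1 = died, 0 = survived
}

matrix = correlation_matrix(columns)
for name, value in matrix["death"].items():
    print(f"death vs {name}: {value:.2f}")
```

Variables whose correlation with `death` is close to 0 would be the candidates to drop.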

After training a model, it is also important to check whether it performs well on unseen data. Using the sample.split command in R, a random 70% of the data is used to train the model, and the remaining 30% is hidden from the model so it can be used to validate it.
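The article does not show its R code, but `sample.split` (from the `caTools` package) splits a label vector in a given ratio while keeping the relative proportion of each label. A rough Python equivalent of that idea is sketched below. The function name, the fixed seed, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.7, seed=42):
    """Split row indices into train/test, keeping each label's share roughly equal."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(train_ratio * len(indices))
        train_idx.extend(indices[:cut])   # first 70% of this label go to training
        test_idx.extend(indices[cut:])    # the rest are held out for validation
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 20 deaths (1) and 80 survivals (0).
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "validation rows")
```

Splitting per label matters here because deaths are the minority class; a purely random split on a small dataset could leave the validation set with very few of them.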

Support Vector Machines

Confusion Matrix & Accuracy when all Variables are used

When the Support Vector Machines model is trained to predict death status using all the variables, it predicts with an accuracy of 87.42%.

Confusion Matrix & Accuracy when only 3 Variables are used to train

The accuracy drops to 84.9% when only the Age group, Hospital Status, and Asymptomatic variables are used to train the model.
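Accuracy figures like these are read off a confusion matrix: the share of validation cases where the predicted label matches the actual one. The sketch below shows that calculation in Python. The labels are made-up toy values, not the model outputs from this article.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for binary labels where 1 = died and 0 = survived."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1   # died, predicted died
        elif a == 0 and p == 0:
            counts["tn"] += 1   # survived, predicted survived
        elif a == 0 and p == 1:
            counts["fp"] += 1   # survived, predicted died
        else:
            counts["fn"] += 1   # died, predicted survived
    return counts

def accuracy(counts):
    """Fraction of all cases that were predicted correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Hypothetical toy labels for ten validation cases.
actual    = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0]

counts = confusion_matrix(actual, predicted)
print(counts, "accuracy:", accuracy(counts))
```

In this setting the `fn` cell deserves particular attention, since it counts patients who died but were predicted to survive.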

Logistic Regression

Confusion Matrix & Accuracy for Logistic Regression in both cases

The Logistic Regression model predicted whether the patient is going to die with 87.8% accuracy. The accuracy was the same in both cases: when all variables are used and when only 3 variables are used.

Decision Tree

For the Decision Tree model, the accuracy was 87.42% when the model was trained with all the variables to predict death status.

Confusion Matrix & Accuracy when all the Variables are used to train

Interestingly, the accuracy improved slightly, to 87.75%, when only the Age group, Hospital Status, and Asymptomatic variables were used to train the model.

Confusion Matrix & Accuracy when only 3 Variables are used to train

The decision tree is plotted below to better understand which variables are given more importance.

Fig 7. Decision Tree Diagram with 3 variables

According to this decision tree, the model assumes that everyone aged below 40 years will survive. For people aged 40+ years, symptoms are checked first, followed by Age group and Hospital Status. Among people with no symptoms, more priority is given to those aged 70+ years. Among people with symptoms, more priority is given to those who required medical attention, with a probability of 75% for people who were moved to the ICU and 30% for people who were hospitalized.
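A decision tree can be read as a chain of nested if/else checks. The sketch below paraphrases the description above in Python to show that reading. It is not an export of the fitted tree: the function name and the `"none"`, `"hospitalized"`, `"icu"` encoding are hypothetical, the 75% and 30% values are read here as predicted probabilities of death, and leaves whose values the text does not state are left unspecified.

```python
def tree_reading(age, has_symptoms, hospital_status):
    """Describe the leaf a patient would reach, following the text's description.

    hospital_status is one of "none", "hospitalized", "icu" (hypothetical encoding).
    """
    if age < 40:
        return "predicted to survive"

    if not has_symptoms:
        # For asymptomatic patients, the split is on age group.
        if age >= 70:
            return "higher-priority branch (probability not stated)"
        return "lower-priority branch (probability not stated)"

    # For symptomatic patients, the split is on hospital status.
    if hospital_status == "icu":
        return "death probability about 0.75"
    if hospital_status == "hospitalized":
        return "death probability about 0.30"
    return "not hospitalized (probability not stated)"

print(tree_reading(35, True, "icu"))
print(tree_reading(65, True, "hospitalized"))
print(tree_reading(82, False, "none"))
```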

Random Forest

The Random Forest model predicted whether the patient is going to die with an accuracy of 87.92% when only the Age group, Hospital Status, and Asymptomatic variables were used to train it, the highest among the three-variable models.

Accuracy when 3 Variables are used to train

Now let us see if there will be any improvement when all variables are used.

Accuracy when all variables are used to train

The accuracy was 88.09% for Random Forest when all the variables were used to train the model to predict death. The variable importance of the random forest was plotted to check which factors have the highest impact on predicting death.

Fig 8. Variable Importance

As expected, the Age group, Hospital Status, and Asymptomatic variables had higher importance than the other variables.

Conclusion

When the patient's Age group, Hospital Status, and whether the patient is showing symptoms are known, a Random Forest can predict whether the patient is going to die with almost 88% accuracy. This helps healthcare workers identify the patients with the highest risk of death so that those patients can be given more priority. However, 88% accuracy means a 12% error rate, which is rather high for the healthcare sector, since we are dealing with patients' lives. The model should therefore be trained with more data and improved through hyperparameter tuning to reduce the error rate before it is used for decision making in healthcare.

Data Collection

The data consists of 2000 COVID-19 cases collected from Statistics Canada.
