The 2020 Pandemic — Classified

Vanthian Balasubramanian
8 min readMay 27, 2020

--

Are we getting Immune? Is the death rate reducing?

Time evolves we “humans get immune”, we get used to the situation, and that is what seems to happen during the current situation. Yes, our bodies have started developing their own antibodies to resist the novel virus. The above statement may not remain true for many but certainly have remained true for the majority. According to an article in Global News , it states that the patients get more immune to the novel virus, and it has got its own pros and cons. “Thus resulting in decrease in deaths due to the pandemic”.

Deaths due to COVID-19 decreases day after day

As a data enthusiast, how do I validate or convince myself that the above facts are actual? I choose to work with a dataset obtained from Statistics Canada and run few data classification models to study the model behaviour and predictions and also to understand the death toll trend in Canada, and finally drew a conclusion on the above facts..

Before running a model to access its ability to predict. We need to determine the attributes which will aid in formulating. To fix that, a plot for each pair of attributes is plotted.

Pairwise

The above plot gives an understanding of the relationships between each of the attributes in the dataset.

Correlation

The plot depicts a clear understanding of the relationship between each attribute and which could potentially aid in computing and analyzing the data using classification models.

Before getting our hands-on analyzing the dataset using the classification models, a clear understanding of the dataset is vital. The selected dataset has information on the age group, gender, status of the individual. The value for each attribute is mentioned categorically.

As far as this blog is concerned, three different data classification models have been used to test the model and evaluate the quantitative measure of the dataset. An attempt to assess the accuracy of the death status of a patient, precisely answering the question “Will the patient be alive or dead?”. It is done by comparing the attribute death against the age groups, gender and mode of transmission. Each of the attributes is classified categorically. The reason behind analyzing this particular dataset is that the particulars are categorized, and the desired output is expected to be computed as a binary variable; with that perspective, this specific dataset was chosen.

Selecting the right attributes

As mentioned, three different classification models were accessed to evaluate and compare the accuracy of their predictions.

To experiment with the dataset, Initially, a model of Death Vs every other attribute of the dataset was used to estimate the accuracy of the model using Decision Tree, and it was observed that “Recovered” tends to dominate the “Death” attribute, thus resulting in overfitting, this implicates that the model couldn’t use other attributes to predict. The below plot depicts the overfitting aspect.

Overfitting

To overcome the ambiguity caused by “Recovered,” a new data frame was loaded by removing the attribute “Recovered.”

The following attributes are taken into consideration for evaluating the model accuracy,

Age Group, Transmission, Hospital Status of the patients

*the decision variable has been addressed as “outcome.”

Data Visualization

Using the above Decision Tree model, the attribute that caused overfitting was removed, and the following plots show the relationship between the attributed selected for analyzing the classification models

Plot 1

The plot shows the number of people affected due to various modes of transmission of the novel virus and their hospital status. It can be inferred that people who have been infected due to community spread have been more critical than the other mode of transmission.

Plot2

The plot shows the relationship between the age group and hospital status, and it can be seen that people above the age of 60 years are found to be hospitalized with no critical situation, but the majority of the people less than the age of 50 years, are not hospitalized at all despite few showing symptoms. This also supports the statement, “Humans are getting immune…” , In short, it makes it understandable that younger the age more the immunity power, thus very few chances of getting hospitalized and death

The dataset was split with a ratio of 70–30, Where 70% of the data was used to train, and the remaining 30% of the data was used to test and predict to evaluate the accuracy of the model.

Decision Tree

Now using the modified dataset (without Recovered column), a Decision tree model was executed. The below plot depicts the decision tree for predicting the desired attribute “Death” other attributes Age Group, Hospital Status and Transmission.

Decision Tree

It can interpret that patients who fall above the age of 60, have higher chances of death, than patients below the age of 60. It can also be interpreted that people whose medical conditions are critical, as per our dataset, who is in ICU, has higher chances of death, that patients who are hospitalized. Thus the model predicts the output with an accuracy of approximately 91.25%, with an error rate of 8.75%

This implies that the Decision Tree model will answer the question,“ Will the patient be alive or dead?” with an accuracy of 91.25%.

Decision Tree model aids in identifying the attributes that result in overfitting, that inturn predicts higher accuracy for training data and lower accuracy in testing data

Support Vector Machine (SVM)

The above model using the Decision Tree gives a detailed insight into the classification, but to seek a model with better accuracy, the dataset is classified using the SVM method. This model classifies the dataset by fitting it on a hyperplane that best fit the categories.

In our case, we tried to fit our data using the SVM model, and it happens to predict the output with an accuracy of 92.56%. (The plots make it a bit difficult to interpret to output). The predicted data with the above accuracy can conclude that SVM performs better than the Decision Tree model, as it produces more accurate predictions than the previous model.

Thus the model predicts the output for the dataset with an accuracy of approximately 92.56%, with an error rate of 7.44%

This implies that the SVM model will answer the question, “Will the patient be alive or dead?” with an accuracy of 92.56%.

Logistic Regression

Stacey Ronaghan, in her “Machine Learning: Trying to classify your data,” suggests that the Logistic Regression model uses the probability of the binary outcomes to predict the desired result.

The below plot shows the threshold of our test dataset, i.e. predicted outcome, which determines the accuracy of the model for the given dataset.

ROC plot

For the given dataset, it happens to predict the output with an accuracy of 92.33% and an error rate of 7.67% using the Logistic Regression Model. On comparing with the SVM model, SVM tends to predict the outcome with an accuracy of 92.56%, differs the Logistic Regression model by a difference of 0.2%,

This implies that the Logistic Regression model will answer the question, “Will the patient be alive or dead?” with an accuracy of 92.33%.

On comparing SVM and Logistic Regression model,

SVM performs better than Logistic Regression for datasets with multiple attributes or features.

K-Nearest Neighbour (KNN)

Onnel Harrison, in his “Machine Learning Basics with the K-Nearest Neighbors,” the algorithm of KNN assumes that similar things exist in close proximity.

KNN

Our dataset was classified using the KNN model. The plot depicts the best nearest neighbour value, where the model can predict the outcome with the highest accuracy. In our case, K= 43, and the accuracy is estimated to be 93.24%, with an error rate of 6.76%. On comparing with other models, it is evident that the KNN model has predicted the outcome with higher accuracy and lower error rate than the other models taken into account for performing data classification for the dataset chosen.

This implies that the K Nearest Neighbour model will answer the question, “Will the patient be alive or dead?” with an accuracy of 92.33%.

Conclusion

The accuracy of the prediction of the KNN model differs from the SVM model by 0.72%, which is relatively less. Still, in order to draw more accurate conclusions, the KNN model made a better prediction for the dataset. The Decision Tree model gave a clear picture of both visualizing and concluding the outcome without much tuning and thoughts. The difference in accuracy between KNN and Decision Tree is about 2.13%, which does affect determining the most reliable outcome, but on comparing the Logistic Regression model with KNN, the accuracy of KNN leads by roughly 1%.

More the accuracy, Better the model (note on overfitting)

So now answering the thesis statement, “Humans get immune to the novel virus” & “Will the patient be alive or dead?”. The above data classification models have predicted the answer or outcome for “Will the patient be alive or dead?” in the best possible way and to be precise KNN model have expected it more accurately than the other models. Accordingly, the plot 2 in Data Visualization sections also supports the thesis statement, by showing that only patients who above the age of 60 and to be more concise above 80, have high chances of death and patients below the age group of 50 years, tend to be not hospitalized, which partially states that they don’t show signs of getting critical. In addition to it is also being proved that humans are getting immune to the novel virus. Thus the impact of the virus on humans reduces and becomes infective, thereby decreasing the impact and resulting in a significant reduction on death counts.

Supporting the statement data from worldometer it can be seen that the number of new cases reported around the world has started to decrease recently. Also, the number of people getting cured is. higher in the numbers death rate due to COVID19 are reducing day after day, and also the number of deaths reported has also started to decrease. This suffices in supporting the thesis statement.

Thus the classification models have answered for the question “Will the patient be alive or dead?” with an accuracy ranging from 91% to 93.5%, supporting the thesis statement and answering the question based to perform Data Classification.

#Stay_Home

#Stay_Safe

--

--

Vanthian Balasubramanian

An Instrumentation engineer, turned out to an Industrial engineer with a profound passion on Supply Chain | Data Analytics & |Travel|Food|Experience