Classification of COVID-19 victims using Machine Learning Models.

Introduction(Need of the hour)

Published in

Analytics Vidhya

5 min readAug 28, 2020

Mostly Machine Learning has found its application in Medical Field, Automation cracking the Number crunching Algorithms etc.

COVID-19 is one of the most dark era humanity has ever faced,it has totally endeavored the year 2020 and leaving behind a dark spot for our future generations.

Researchers,scholars and professionals in their respective fields are making ends meet to get us out of this pandemic situation.

It’s time we contribute to their work within our respective fields of expertise.This blog is about the classification of COVID victims as either mild,serious or critical.Based on this we can suggest whether patients can be put into quarantine, hospitals or under ventilators,It is just a naive approach towards solving the problem, it can be taken to next level for with critical thinking and making use of sophisticated algorithms.

About the Dataset

The dataset is taken from Kaggle and it contains many attributes which are essential for building the model, it also contains the medical history of a particular patient, blood samples, diabetes, diseases etc.

The dataset is huge and contains both floating point data and categorical data.

Key Observations

The key observation here is the dataset is sparse as a whole and doesn’t have all the required data under a particular field, some data is missing as the samples were not collected or aren’t available for that field.
All the attributes in the dataset don’t contribute significantly to the Machine Learning Model we are going to build.
Imputing categorical features may lead to spurious tuples which doesn’t make sense.So we are ignoring categorical features in our Dataset.

These are the few images of the dataset

This is just a sample view of the datasets.

As we can see there are many values which are unknown or Nan values so we have to fill the holes before applying the model.

How to Impute the Dataset ?

Let’s just explore the conventional methods first before jumping to our Idea.

To simply ignore the column — But that can’t be done in this case since this is a sparse dataset after the deletion we would be left with only a handful of attributes and significant attributes maybe lost.
To impute with central Tendency —In traditional Machine Learning Models we use to impute the NaN values with mean of the entire column, since every column is sparse it is difficult to find the mean as the mean found would not be near to the actual mean because of too many missing values.

So in such cases we have to think out of the box as desperate time needs desperate measures.

We train a model to first predict the missing values but which is not simple either as we have to have the expertise over the domain and also on the stats.

But this is to show the application of Multiple Linear Regression in discarding the insignificant attributes.

The way Imputation has been performed on the dataset is:

Neglecting the categorical data for the first phase of classification, as imputing them would make no sense because of their binary “YES” or “NO” property.
Imputing the real values with suitable measures of central Tendency.

How the Machine Learning Model is Applied?

Once we Impute the dataset now we need to eliminate insignificant attributes from the dataset using Backward Elimination(Multiple Linear Regression).

The concept of backward Elimination is explained in detail in my previous blog, this is just the continuation of the previous blog.

Once the above procedures are done now its a downhill from here to apply various classification Models.

The models that gave the best results were

Support Vector Machine(90.8%)
Decision Tree Classifier(89.54%)

I made use of Confusion Matrix to visualize how the output is scattered across various class.

The code can be seen below

import numpy as np
import pandas as pd
import matplotlib.pyplot as pltdataset = pd.read_csv("dataset1.csv",encoding = 'latin1')
datasetx = dataset.iloc[:,1:-4].values
y = dataset.iloc[:,-4].values#Imputing values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN',strategy = 'mean')
imputer = imputer.fit(x[:,:])
x[:,:] = imputer.transform(x[:,:])#LabelEncoding Y part
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y) #Dividing the dataset into training and testing
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 0)#Standardize using StandardScaler
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
x_train = sc_X.fit_transform(x_train)
x_test = sc_X.fit(x_test)import statsmodels.formula.api as sm
#Applying the Backward Elimination
def BackwardElim(x,sl):
    n = len(x[0])
    for i in range(0,n):
        regressor_OLS = sm.OLS(y,x).fit()
        maxp = max((regressor_OLS.pvalues).astype(float))
        if maxp > sl:
            for j in range(0,n-i):
                if(maxp == regressor_OLS.pvalues[j].astype(float)):
                    x = np.delete(x,j,1)
                    
    regressor_OLS.summary()
    return xsl = 0.05
x_opt = x
x_model = BackwardElim(x_opt,sl)
x_model.shapefrom sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_model,y,test_size = 0.2,random_state = 0)#Applying RandomForest
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10,criterion = 'entropy',random_state = 0)
classifier.fit(x_train,y_train)y_pred = classifier.predict(x_test)from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
classifier.score(x_test,y_test)#Applying SVM
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear',C=1)
classifier.fit(x_train,y_train)y_pred = classifier.predict(x_test)from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
classifier.score(x_test,y_test)#Applying Decision Tree
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy',random_state = 0)
classifier.fit(x_train,y_train)y_pred = classifier.predict(x_test)from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
classifier.score(x_test,y_test)

The above is the exact code I have used for classification if somebody needs the dataset they can message me.

Future Scope

The model can be made further sophisticated both in terms of its Application and also its Accuracy.

We can make use of the categorical variables which we had neglected in order to increase the accuracy.
We can tune the parameters using k-fold cross validation, Gradient Boosting ,Ensemble learning etc.

The rest is left to you to come up with brainstorming ideas to tackle this situation.