Data Preprocessing : COVID-19 Dataset Practical Implementation

Arpit Pathak
ML_with_Arpit_Pathak
May 31, 2020

Hello readers, this blog is a practical implementation of data preprocessing on a dataset using Python. If you haven’t gone through the previous blog about data preprocessing, go through it first.

There are 7 key steps to perform data preprocessing on any type of dataset, which I am going to explain with the example of a COVID-19 dataset as follows:

1. Acquire the Dataset

The first step of data preprocessing is to acquire the dataset on which you want to build your model. Here, I want to create a machine learning model that predicts a person’s risk of suffering from COVID-19. So, I need to collect the factors on which I can train my model to predict that risk.

COVID-19 risk can be predicted for a person by analyzing symptoms, travel history to hotspot places, age, and gender.

For this, I have collected a dataset containing information about people who were surveyed with questions about their age, gender, symptoms, and travel history to infected places, and who were then tested to determine whether they were infected. With that, the first step is complete.

2. Import all crucial Libraries

Now the main practical work begins. We start our Python IDE to carry out the data preprocessing task. For this, I am using the Jupyter Notebook from Anaconda. Let us first import all the required libraries:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

3. Import the Dataset

df = pd.read_csv("COVID19.csv")
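After loading, it helps to sanity-check the dimensions and a few rows before doing anything else. A minimal sketch, using an in-memory stand-in for COVID19.csv (the columns and values here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical stand-in for COVID19.csv
csv_text = (
    "Sno,gender,age,body temperature,Corona result\n"
    "1,Male,34,98.2,No Risk\n"
    "2,Female,52,101.4,High Risk\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# First checks after loading: dimensions and a preview of the rows
print(df.shape)   # (2, 5)
print(df.head())
```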

4. Identify and Handle Missing Values

Let us identify whether our dataset contains any missing values. Check the information of each column of the dataset for missing values. If the number of rows in the dataset is not equal to the non-null count of a column, there are missing values in that column.

df.info()

or

df.isnull().sum()

The ways to handle null values are as follows:

+ Remove Null Values

df = df.dropna()

+ Replace Null Values

Removing records that have null values means losing information from the data. Therefore, a better approach is to replace the null values with a measure of central tendency (mean, median, or mode) of the column in which they occur.
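As a minimal sketch of this replacement strategy (using a toy frame with hypothetical values standing in for the survey data): fill numeric columns with the mean or median, and categorical columns with the mode.

```python
import pandas as pd

# Toy frame standing in for the survey data (hypothetical values)
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "gender": ["Male", "Female", None, "Female"],
})

# Numeric column: replace nulls with the mean of the column
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: replace nulls with the mode (most frequent value)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df.isnull().sum().sum())  # 0
```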

5. Encoding the Data

Encoding the data refers to converting it into a form the machine can understand, which includes converting categorical data into numerical data. Another form of encoding is grouping the continuous values of a variable into bins of ranges, converting them into categories, and then into numerical data. Let us see how we can achieve this:

In our dataset, we can see:

(a) gender: This is a binary categorical column with values “Male” and “Female”. We can convert this column into numerical data using LabelEncoder as follows:

le = LabelEncoder()
df["gender"] = le.fit_transform(df["gender"])

# 'Female': 0, 'Male': 1 ... labels are encoded alphabetically

(b) Corona result: This is our output column, with three values: “No Risk”, “Less Risk”, and “High Risk”. We convert this column into numerical values using LabelEncoder.

le = LabelEncoder()
df["Corona result"] = le.fit_transform(df["Corona result"])

# 'High Risk': 0, 'Less Risk': 1, 'No Risk': 2
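You can verify this mapping yourself: after fitting, the encoder’s `classes_` attribute holds the labels in sorted order, and the index of each label is its encoded value. A minimal sketch with the same three risk labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["No Risk", "Less Risk", "High Risk", "No Risk"])

# classes_ lists the labels alphabetically; index = encoded value
print(list(le.classes_))  # ['High Risk', 'Less Risk', 'No Risk']
print(list(codes))        # [2, 1, 0, 2]
```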

(c) age and body temperature: These two are continuous values. So, we first group them into bins (categorical data) and then apply one-hot encoding to them. Let us see how:

bins = [0,10,20,30,40,50,60,70,80,100]          # bin edges for the age column
labels = ['a','b','c','d','e','f','g','h','i']  # labels for the bins
# group all age values into the initialised bins
Age = pd.cut(df['age'], bins=bins, labels=labels, right=False)
# add the binned column to the dataset
df["Age"] = Age
# remove the original age column
df = df.drop(['age'], axis=1)

# doing the same for body temperature
bins = [96, 98.6, 102, 110]
labels = ['normal', 'fever', 'high fever']
Temperature = pd.cut(df['body temperature'], bins=bins, labels=labels, right=False)
df["Temperature"] = Temperature
df = df.drop(['body temperature'], axis=1)
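To see exactly how `pd.cut` assigns values to bins, here is a minimal sketch with a few hypothetical ages. Note that `right=False` makes the intervals left-closed, i.e. [0, 10), [10, 20), and so on, so a value equal to a bin edge falls into the higher bin:

```python
import pandas as pd

ages = pd.Series([5, 10, 37, 99])
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 100]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# right=False: intervals are [0,10), [10,20), ... so 10 lands in bin 'b'
binned = pd.cut(ages, bins=bins, labels=labels, right=False)
print(list(binned))  # ['a', 'b', 'd', 'i']
```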

Now, apply one-hot encoding through pandas get_dummies to convert them into numerical values:

# Applying one-hot encoding using pandas get_dummies

df = pd.concat([df, pd.get_dummies(df['Age'], drop_first=True)], axis=1)
df = df.drop(['Age'], axis=1)

df = pd.concat([df, pd.get_dummies(df['Temperature'], drop_first=True)], axis=1)
df = df.drop(['Temperature'], axis=1)
df.head()
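A note on `drop_first=True`: it drops the first category’s dummy column, since that category is implied when all the remaining dummy columns are 0 (this avoids redundant, perfectly correlated columns). A minimal sketch with the Temperature categories (hypothetical values), where `normal` is the first category and gets dropped:

```python
import pandas as pd

# Ordered categories, as produced by pd.cut with explicit labels
s = pd.Series(pd.Categorical(
    ["normal", "fever", "high fever", "normal"],
    categories=["normal", "fever", "high fever"]), name="Temperature")

# drop_first=True drops the 'normal' column; a row with
# fever=0 and high fever=0 is implicitly 'normal'
dummies = pd.get_dummies(s, drop_first=True)
print(list(dummies.columns))  # ['fever', 'high fever']
```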

Now all the wrangling and cleaning of the data is complete, so we can move to the next step.

6. Splitting the Dataset into Train and Test Data

Now the time comes to split the dataset into training data and testing data. But first, we need to split the dataset into X (independent inputs) and y (dependent output).

Our dataset now has the following columns:

['Sno', 'gender', 'Dry Cough', 'sour throat', 'weakness',
'breathing problem', 'drowsiness', 'pain in chest',
'travel history to infected countries', 'diabetes', 'heart disease',
'lung disease', 'stroke or reduced immunity', 'symptoms progressed',
'high blood pressue', 'kidney disease', 'change in appetide',
'Loss of sense of smell', 'Corona result', 'b', 'c', 'd', 'e', 'f', 'g',
'h', 'i', 'fever', 'high fever']

X = ['gender', 'Dry Cough', 'sour throat', 'weakness',
'breathing problem', 'drowsiness', 'pain in chest',
'travel history to infected countries', 'diabetes', 'heart disease',
'lung disease', 'stroke or reduced immunity', 'symptoms progressed',
'high blood pressue', 'kidney disease', 'change in appetide',
'Loss of sense of smell', 'b', 'c', 'd', 'e', 'f', 'g',
'h', 'i', 'fever', 'high fever']

y = ['Corona result']

X = df.drop(['Corona result', 'Sno'], axis=1)
y = df['Corona result']

Now the data needs to be divided into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, test_size=0.20)
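With `test_size=0.20`, 20% of the rows go to the test set and 80% to the training set; `random_state` fixes the shuffle so the split is reproducible. A minimal sketch on hypothetical stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 100 rows, 3 features
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2, test_size=0.20)

# 80 rows for training, 20 for testing
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```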

7. Scaling of Training Data

Here comes the last step of data preprocessing, where we finally scale the input data so that the model weighs every input without being biased towards any one of them.

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)   # reuse the statistics learned from the training data
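The important detail here is that the scaler is fitted on the training data only; the test data is transformed with the same learned mean and standard deviation, so no information leaks from the test set. A minimal sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

ss = StandardScaler()
X_train_s = ss.fit_transform(X_train)  # learn mean/std from training data only
X_test_s = ss.transform(X_test)        # apply those same statistics to test data

# Scaled training features have mean 0 per column
print(X_train_s.mean(axis=0))
```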

For reference, see my GitHub repository.

So, this was all about the practical implementation of data preprocessing on the COVID-19 dataset. I hope it was an informative blog for you. Thank you!
