Chapter 2 : Data Preprocessing in Python and R (Part 02)

Yashithi Dharmawimala
Machine Learning for beginners
7 min read · Nov 2, 2020


In the previous blog post, we walked through importing libraries and datasets in both R and Python, and in this post we will look at data preprocessing itself. Yes! Finally!

If you haven't read the previous post yet, I suggest you read it before proceeding with this article! [Chapter 2 : Data Preprocessing in Python and R (Part 01)]

Missing Data

Let’s now look at how we can handle the issue of missing data in our dataset. Here is the dataset we used earlier:

Here we can see that there are two data elements missing: one in the Age column and one in the Salary column. So how exactly can we handle this problem?

One way you might think of is completely removing the rows containing missing data. However, this is highly discouraged, as these rows may contain crucial information that must be considered.

Instead, here are some ways we can address this issue:

1) Replace missing data with the mean of the column

Let’s get coding!

Python :

To do this task we will be using a library here as well :

# Missing Data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In the above code, you can see that we use the preprocessing library available in sklearn, and from this library we use the Imputer class, which allows us to take care of missing data. Now that we have the class, we create an object (imputer) from it.

For learning purposes, you can press ‘ctrl + i’ and check the documentation for the Imputer class to obtain the parameters.

Since python represents missing data as ‘NaN’ (refer to the dataset), we set the ‘missing_values’ parameter to ‘NaN’. The strategy we are going to use here is replacing the missing value with the mean of the column, therefore we assign ‘mean’ to the ‘strategy’ parameter. (Note that ‘mean’ is the default strategy, so mentioning it explicitly is not strictly necessary.) The final parameter we need is the axis: axis = 0 represents the columns and axis = 1 represents the rows. Since we need the mean of the columns, we set axis = 0.

Next, we have to fit our imputer object to our matrix X containing the data, which is what the third line of code does. However, we are not going to fit the imputer on all the columns, only on the columns which contain missing data, i.e. Age and Salary. The expression X[:, 1:3] selects all rows and the columns with index 1 and 2 (the upper bound is excluded).

Lastly, we assign this output to our matrix by the code : X[:, 1:3] = imputer.transform(X[:, 1:3])

And you are done! You have now gotten rid of missing data in python! You can type X in the console to verify that this has been done accurately, as follows:

Missing data have been replaced with the mean in Python

Other ways to do this are replacing missing data with the median or with the most frequent value of the column. This can easily be done by referring to the documentation of the Imputer class.
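A note for readers on current versions of scikit-learn: the Imputer class used above was removed in scikit-learn 0.22, and its replacement is SimpleImputer (which has no axis parameter, always works column-wise, and expects missing values as np.nan rather than the string 'NaN'). A minimal sketch on a toy two-column array shaped like the Age/Salary columns:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: two numeric columns (Age, Salary) with one missing value each
X = np.array([[44.0, 72000.0],
              [27.0, np.nan],
              [np.nan, 54000.0],
              [38.0, 61000.0]])

# strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)  # fit computes the column means, transform fills the gaps

print(X)
```

Here the missing Age becomes the mean of the observed ages and the missing Salary the mean of the observed salaries.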

R :

# Missing Data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

Unlike in python, this takes only two statements! However, here we have to write one statement for each column with missing data.

The first statement: is.na() is a function that tells whether a value is missing. We first check that inside the ifelse statement, which returns TRUE if the value in the column is missing and FALSE otherwise. The second argument of ifelse is the value that should be returned if the condition is true; here we want the average (mean) of the column, which is given by ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)). So the cell is replaced with the column mean if it is empty. If the value is not missing, the third argument returns the existing value.

The second statement does the same for the Salary column.

After executing the above lines of code we can observe the following :

Missing data have been replaced with the mean in R
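(As an aside for readers following along in Python: the same fill-with-the-column-mean pattern that the R ifelse/ave code expresses can be written very compactly with pandas, using fillna. A sketch on a hypothetical frame mirroring the Age/Salary columns:)

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the Age and Salary columns of the post's dataset
df = pd.DataFrame({'Age': [44.0, np.nan, 38.0],
                   'Salary': [72000.0, 54000.0, np.nan]})

# Replace each missing value with the mean of its own column,
# the same effect as the R ifelse/ave pattern above
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)
```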

Categorical Data

In our dataset, we can see that Country and Item Purchased are categorical: the Country column contains Sri Lanka, India, America and France, and the Item Purchased column contains Yes and No. Since machine learning deals with equations and numbers, it’s important that we encode categorical data numerically.

Now let’s see how to code this using python and R.

Python :

# Encoding Categorical Data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

As earlier, we import the LabelEncoder class from the sklearn preprocessing library and make an object from that class. In the third line of code, we fit the labelencoder_X object to the first column of our matrix X, which returns the first column (Country) of the matrix X encoded.

Once these lines of code are executed you will be able to observe that the country names have been replaced by 0, 1, 2 and 3 as follows:

However, the issue with this is that the machine learning algorithm could think that there is a hierarchy among the countries. For example, it could think that Sri Lanka is greater than America (3>0), whereas in real life there is no such thing.

In order to fix this issue, we use the dummy encoding method: instead of assigning numbers to the countries, we break them up into separate columns.

Dummy Encoding

Therefore, in our example, we will break the Country column into 4 columns, each representing one country. For example, America will have a 1 in its own column and zeros in the rest of the columns.
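(To see what this expansion looks like before we wire it up in scikit-learn, here is a quick side illustration using pandas, which is not part of this post's main pipeline: pd.get_dummies performs exactly this one-column-per-category split.)

```python
import pandas as pd

# Hypothetical Country column matching the post's example
df = pd.DataFrame({'Country': ['Sri Lanka', 'India', 'America', 'France']})

# One 0/1 column per country; each row has exactly one 1
dummies = pd.get_dummies(df['Country'])
print(dummies)
```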

To do this in python we have to use another class as follows:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

The first line of code imports the OneHotEncoder (the class responsible for dummy encoding). In the second line, we create an OneHotEncoder object for the Country column, which is specified using categorical_features=[0]. As earlier, we use fit_transform to change our X matrix.

By executing the above code we receive the output shown below (open it using the variable explorer and set the format to %.0f for better readability):

Dummy Encoding using python

Earlier you saw that Sri Lanka was represented as 3, and after dummy encoding, we can observe that the column corresponding to Sri Lanka contains 1 while the rest of the country columns contain 0.

Since the Item Purchased column is the dependent variable, the machine learning algorithm knows there is no such hierarchy among its values; therefore, we can use just the LabelEncoder, as shown in the above code segment.

You can observe the following output after executing that code :
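A note for readers on current versions of scikit-learn: the categorical_features parameter shown above was removed from OneHotEncoder in scikit-learn 0.22. In current versions, the column to encode is selected with a ColumnTransformer instead. A sketch on a toy matrix with the same Country/Age/Salary shape as this post's X (the column values are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy matrix: Country, Age, Salary (only Country needs dummy encoding)
X = np.array([['Sri Lanka', 44.0, 72000.0],
              ['India', 27.0, 48000.0],
              ['America', 38.0, 61000.0]], dtype=object)

# Dummy-encode column 0 and pass the numeric columns through untouched;
# sparse_threshold=0 forces a dense array, like .toarray() in the old code
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0)
X = ct.fit_transform(X)

print(X)  # the first columns are the country dummies, in alphabetical order
```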

R :

In R this process is quite simple, as R provides the factor function, which transforms your categorical variables into numeric categories. This can be done with the following code:

# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('America', 'France', 'India', 'Sri Lanka'),
                         labels = c(1, 2, 3, 4))
dataset$Item.Purchased = factor(dataset$Item.Purchased,
                                levels = c('No', 'Yes'),
                                labels = c(0, 1))

In the code, you can clearly see that the 4 countries are mapped to their corresponding labels and the same is done for the Item Purchased column. Executing the above lines of code will result in the following:

Output after handling categorical data in R
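(One last aside for Python users: the closest pandas analogue to R's factor is the Categorical dtype, which likewise lets you fix an explicit level order and read off integer codes. The column below is hypothetical, matching the post's example:)

```python
import pandas as pd

# Hypothetical Country column matching the post's example
country = pd.Series(['Sri Lanka', 'India', 'America', 'France'])

# Like factor(levels = ...), an explicit level order can be given;
# .codes then gives the integer category per value
cat = pd.Categorical(country,
                     categories=['America', 'France', 'India', 'Sri Lanka'])

print(list(cat.codes))
```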

Congratulations! You now know how to handle missing data and categorical data in both python and R. However, we are not done yet! There’s one more post to follow to grasp the entire concept of data preprocessing.

Hang tight, we are almost there!

