Chapter 2 : Data Preprocessing in Python and R (Part 01)

Yashithi Dharmawimala
Machine Learning for beginners
6 min read · Nov 2, 2020

Why is Data Preprocessing Essential?

So why not just jump straight into training a machine learning model? Why is data preprocessing important?

Here’s why! Data preprocessing is one of the most important steps in Machine Learning, because the quality of the data we feed into a model directly affects its ability to learn.

For instance, in my previous post, I took an example data set that mapped a tumour being malignant or benign based on the tumour size and the age of a person. Suppose that in one instance, one of these parameters was missing. How would that missing value affect the function line? If you think about it, a single missing value could completely change the function line, making its predictions far less accurate.

So it’s quite evident that data preprocessing is crucial when it comes to Machine Learning!

Prerequisites

You must have RStudio and/or Python installed to follow the coding examples in this post. I personally use Spyder for Python. Here are some video links that you can follow in order to install these (Windows 10):

[The data set and the code included in these posts can be accessed on GitHub]

Dependent and Independent Variables

Before we start with data preprocessing, there’s a simple yet important concept of dependent and independent variables that we must know. Consider the data set given below :

Here we can observe that whether an item is purchased depends on 3 factors: Country, Age and Salary. Therefore we can say that Country, Age and Salary are independent variables, while Item Purchased is the dependent variable (as it depends on the independent variables mentioned above).

Now that we’ve got that covered, let’s step into some coding, shall we?

Importing Libraries

So what exactly is a library? A library is a tool that can be used to do specific jobs. It’s basically a collection of related pieces of code that have been bundled together so we can reuse them. Libraries make our lives much easier because all we have to do is give them some input and we receive the output we need! Yeah, simple as that!

Before going to the data preprocessing sections let’s import some useful libraries!

Python :

In Python we will mainly need 3 libraries :

#Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
  • numpy is the library that contains the mathematical tools we will need, and for ease we gave it the shortcut name ‘np’.
  • The second library is matplotlib, of which we only need the sublibrary pyplot, named ‘plt’. This library will be used mostly to plot graphs.
  • The third and final library is pandas, named ‘pd’. We use this library to import and manage data sets.
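
If one of these imports fails with a ModuleNotFoundError, the library is most likely not installed in your Python environment. Here’s a small sketch, using only Python’s standard library, that checks whether the three libraries are available before you import them. The helper name check_libraries is my own and is not part of any of these packages.

```python
from importlib.util import find_spec


def check_libraries(names=("numpy", "matplotlib", "pandas")):
    """Return a dict mapping each library name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one line per library so a missing one is easy to spot
    for name, installed in check_libraries().items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing has to be installed first (for example through your Python distribution’s package manager) before the import lines above will run.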

R :

For the time being, we don't have to import any libraries, as the functions we need here are already available in R by default!

Setting the Working Directory

Python :

Before we import the data set we must set a working directory. The folder set as the working directory must be the folder which contains the dataset. Here’s how you can do that :

On the right of your screen you will find ‘File explorer’ as shown in the figure below:

Click ‘File Explorer’ and select the folder in which you have saved your data set. Then click the icon on the right as shown below to set that folder as your working directory.

If that does not work, you can save your Python file in the same folder as the dataset and run the file, which will also automatically set that folder as your working directory.
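
If you’d rather not rely on the Spyder interface, the working directory can also be set from code. Here’s a minimal sketch using Python’s built-in os and pathlib modules. The function name set_working_directory and the example folder path are placeholders of my own, so replace the path with the folder that holds your dataset.

```python
import os
from pathlib import Path


def set_working_directory(folder):
    """Switch the working directory to `folder` and return the new location."""
    path = Path(folder).expanduser().resolve()
    if not path.is_dir():
        raise NotADirectoryError(f"Not a folder: {path}")
    os.chdir(path)
    return Path.cwd()


# Example usage (hypothetical path, change it to your own folder):
# set_working_directory("C:/Users/you/Documents/ml-data")
# print(os.getcwd())
```

Once this has run, a relative file name such as Data.csv will be looked up inside that folder.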

R :

Open up RStudio and select your folder from ‘Files’ at the right of your screen as shown below:

Next, select ‘More’ and then ‘Set As Working Directory’ as follows, and you're good to go!

Importing the Dataset

Python :

#Importing the dataset
dataset = pd.read_csv('Data.csv')

Here the contents of the Data.csv file are read by the pandas library and assigned to the variable ‘dataset’. Now select all those lines of code (including the importing library code) and press ‘ctrl + alt + ENTER’ to execute. Then you can verify that the code has run correctly from the console at the bottom right. Now click ‘Variable Explorer’ and then select ‘dataset’. By now you should be able to view your dataset as follows :
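
You can also check the import from code instead of the Variable Explorer. The sketch below is only an illustration: so that it runs on its own, it reads a few made-up rows from a string rather than from the real Data.csv, and the column names are assumed from the description of the data set earlier in this post.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Data.csv (the values are made up)
sample_csv = io.StringIO(
    "Country,Age,Salary,Item Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
)

dataset = pd.read_csv(sample_csv)

print(dataset.head())   # the first rows of the table
print(dataset.shape)    # (number of rows, number of columns)
print(dataset.dtypes)   # the data type pandas chose for each column
```

With your own file, you would pass 'Data.csv' to pd.read_csv instead of sample_csv.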

(Note that in Python the index starts from 0, whereas in R the index starts from 1)

As you can see, the salary column is hard to read in its default format, so let’s change that! Click the ‘Format’ button in the left corner and change the float formatting from %.3g to %.0f. The values are then displayed as plain whole numbers.

In machine learning, we work with our dataset as matrices, because we have to separate the matrix of features (the independent variables) from the dependent variable vector. So first let’s create a matrix for our independent variables.

As discussed above, we already know that Country, Age and Salary are independent, so let’s put these values into a matrix, ‘X’, and the Item Purchased column into a vector, ‘Y’. This is how we can do that :
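
The ‘Format’ button only changes how the Variable Explorer displays the numbers. If you print a data set in the console instead, pandas has a similar display setting. Here’s a rough sketch of that idea, with made-up salary values. It changes only how the floats are shown, not the stored data.

```python
import pandas as pd

# Made-up salaries stored as floats, for illustration only
salaries = pd.DataFrame({"Salary": [72000.0, 48000.0, 54000.0]})

# Text produced with the default float formatting
default_text = salaries.to_string()

# Text produced while floats are temporarily shown with no decimal places
with pd.option_context("display.float_format", "{:.0f}".format):
    rounded_text = salaries.to_string()

print(default_text)
print(rounded_text)
```

Using pd.option_context keeps the change temporary. To keep it for the whole session, pd.set_option can be called with the same two arguments.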

X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

For those of you who are new to this Python syntax, here’s a brief explanation of that code. For X, the colon to the left of the comma means that we take all the rows in our dataset, and the colon followed by minus one to the right of the comma means that we take all the columns except the last one. ‘.values’ returns the selected data as an array of values instead of a pandas table.

If you figured out the pattern, you can see that the Y variable takes all the rows and only the column at index 3. Since Python starts counting from 0, index 3 is the fourth (and last) column of the dataset, which holds our dependent variable.

After you run this code you can always type X and/or Y in the console to make sure you have extracted the columns correctly as shown below:
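
To see this slicing in action without the real file, here’s a small sketch on a made-up table with the same four columns. The values are invented for illustration and the column names are assumed from the description of the data set above.

```python
import pandas as pd

# A tiny made-up data set with the same column layout as described above
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Item Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
Y = dataset.iloc[:, 3].values     # all rows, only the column at index 3

print(X.shape)   # 3 rows and 3 feature columns
print(Y.shape)   # 3 values, one per row

# Index 3 and index -1 point at the same (last) column in this table
print((Y == dataset.iloc[:, -1].values).all())
```

Writing -1 instead of 3 for Y would keep the code working even if more feature columns were added before the last one.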

Now let’s move on to see how this is done in R!

R :

The code for doing this is pretty simple and quite self-explanatory :

#Importing dataset
dataset = read.csv('Data.csv')

Then you can run this code and view the dataset by clicking ‘dataset’ on the right of your screen as follows :

(Note that, unlike in Python, the index starts at 1)

GOOD NEWS! You don't have to put your dataset into matrices as we did in Python.

Feeling overwhelmed?? I know I did when I first started out. So take a break and check out the next blog post, ‘Data Preprocessing in Python and R (Part 02)’, to learn more about data preprocessing!
