Chapter 2: Data Preprocessing in Python and R (Part 03)

Yashithi Dharmawimala
Machine Learning for beginners
5 min read · Nov 2, 2020

We are at the last stage of preparing our dataset for the real fun to begin!

If you haven't read the earlier posts, I highly recommend that you go back and catch up on them before you proceed!

Splitting the Dataset into the Training Set and Test Set

Why do we have to split our data set into two?

Here’s why! Suppose you are taught some course material in class. The best way to become thorough with the material is by taking a test: you write answers to questions and verify whether you are correct. If you are wrong, you learn from your mistake, which makes you a better student than you were before.

The same method applies to your ML algorithm. It initially learns using the training set and is tested using the test set.

Let’s see how to do this in python and R.

Python :

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

As earlier, we import train_test_split, this time from sklearn.model_selection (older scikit-learn versions exposed it in sklearn.cross_validation, which has since been removed). As the name suggests, this function breaks the dataset into a training set and a test set.

The second line of code defines the arrays X_train (training set of the X matrix), Y_train (corresponding training set of the Y vector), X_test (test set of X) and Y_test (corresponding test set of Y). Here the test_size argument is the fraction of the data that will be used for the test set (0.2 = 20%). Usually, it's recommended to use 0.2, 0.25 or 0.3 for the test size.

After executing this code you can observe that your data set is now split into 4 as follows:

Training and Test Data Sets in Python
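To see the split in action, here is a minimal sketch on a toy dataset (the X and Y below are made-up stand-ins, not the article's dataset):

```python
# Toy demonstration of train_test_split with an 80/20 split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)  # 10 rows, 1 feature
Y = np.arange(10)                 # 10 corresponding labels

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

print(X_train.shape)  # (8, 1) -> 80% of the rows
print(X_test.shape)   # (2, 1) -> 20% of the rows
```

Because random_state is fixed, the same rows land in the test set on every run, which is what makes results reproducible.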

R :

# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Item.Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

In R we have to install a library called ‘caTools’ for the splitting purpose. After installing this package we must load the library. This can be done manually by ticking ‘caTools’ in the Packages pane or by executing the library(caTools) line. set.seed(123) makes sure that everyone executing this code gets the same output.

Then we execute the sample.split line. The parameters of the split function are the dependent (‘Y’) vector, which is dataset$Item.Purchased here, and the split ratio. Note that the split ratio is for the training set (not the test set), therefore it is set to 0.8.

If you type split in the console you can see that sample.split returns a vector of TRUE and FALSE values as follows:

Here, the rows that are labelled as TRUE belong to the training set and the rows that are labelled as FALSE belong to the test set, which can be extracted by executing the remaining lines of code. Execute all the above code and verify that this is the output you receive:

Test Set in R
Training Set in R

Feature Scaling

In our dataset, we can see that there is a huge difference in scale between the Salary column and the Age column. Since most machine learning algorithms work with raw magnitudes and distances rather than the real-world meaning of the attributes, the values in the Age column will be dominated by the much larger values in the Salary column. Due to this issue, we apply feature scaling so that all these features will be on the same scale.
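A quick sketch of the problem, using made-up Age and Salary values: in a Euclidean distance, the squared salary difference swamps the squared age difference.

```python
# Why unscaled features let Salary dominate Age in a distance computation.
import numpy as np

a = np.array([25.0, 50000.0])  # [age, salary] of person A (toy values)
b = np.array([45.0, 52000.0])  # [age, salary] of person B (toy values)

diff = a - b
print(diff ** 2)  # age contributes 400, salary contributes 4,000,000
```

A 20-year age gap contributes 10,000 times less to the distance than a modest salary gap, even though the age difference may matter far more.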

There are two common ways in which we can do feature scaling, standardisation and normalisation:

Methods of feature scaling
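Assuming the figure above shows standardisation and min-max normalisation, here is a small sketch of both on toy values, using scikit-learn's StandardScaler and MinMaxScaler:

```python
# Standardisation (mean 0, std 1) vs min-max normalisation (range [0, 1]).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25.0, 50000.0],
              [35.0, 60000.0],
              [45.0, 70000.0]])  # toy [age, salary] rows

standardised = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
normalised = MinMaxScaler().fit_transform(X)      # each column squashed into [0, 1]

print(standardised[:, 0])  # roughly [-1.22, 0, 1.22]
print(normalised[:, 0])    # [0, 0.5, 1]
```

After either transformation the Age and Salary columns are on comparable scales, so neither dominates.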

Let’s see how we can do this in python and R.

Python :

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

As earlier, we have to import the relevant class and use it to do the feature scaling as shown above. Standardisation gives every column a mean of 0 and a standard deviation of 1, so most of the values end up in a small range around zero, as follows:

Note that we have applied feature scaling to the dummy-encoded columns as well; however, we refrain from scaling our Y vector, as it contains only zeros and ones.
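Also note that we call fit_transform on the training set but only transform on the test set: the scaler must learn its mean and standard deviation from the training data alone, and then reuse those statistics on the test data. A minimal sketch with toy numbers:

```python
# fit_transform learns statistics from the training set;
# transform reuses them on the test set.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean=2 from training data
X_test_scaled = sc.transform(X_test)        # applies the training mean/std

print(sc.mean_)        # [2.]
print(X_test_scaled)   # [[0.]] -- scaled with the training statistics
```

Fitting a second scaler on the test set would leak information about the test data into the preprocessing step.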

R :

#Feature Scaling
training_set=scale(training_set)
test_set=scale(test_set)

The above code will do the feature scaling for us.

Wait? What! Did you get an error?

Yeah well, I got that too…..

Here’s why! The error says that ‘x’ must be numeric, and by ‘x’ it means the training_set and test_set. You may think that they contain only numbers, but the columns we encoded with factor() are not considered numeric in R, and scale() only accepts numeric data. Due to this issue, we refrain from applying feature scaling to dummy-encoded data in R.

Try executing this piece of code!

#Feature Scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

Here we select only the two numeric columns that actually need feature scaling. Your output should look something like this:

Feature Scaling done for the training_set in R
Feature Scaling done for the test_set in R

AND VOILA!

You're officially done with data preprocessing. Using this knowledge you can prepare absolutely any data set that you get!

Congratulations on getting this far! See you in my next blog post to learn the fun parts of machine learning!
