TIL: Preprocessing Data in R for ML

Warning: I don’t have any formal training in machine learning; I’m just getting a sneak peek before I most likely take an introductory course when I go back to school. I worked through some very basic applied ML a while ago and needed to refresh myself before continuing, so I thought I would write something up.

How do you get started preprocessing data for machine learning?

The first thing you must do is set your current working directory. This is the directory where your .R and .csv files live. CSV stands for comma-separated values, and a CSV file is a common format for storing data. If you are using RStudio, there is a Files tab (in the bottom-right pane by default) where you can navigate to a folder and set it as your working directory. The code below reads a CSV file and then selects columns 2 through 6 (maybe you are only interested in part of the data).

my_dataset = read.csv('name_of_file.csv')
my_dataset = my_dataset[2:6]
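To make the column-selection step concrete without needing a CSV file on disk, here is a minimal sketch using a small made-up data frame (the column names are hypothetical) standing in for the result of read.csv():

```r
# Hypothetical stand-in for a CSV that was read with read.csv()
my_dataset <- data.frame(
  id        = 1:4,
  points    = c(101, 95, 88, 110),
  rebounds  = c(40, 33, 45, 38),
  assists   = c(25, 20, 18, 30),
  turnovers = c(12, 15, 9, 14),
  WinOrLose = c('win', 'lose', 'lose', 'win')
)

# Keep only columns 2 through 6, dropping the id column
my_dataset <- my_dataset[2:6]

str(my_dataset)   # inspect the remaining column names and types
ncol(my_dataset)  # 5 columns remain
```

str() is a handy way to double-check that the columns you kept are the ones you meant to keep.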

Let’s say you wanted to predict whether a team will win or lose based off some data, and you have a WinOrLose column whose values are ‘win’ or ‘lose’. In order to work with this, we need to encode these values numerically. In other words, it would be nice if instead of win or lose we had 1 or 0. The factor( ) function takes in the column you want to encode. The levels parameter names the values that already exist in the column, and the labels parameter supplies the new codes; the two are matched by position. The c( ) function combines the values passed in to create a vector. Be careful with the ordering: here ‘lose’ is listed first, so it pairs with 0, and ‘win’ pairs with 1.

my_dataset$WinOrLose = factor(my_dataset$WinOrLose, levels = c('lose', 'win'), labels = c(0, 1))
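Here is a self-contained sketch of that conversion on a made-up vector. Note that both levels and labels are needed: levels names the strings already in the data, and labels supplies the replacement codes (passing only levels = c(0, 1) would turn every value into NA, since 0 and 1 don’t appear in the original column):

```r
# Hypothetical win/lose column
win_or_lose <- c('win', 'lose', 'lose', 'win')

# levels names the existing values; labels supplies the codes, matched by position
encoded <- factor(win_or_lose, levels = c('lose', 'win'), labels = c(0, 1))

print(encoded)  # the values are now 1 0 0 1, with levels 0 and 1
```

One caveat: the result is still a factor, not a numeric vector. If a classifier later needs plain numbers, as.integer(as.character(encoded)) recovers the 0/1 values.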

Now that you have the data, you want to create a training set and a test set. The training set is part of supervised learning, and it is used to fit the parameters of the classifier. The classifier is the algorithm that sorts the data into categories, so to speak. A model is fitted to the training set using the classifier, and this model is then tested against an independent test set.

There is a package in R called ‘caTools’ (install it with install.packages('caTools') and load it with library(caTools)) that has a function that splits data into a training and test set. The function sample.split() takes in a Y vector and a split ratio. In the example below I am splitting my_dataset such that, for each value of WinOrLose (some column in my data), 3/4 of the rows make up the training set and 1/4 of the rows make up the test set. I won’t go into detail as to how to decide how large your training set should be, partly because I don’t fully understand it beyond the fact that you typically need a rather large training set… I think. For more information about how to use a given function, you can hit F1 to inspect the function and learn about what it does and what parameters it takes in.

Next, we need to assign the subsets of what we split to the training set and the test set. The subset( ) function lets us pass in some data and, essentially, a selector. It says: take the 75% of the rows that the split marked TRUE and assign them to the training set, then take the rest and assign them to the test set.

library(caTools)  # provides sample.split()

split = sample.split(my_dataset$WinOrLose, SplitRatio = 0.75)
training_set = subset(my_dataset, split == TRUE)
test_set = subset(my_dataset, split == FALSE)
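The whole split can be sketched end to end on a tiny made-up data frame (the column values are hypothetical; this assumes the caTools package is installed). sample.split() stratifies by the Y vector, so with 4 wins and 4 losses and a 0.75 ratio, 3 of each class land in the training set:

```r
# Assumes caTools is installed: install.packages('caTools')
library(caTools)

set.seed(123)  # makes the random split reproducible

# Hypothetical data: 8 rows with an encoded win/lose outcome
my_dataset <- data.frame(
  points    = c(101, 95, 88, 110, 99, 102, 91, 85),
  WinOrLose = factor(c(1, 0, 0, 1, 1, 1, 0, 0))
)

# TRUE marks rows assigned to the training set (75% of each class)
split <- sample.split(my_dataset$WinOrLose, SplitRatio = 0.75)

training_set <- subset(my_dataset, split == TRUE)
test_set     <- subset(my_dataset, split == FALSE)

nrow(training_set)  # 6 rows
nrow(test_set)      # 2 rows
```

set.seed() isn’t required, but without it the split lands on different rows every run, which makes results hard to reproduce.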

Sometimes the values in your data vary widely in scale, and this can be an issue for many machine learning classifiers. In order to fix this problem we need to normalize the data, also known as ‘feature scaling’. The function scale( ), which is part of base R (you don’t need to install anything), will help us do this. scale( ) takes in a numeric matrix, so you pass in the columns that need to be normalized. In the example below, the negative index [-3] means “every column except the third”, which keeps the third column (here, the encoded outcome, which isn’t numeric) out of the scaling. We also need to make sure we scale both the training set and the test set.

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
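To see what scale( ) actually does, here is a small sketch on made-up numeric columns with wildly different ranges. With its default arguments, scale( ) centers each column to mean 0 and divides by the standard deviation, so every column ends up on a comparable scale:

```r
# Hypothetical numeric columns with very different ranges
m <- data.frame(
  points = c(101, 95, 88, 110),
  salary = c(2000000, 850000, 1200000, 3100000)
)

scaled <- scale(m)  # center each column to mean 0, sd 1

colMeans(scaled)      # approximately 0 for both columns
apply(scaled, 2, sd)  # 1 for both columns
```

Note that scale( ) returns a matrix, which is why assigning its result back into a data frame slice (as in the training/test code above) works column by column.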

You are now ready to create a classifier. In the next post, I will talk about building a simple classifier, making a prediction, and creating a confusion matrix. I might make a similar post on how to do the same thing in Python. After doing both, I thought R was much easier to set up. Hope this was helpful.