randomforest in rpart — part 1: getting some data
A little bit of housekeeping:
- I do all my R work in RStudio. It’s not the greatest IDE in the world but it’s quite simple and easy to use
- I’m assuming anyone reading this has some basic R knowledge. If you want to learn R there are some really good resources that are only a quick Google away including Hadley Wickham’s R for Data Science
- If you need more information on any of the functions I use, R help is really easy to get to. Simply type ?functionname into the console and an explanation and some examples will appear
- I will put the code on GitHub once it is in a more complete state
The first part of this magical journey involves picking a dataset and a problem to solve. I’m in the enviable position of picking both the data and the problem, so I’ve decided to use the diamonds dataset.
The diamonds dataset is included in the package ggplot2 and is explained here. A quick look at the structure of the data is below:
# load ggplot2, which provides the diamonds dataset
library(ggplot2)
# get the data: it comes as a tbl_df so make it a normal data.frame
data <- as.data.frame(diamonds)
str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Using str() we can see we have some numeric data and some ordered categorical data. The R function summary() is also useful when beginning to look at a data set.
In my first iteration I want to keep things simple and solve a binary classification problem. There are no binary columns in this data set so I’m going to create a new derived variable called is.premium. This column will be 1 if the observation of the cut variable is “Premium”, otherwise it will be 0.
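As a rough illustration of what summary() gives you on this data, here is a minimal sketch. It assumes ggplot2 is installed and reuses the same data object created earlier; the choice of price and cut as single-column examples is mine.

```r
library(ggplot2)  # provides the diamonds dataset

data <- as.data.frame(diamonds)

# numeric columns get min, quartiles, median, mean and max;
# factor columns get counts per level (rarer levels may be lumped into "(Other)")
summary(data)

# summary() also works on a single column
summary(data$price)  # numeric: six-number summary
summary(data$cut)    # factor: one count per level
```

Where str() tells you what type each column is, summary() gives a first feel for the values inside them.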
# convert to a binary classification problem — ‘is.premium’
data$is.premium <- ifelse(data$cut == "Premium", 1, 0)
table(data$is.premium)
A quick table of the new variable shows that is.premium has 13791 1s, with the remaining observations being 0. This (thankfully) matches the 13791 observations with a cut of “Premium”.
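If you would rather not compare the counts by eye, the check can be done in code. This is only a sketch of one way to do it, assuming ggplot2 is installed; the cross-tabulation and the stopifnot() assertion are my additions rather than part of the original workflow.

```r
library(ggplot2)  # provides the diamonds dataset

data <- as.data.frame(diamonds)
data$is.premium <- ifelse(data$cut == "Premium", 1, 0)

# cross-tabulate the original column against the derived flag:
# every "Premium" row should fall in the 1 column, all other cuts in the 0 column
table(data$cut, data$is.premium)

# the same check as a hard assertion, so a mismatch stops the script
stopifnot(sum(data$is.premium) == sum(data$cut == "Premium"))
```

The cross-tabulation is handy while exploring, and the assertion is the sort of thing worth leaving in a script so the check is repeated every time the data is rebuilt.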
This is where I leave part 1.
