How to use different algorithms using Caret package in R.

Published in

Coinmonks

12 min readJul 10, 2018

Getting started with the introduction:

Hello fellow readers, this is my first article so please bear with me. There will be errors in grammars (not the code) so apologies in advance. So I’ll be working on House Price Data Set which is a competition in kaggle and apply the caret package in R to apply different algorithms instead of different packages, also applying hyper-parameter tuning.

Competition Description:

The kaggle description on the dataset states as follows:

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Problem Introduction:

So the dataset for this competition has around 76 columns with it’s respective missing values cause let’s face it data set without missing values is like life without soul.
The problem we will tackle is predicting the Sales Price of the resiential homes in Ames, Iowa. We have to use regression techniques to predict the SalePrice of the property. This is a Supervised Regression Machine Learning Problem. It’s supervised because we have both the features(data for the House Price) and the target (SalePrice) that we want to predict.
Now for regression problems we can use variety of algorithms such as Linear Regression, Random Forest, kNN etc. In R we have different packages for all these algorithms.
The general idea of this article is “why use different packages for different algorithms when you have one for all ?”.
CARET package contains more than 175 algorithms to work with. Now instead of trying to remember different packages for different algorithms caret allows you to use 1 simple function to create all your algorithms.
Sounds pretty simple, eh?
Well it’s not that simple. But we’ll look into it later.

Roadmap:

Before jumping straight into coding. Let’s keep some guidelines on how we’ll be approaching the problem statement.
1. State the problem statement.
2. Acquiring the data in accessible state.
3. Identifying the missing values and anomalies.
4. Preparing the data for machine learning algorithms.
5. Train the model for different algorithms in caret.
6. Predicting the output with respective models.
7. Submitting the output in kaggle. ;p
Step 1 is ticked off. We have the question “Predicting the Sale Price of the properties based on the data given”.

Data Acquisition:

Well to start with the problem we do need the data. Generally most of the time spent is cleaning the data and in exploring the data as to get the relations between columns and whether we need to make new columns out of the existing ones.
We will not be going too deep into the exploring part as the main theme of this article is on how to implement caret package. But I’ll post a new article if you guys need a base on how to approach exploring datasets. Comment if you need it.
Getting back on the topic. The data is available in kaggle for download. The file format is csv (comma seperated values). This is the general file format while working in data.
The following code loads the data in RStudio, and displays the structure of the data.

# Set the working directory (where you have saved the file)
setwd("E:/Kaggle/Housing Price/")# Read the train and test data
train <- read.csv("train.csv")
test <- read.csv("test.csv")

Data Description:

Here’s a brief version of what you’ll find in the data description file.

SalePrice — the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

Yeah I know it’s a long list. But don’t worry I did tell you that we won’t be doing data exploratory. Actually we won’t be doing exploring of any kind. We’ll simply let PCA deal with the hard work.

PCA? Read this article by Matt Brems. It covers the theoretical aspects of PCA. I’ll do the coding part.

Identify Anomalies/Missing Data:

The dimensions of the training data is 1460x81, and that of the test data is 1459x80. While going through the data in House Price (I didn’t exactly go through the data just wrote a bunch of nerdy code) I realised there were few missing data which is a great reminder of the fact that nothing is perfect in this world.
Missing Data can impact your analysis and machine learning model very drastically. For an introduction of how to handle missing value go through this article.

# Getting the columns with more than 20% of missing values.
miss <- colSums(is.na(train))
miss_prec <- miss/nrow(train)*100
print(miss_prec[miss_prec > 20])col_miss <- names(miss_prec[miss_prec>20])

We see that there are 5 columns: “Alley”, “FireplaceQu”, “PoolQC”, “Fence”, “MiscFeature” which have more than 20% missing data.
Now the thing with missing values is that it is a good practice to impute them with reasonable values but if we explicitly impute values of our choice we may be manipulating the data to our preference and are bound to get wrong model. We should resort to removing the columns containing missing values more than a certain threshold. I keep it till 20%.

# Removing the columns with more than 20% missing value in both train and test data.
train[,c(col_miss)] <- NULL
test[,c(col_miss)] <- NULL

Now that we’ve removed the columns containing more than 20% of the missing values it is time to impute values for the other columns.

Also it was noted that Utilities in train and test data set had a very unique feature. Apparently Utilities in train data set had 2 values: AllPub and NoSeWa. But when one looks in the uniqueness of the Utilities in test data we find: AllPub.

sd <- names(which(sapply(test,is.factor)))
f <- c()
for(i in sd){
  f <- c(f, length(levels(test[,i])))
}
print(sd[which(f == 1)])train$Utilities <- NULL
test$Utilities <- NULL

I’ll be taking the help of mice package to impute the missing values using random forest method.

# Importing the library
library(mice)temptrain <- mice(train, m = 5, method = "rf", maxit = 10)
trainnew <- complete(temptrain, 1)temptest <- mice(test, m = 5, method = "rf", maxit = 10)
testnew <- complete(temptest, 1)

Data Preparation:

One Hot Encoding:

Now that we’ve handled the missing values we will merge the two datasets to create dummy variables.
Why you say? Well you see this is after all a regression problem and if we don’t have continuous or discreet values in the data set it would create a bit of a problem. And we don’t need to complicate stuffs, do we?

housedum <- rbind(trainnew %>% select(-SalePrice), testnew)

Now that the data sets are merged let’s create dummy variables. It ain’t that difficult in R. We will use the dummyVars method of the caret package.

library(caret)housedumnew <-  dummyVars(~., data = housedum)
housedumpred <- predict(housedumnew,housedum)

Now separating the training and testing data sets after creating the dummy variables.

trainnewpred <- housedumpred[1:nrow(trainnew),]
testnewpred <- housedumpred[-(1:nrow(trainnew)),]

PCA:

Hopefully you have a general idea on how PCA works. Even if you don’t, look up here for a theoretical aspect of PCA.

prin_train <- prcomp(trainnewpred, scale. = T)
std_dev <- prin_train$sdev
pr_var <- std_dev^2prop_varex <- pr_var/sum(pr_var)

Let’s plot to understand how many variables we ought to take for the model creation.

# Simple plot b/w Principal Components and Variance.
plot(prop_varex, xlab = "Principal Component", ylab = "Proportion of Variance Explained",
     type = "b")

plot(cumsum(prop_varex), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", type = "b")

We see that the first 150 principal components account for more than 80% of the variance.
Subsetting the first 150 variables in a data frame.

# Train data set
housetrain <- data.frame(SalePrice = train$SalePrice, prin_train$x)
housetrain <- housetrain[,1:151]# Test data set
housetest <- predict(prin_train, newdata = testnewpred)
housetest <- as.data.frame(housetest)
housetest <- housetest[,1:150]

caret Package overview:

One of the biggest challenge beginners in machine learning face is which algorithms to learn and focus on. In case of R, the problem gets accentuated by the fact that various algorithms would have different syntax, different parameters to tune and different requirements on the data format. This could be too much for a beginner.

So, then how do you transform from a beginner to a data scientist building hundreds of models and stacking them together? There certainly isn’t any shortcut but what I’ll tell you today will make you capable of applying hundreds of machine learning models without having to:

remember the different package names for each algorithm.
syntax of applying each algorithm.
parameters to tune for each algorithm.

All this has been made possible by the years of effort that have gone behind CARET ( Classification And Regression Training) which is possibly the biggest project in R. This package alone is all you need to know for solve almost any supervised machine learning problem. It provides a uniform interface to several machine learning algorithms and standardizes various other tasks such as Data splitting, Pre-processing, Feature selection, Variable importance estimation, etc.

Follow this article to get a good overview on how caret package works.

Now that we’ve completed with the data crunching part let’s focus our attention on creating training models for predicting the test data set.

library(caret)

The best thing about caret package is the number of algorithms it allows us to use more than 175 algorithms in one single package. Well talk about overkill.
Click here to get to know the algorithms that caret package follows.

# See available algorithms in caret
modelnames <- paste(names(getModelInfo()), collapse=',  ')
modelnames

And if you want to know more details like the hyperparameters and if it can be used of regression or classification problem, then do a

modelLookup(algo)

train(): This function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model and calculates a resampling based performance measure.
Besides building the model train() does multiple other things like:
1. Cross validating the model
2. Tune the hyper parameters for optimal model performance
3. Choose the optimal model based on a given evaluation metric
4. Preprocess the predictors (what we did so far using preProcess())
trainControl(): The train() function takes a trControl argument that accepts the output of trainControl().
Inside trainControl() you can control how the train() will:
1. Cross validation method to use.
2. How the results should be summarised using a summary function
Cross validation method can be one amongst:
‘boot’: Bootstrap sampling
‘boot632’: Bootstrap sampling with 63.2% bias correction applied
‘optimism_boot’: The optimism bootstrap estimator
‘boot_all’: All boot methods.
‘cv’: k-Fold cross validation
‘repeatedcv’: Repeated k-Fold cross validation
‘oob’: Out of Bag cross validation
‘LOOCV’: Leave one out cross validation
‘LGOCV’: Leave group out cross validation
The summaryFunction can be twoClassSummary if Y is binary class or multiClassSummary if the Y has more than 2 categories.
By setting the classProbs=T the probability scores are generated instead of directly predicting the class based on a predetermined cutoff of 0.5.

Training Model:

Creating the trControl parameters on how our algorithm should work.
We’ll be using repeatedCV. CV stands for cross validation technique.
RepeatedCV states the number of times we will be repeating the cross validation. Think of it as a for loop to understand the data well.

control <- trainControl(method = "repeatedcv", repeats = 3)

Pretty simple, eh? Well as you go deeper into the caret package you will need more parameters to help you out. But for now we’ll stick to the basics.

Let’s create different models for different algorithms:

# Decision Tree:
model_dt <- train(SalePrice~., data=housetrain, trControl = control, method = "rpart",tuneLength = 15)# Random Forest:
model_rf <- train(SalePrice~., data=housetrain, trControl = control, method = "rf", tuneLength = 15)# kNN
model_knn <- train(SalePrice~., housetrain, trControl = control, method = "knn", tuneLength = 15)

Be warned this will take a lot of time to compute. Like a lot of time. Especially Random Forest.
You will observe that all the algorithms has almost same type of syntax. Well that’s the perk of caret package. You don’t need to remember all the parameters of different packages.
Although that’s also the downfall of caret package as in order to tune the model according to your specific needs it is not that helpful as it doesn’t take into account all the parameters.

Predicting the test data sets:

Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is! To do this we make predictions on the test features (the model is never allowed to see the test answers).

# Decision Tree
pred_dt <- predict(model_dt, housetest)# Random Forest
pred_rf <- predict(model_rf, housetest)# kNN
pred_knn <- predict(model_knn, housetest)

And voila! we have the output for the test data sets.
Now we need to save this predicted data in an organized form so as to submit it in kaggle.

# Decision Tree
dt_df <- data.frame(ID = row.names(housetest), SalePrice = pred_dt)
write.csv(dt_df, file = "DecisionTree.csv", row.names = F)# Random Forest
rf_df <- data.frame(ID = row.names(housetest), SalePrice = pred_rf)
write.csv(rf_df, file = "RandomForest.csv", row.names = F)# kNN
knn_df <- data.frame(ID = row.names(housetest), SalePrice = pred_knn)
write.csv(knn_df, file = "kNN.csv", row.names = F)

Now all you have to do is submit the created csv files in this link.

Conclusion:

With these codes we have come to the ending part of this article. At this point if you want to improve the code, we could try hyperparameter tuning on a different sets of algorithms. And perhaps more of the exploring the data sets to get a general idea on the variables.
Also more data available better it is for predicting the data. It would be encouraged if anyone could improve the performance of the model without using different algorithms but rather by crunching the data.
For those who are genuinely interested in learning the ins and outs of caret package I highly recommend this article which I stumbled upon and has helped me immensely.
Moreover, I hope everyone who made it through has seen how accessible machine learning has become and is ready to join the welcoming and helpful machine learning community.

As always, I welcome feedback and constructive criticism! My email is mervyn.akash10@gmail.com .

Happy coding!!