How to use different algorithms with the caret package in R.

Mervyn Akash
Coinmonks
12 min read · Jul 10, 2018


Introduction:

Hello fellow readers, this is my first article, so please bear with me. There will be grammatical errors (not in the code), so apologies in advance. I'll be working on the House Prices dataset, a Kaggle competition, and using the caret package in R to apply different algorithms through a single interface instead of juggling separate packages, with some hyper-parameter tuning along the way.

Competition Description:

The Kaggle description of the dataset reads as follows:

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Problem Introduction:

So the dataset for this competition has about 80 columns, each with its share of missing values, because let's face it: a dataset without missing values is like life without soul.
The problem we will tackle is predicting the sale price of residential homes in Ames, Iowa. We have to use regression techniques to predict the SalePrice of each property. This is a supervised regression machine learning problem: supervised because we have both the features (data about each house) and the target (SalePrice) that we want to predict.
For regression problems we can use a variety of algorithms such as linear regression, random forest, kNN, etc. In R there is a different package for each of these algorithms.
The general idea of this article is: why use different packages for different algorithms when one package covers them all?
The caret package supports more than 175 algorithms. Instead of trying to remember a different package for each algorithm, caret lets you fit all of them through one simple function.
Sounds pretty simple, eh?
Well it’s not that simple. But we’ll look into it later.

Roadmap:

Before jumping straight into coding, let's lay out some guidelines for how we'll approach the problem statement.
1. State the problem statement.
2. Acquire the data in an accessible form.
3. Identify missing values and anomalies.
4. Prepare the data for machine learning algorithms.
5. Train models for different algorithms in caret.
6. Predict the output with the respective models.
7. Submit the output to Kaggle. ;p
Step 1 is ticked off. Our question: "Predict the sale price of each property based on the data given".

Data Acquisition:

Well, to start with the problem we need the data. Generally, most of the time is spent cleaning and exploring the data: finding relations between columns and deciding whether to build new columns out of the existing ones.
We will not go too deep into the exploration part, as the main theme of this article is how to use the caret package. But I'll post a new article if you need a primer on how to approach exploring datasets. Comment if you need it.
Getting back on topic: the data is available on Kaggle for download. The file format is csv (comma-separated values), the most common format in data work.
The following code loads the data in RStudio and displays the structure of the data.
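A minimal sketch of that step, assuming the competition files were downloaded as train.csv and test.csv into the working directory:

```r
# Load the train and test sets (file paths are assumptions --
# adjust them to wherever you saved the Kaggle downloads).
train <- read.csv("train.csv", stringsAsFactors = TRUE)
test  <- read.csv("test.csv",  stringsAsFactors = TRUE)

# Inspect the structure: column names, types and a few sample values.
str(train)
dim(train)   # should be 1460 x 81
dim(test)    # should be 1459 x 80 (no SalePrice column)
```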

Data Description:

Here’s a brief version of what you’ll find in the data description file.

  • SalePrice — the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

Yeah, I know it's a long list. But don't worry: I did tell you we won't be doing exploratory analysis. Actually, we won't be doing exploration of any kind. We'll simply let PCA do the hard work.

PCA? Read this article by Matt Brems. It covers the theoretical aspects of PCA. I’ll do the coding part.

Identify Anomalies/Missing Data:

The dimensions of the training data are 1460×81, and those of the test data are 1459×80. While going through the data (I didn't exactly go through it by hand, I just wrote a bunch of nerdy code) I realised there was some missing data, which is a great reminder that nothing in this world is perfect.
Missing data can impact your analysis and machine learning model drastically. For an introduction to handling missing values, go through this article.

We see that there are 5 columns, "Alley", "FireplaceQu", "PoolQC", "Fence" and "MiscFeature", which have more than 20% missing data.
The thing with missing values is that while it is good practice to impute them with reasonable values, if we explicitly impute values of our own choice we may be bending the data to our preference and are bound to get a biased model. So instead we remove any column whose share of missing values exceeds a certain threshold; I keep it at 20%.
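As a sketch of how that threshold can be applied (assuming train and test data frames from the loading step; the 0.2 cutoff is the 20% mentioned above):

```r
# Proportion of missing values per column, computed on the training set.
na_frac <- colMeans(is.na(train))

# Show the columns with more than 20% missing data.
sort(na_frac[na_frac > 0.2], decreasing = TRUE)

# Drop those columns from both train and test.
drop_cols <- names(na_frac[na_frac > 0.2])
train <- train[, !(names(train) %in% drop_cols)]
test  <- test[,  !(names(test)  %in% drop_cols)]
```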

Now that we've removed the columns containing more than 20% missing values, it is time to impute values for the remaining columns.

It was also noted that Utilities behaved oddly across the two datasets. In the training data, Utilities takes two values: AllPub and NoSeWa. But looking at the unique values of Utilities in the test data, we find only one: AllPub. A column with a single value in the test set carries no predictive information there.

I'll take the help of the mice package to impute the missing values using its random forest method.
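A sketch of that imputation (the m, maxit and seed values are my illustrative choices, not the article's original settings; the same call would be applied to the test set):

```r
library(mice)

# Impute remaining NAs with the random forest ("rf") method.
# m = 1 keeps a single imputed dataset.
imp <- mice(train, method = "rf", m = 1, maxit = 5, seed = 42)

# Extract the completed (imputed) data frame.
train <- complete(imp)
```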

Data Preparation:

One Hot Encoding:

Now that we've handled the missing values, we will merge the two datasets before creating dummy variables, so that both end up with identical columns.
Why, you ask? Well, this is after all a regression problem, and if the dataset doesn't contain continuous or discrete values it would create a bit of a problem. And we don't need to complicate things, do we?

Now that the datasets are merged, let's create the dummy variables. It ain't that difficult in R: we will use the dummyVars function from the caret package.

Then we separate the training and testing datasets again after creating the dummy variables.
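The merge, encode and re-split steps might look like this (variable names such as full, train_ohe and test_ohe are mine):

```r
library(caret)

# Set the target aside and merge train and test so both
# get exactly the same dummy columns.
train_labels <- train$SalePrice
full <- rbind(train[, setdiff(names(train), "SalePrice")], test)

# One-hot encode every factor column. fullRank = TRUE drops one
# level per factor to avoid perfectly collinear dummies.
dmy      <- dummyVars(" ~ .", data = full, fullRank = TRUE)
full_ohe <- data.frame(predict(dmy, newdata = full))

# Split back into train and test by row position.
n_train   <- nrow(train)
train_ohe <- full_ohe[1:n_train, ]
test_ohe  <- full_ohe[(n_train + 1):nrow(full_ohe), ]
```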

PCA:

Hopefully you have a general idea of how PCA works. Even if you don't, look here for the theoretical aspects of PCA.

Let’s plot to understand how many variables we ought to take for the model creation.

We see that the first 150 principal components account for more than 80% of the variance, so we subset the first 150 components into a data frame.
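A sketch of the PCA step, assuming hypothetical data frames train_ohe and test_ohe holding the one-hot-encoded predictors, and train_labels holding the saved SalePrice values:

```r
# prcomp with scale. = TRUE fails on constant columns,
# so drop zero-variance columns first.
keep <- apply(train_ohe, 2, var) > 0
pca  <- prcomp(train_ohe[, keep], center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained per component.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(var_explained, type = "l",
     xlab = "Principal component", ylab = "Cumulative variance explained")

# Keep the first 150 components and attach the target for training.
train_pca <- data.frame(pca$x[, 1:150], SalePrice = train_labels)

# Project the test set onto the same components.
test_pca <- as.data.frame(predict(pca, newdata = test_ohe[, keep])[, 1:150])
```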

caret Package overview:

One of the biggest challenges beginners in machine learning face is deciding which algorithms to learn and focus on. In R, the problem is accentuated by the fact that different algorithms have different syntax, different parameters to tune and different requirements on the data format. This can be too much for a beginner.

So how do you go from beginner to data scientist building hundreds of models and stacking them together? There certainly isn't any shortcut, but what I'll show you today will make you capable of applying hundreds of machine learning models without having to remember:

  • the different package name for each algorithm;
  • the syntax for applying each algorithm;
  • the parameters to tune for each algorithm.

All this has been made possible by the years of effort behind caret (Classification And Regression Training), possibly one of the biggest projects in R. This package alone is almost all you need to solve nearly any supervised machine learning problem. It provides a uniform interface to hundreds of machine learning algorithms and standardises various other tasks such as data splitting, pre-processing, feature selection, variable importance estimation, etc.

Follow this article to get a good overview on how caret package works.

Now that we're done with the data-crunching part, let's focus on training models to predict the test dataset.

The best thing about the caret package is the number of algorithms it offers: more than 175 in a single package. Talk about overkill.
Click here to see the list of algorithms the caret package supports.

And if you want more details, such as the tuning hyperparameters or whether a model can be used for regression or classification problems, caret's modelLookup() function will tell you.
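The truncated sentence above presumably refers to caret's modelLookup() function, which reports each model's tuning parameters and whether it supports regression and/or classification:

```r
library(caret)

# Look up one model by its caret method name, e.g. random forest.
# Returns a data frame with columns: model, parameter, label,
# forReg, forClass, probModel.
modelLookup("rf")

# Called with no argument, it lists every model caret knows about.
head(modelLookup())
```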

train(): This function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling-based performance measure.
Besides building the model, train() does several other things:

1. Cross-validates the model.

2. Tunes the hyperparameters for optimal model performance.

3. Chooses the optimal model based on a given evaluation metric.

4. Preprocesses the predictors (what we did so far using preProcess()).

trainControl(): The train() function takes a trControl argument that accepts the output of trainControl().

Inside trainControl() you can specify:

1. Which cross-validation method train() will use.

2. How the results should be summarised, via a summary function.

Cross validation method can be one amongst:

‘boot’: Bootstrap sampling

‘boot632’: Bootstrap sampling with 63.2% bias correction applied

‘optimism_boot’: The optimism bootstrap estimator

‘boot_all’: All boot methods.

‘cv’: k-Fold cross validation

‘repeatedcv’: Repeated k-Fold cross validation

‘oob’: Out of Bag cross validation

‘LOOCV’: Leave one out cross validation

‘LGOCV’: Leave group out cross validation

The summaryFunction can be twoClassSummary if Y is a binary class, or multiClassSummary if Y has more than 2 categories.

By setting classProbs = TRUE, probability scores are generated instead of directly predicting the class based on a predetermined cutoff of 0.5 (this applies to classification problems).

Training Model:

First we create the trControl object that governs how our algorithms will be trained.
We'll be using "repeatedcv", i.e. repeated cross-validation.
The repeats argument states how many times the cross-validation is repeated; think of it as a for loop around CV that gives a more stable performance estimate.
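A sketch of that control object (5 folds and 2 repeats are my illustrative choices; more repeats cost more time):

```r
library(caret)

# 5-fold cross-validation, repeated 2 times.
fit_ctrl <- trainControl(method  = "repeatedcv",
                         number  = 5,
                         repeats = 2)
```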

Pretty simple, eh? Well as you go deeper into the caret package you will need more parameters to help you out. But for now we’ll stick to the basics.

Let’s create different models for different algorithms:
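A sketch of those calls, assuming a hypothetical train_pca data frame of principal components plus SalePrice, and the fit_ctrl trainControl object (both names are mine). The only thing that changes between algorithms is the method string:

```r
library(caret)

set.seed(42)

# Linear regression.
model_lm  <- train(SalePrice ~ ., data = train_pca,
                   method = "lm",  trControl = fit_ctrl)

# k-nearest neighbours, trying 5 values of k.
model_knn <- train(SalePrice ~ ., data = train_pca,
                   method = "knn", trControl = fit_ctrl,
                   tuneLength = 5)

# Random forest -- the slow one.
model_rf  <- train(SalePrice ~ ., data = train_pca,
                   method = "rf",  trControl = fit_ctrl,
                   tuneLength = 3)

# Compare the resampling results side by side.
results <- resamples(list(lm = model_lm, knn = model_knn, rf = model_rf))
summary(results)
```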

Be warned: this will take a lot of time to compute. Like, a lot of time. Especially the random forest.
You will observe that all the algorithms have almost exactly the same syntax. That's the perk of the caret package: you don't need to remember the parameters of all the different packages.
Although that's also its downfall: when you need to tune a model to your specific needs, caret is less helpful, as it doesn't expose every parameter of the underlying package.

Predicting the test data sets:

Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is! To do this we make predictions on the test features (the model is never allowed to see the test answers).

And voilà! We have the output for the test dataset.
Now we need to save the predictions in an organised form so we can submit them to Kaggle.
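Assuming the model_rf and test_pca objects from the training sketch (hypothetical names), predicting and writing the submission file might look like:

```r
# Predict sale prices on the PCA-transformed test set.
preds <- predict(model_rf, newdata = test_pca)

# Kaggle's submission format is an Id column plus the prediction.
submission <- data.frame(Id = test$Id, SalePrice = preds)
write.csv(submission, "submission.csv", row.names = FALSE)
```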

Now all you have to do is submit the created csv file at this link.

Conclusion:

With this code we have come to the end of the article. From here, if you want to improve the results, you could try hyperparameter tuning on a different set of algorithms, and perhaps explore the datasets more to get a general idea of the variables.
Also, the more data available, the better the predictions. I'd encourage anyone to try improving the model's performance not by switching algorithms but by crunching the data.
For those genuinely interested in learning the ins and outs of the caret package, I highly recommend this article, which I stumbled upon and which has helped me immensely.
Moreover, I hope everyone who made it through has seen how accessible machine learning has become, and is ready to join the welcoming and helpful machine learning community.

As always, I welcome feedback and constructive criticism! My email is mervyn.akash10@gmail.com .

Happy coding!!
