Machine Learning Classification Models (Part II)
In Machine Learning Classification Models (Part I), which introduced classification models that predict a discrete label or category, I covered two algorithms: Logistic Regression and K-Nearest Neighbors (K-NN). I will now cover the remaining models, describing their benefits and use cases. These models tend to be more complex than the previous two, but thanks to the functions available in R, we can reuse the same code format from Part I.
General Format (Template)
1. Import and Preprocess Data
‘
=== === ===
Data Preprocessing
=== === ===
‘
#Import Data
library(readr)
dataset = read_csv('medexpense_data.csv')
attach(dataset)

#Change Smoker and Gender into binary values
SMOKER <- ifelse(dataset$smoker == "yes", 1, 0)
GENDER <- ifelse(dataset$gender == "male", 1, 0)
MODIFIED_DATASET <- data.frame(dataset$medical_expenses, dataset$bmi, dataset$age, GENDER, SMOKER)
2. Splitting Data into Training/Test Sets and Feature Scaling
‘
=== === ===
Splitting Data and Feature Scaling
=== === ===
'
#Splitting Data into training and test sets (75% training / 25% test)
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(MODIFIED_DATASET$GENDER, SplitRatio = 0.75)
training_set = subset(MODIFIED_DATASET, split == TRUE)
test_set = subset(MODIFIED_DATASET, split == FALSE)

#Feature Scaling
training_set[, 1:3] = scale(training_set[, 1:3])
test_set[, 1:3] = scale(test_set[, 1:3])
3. Training
‘
=== === ===
Training
=== === ===
‘
#Varies based on model
4. Visualization
'
=== === ===
Data Visualization
=== === ===
‘
#Create Confusion Matrix (Real VS Predicted)
conf_matrix = table(test_set[,5], dependent_pred)
conf_matrix
fourfoldplot(conf_matrix)
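As a small optional addition to the template (not part of the original), the overall accuracy can be computed directly from this confusion matrix by dividing the correct predictions on the diagonal by the total number of predictions.

#Overall accuracy from the confusion matrix
accuracy = sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy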
A sample of the data I will be working on appears in the following table:
Support Vector Machine (SVM)
Like the previous models, SVMs used for classification predict a discrete label or category. The algorithm uses what is called a hyperplane to split two categories or classes. In two dimensions this hyperplane can be visualized as a line acting as the boundary between the two classes; the learning comes in when determining the hyperplane that splits the variable space. SVMs are effective because they classify based on the points most dissimilar to the typical members of their own class. For example, when comparing cats and dogs, an SVM will use the dogs most similar to cats and the cats most similar to dogs to place the boundary. In short, it uses the most extreme cases for classification.
Method
The maximum margin hyperplane is found by maximizing the margin between the closest point of each class; these closest points are called the support vectors. In a 2D space the boundary is a line, but in a multidimensional space it is called a hyperplane, separating a positive and a negative category.
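To make the idea of support vectors concrete, here is a minimal sketch on a small synthetic data set (not the medical expense data) that fits a linear SVM with e1071 and inspects which points were chosen as support vectors; every name below is illustrative only.

#Toy illustration of support vectors on synthetic 2D data
library(e1071)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- factor(rep(c(0, 1), each = 10))
x[y == 1, ] <- x[y == 1, ] + 2   #shift one class so a linear boundary exists
toy <- data.frame(x = x, y = y)
toy_classifier <- svm(y ~ ., data = toy, type = 'C-classification', kernel = 'linear')
toy_classifier$index             #row indices of the points acting as support vectors
plot(toy_classifier, toy)        #decision boundary with support vectors marked as crosses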
Example
Using the same template as the problem from Part I, I will simply change the training section to the R function required for training an SVM and then view the confusion matrix. For this model the e1071 package is required, as it will be for Kernel SVM and Naive Bayes.
‘
=== === ===
Training
=== === ===
‘
#SVM
#Fitting classifier to Training Set
install.packages('e1071')
library(e1071)

classifier = svm(formula = SMOKER ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')

#Predict test set results
dependent_pred = predict(classifier, newdata = test_set[-5])
From the confusion matrix we can see the model made ten incorrect predictions, which, going back to Part I, is about as accurate as Logistic Regression and K-NN. This is somewhat expected, since the data already fits a linear pattern and therefore would not see much improvement from another linear model such as SVM, as shown in the graph. The next model, which is nonlinear, may be able to predict slightly better, but it should be noted that it works far better for nonlinear data, as I will show in the next section.
Kernel SVM
This model works similarly to SVM, the main difference being cases in which a boundary is not easily found. Imagine the points of one class surrounding the points of the second class. SVM separates linearly separable data, whereas Kernel SVM places a boundary for non-linearly separable data by working in a higher-dimensional space. This makes it effective for data sets that cannot be separated by a straight line.
Method
Taking a non-linearly separable data set, this model maps the data to a higher-dimensional space where it becomes linearly separable and can be split with a decision boundary using SVM. The boundary is then projected back into the original dimensions. This mapping is, however, highly compute-intensive, so the approach used in practice is what is called the "kernel trick".
This trick allows the selection of a landmark point which, through a kernel function, defines a region for the points of one class; anything outside of that region falls into the other class. All of the computation therefore occurs in the lower-dimensional space. For more on this, see The Kernel Cookbook.
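As a rough illustration of why this is cheap, the Gaussian RBF kernel below measures the similarity of two points entirely in the original feature space; the gamma value is just an example.

#Gaussian RBF kernel: similarity computed directly in the original space
rbf_kernel <- function(x1, x2, gamma = 1) {
  exp(-gamma * sum((x1 - x2)^2))
}
rbf_kernel(c(1, 2), c(1.5, 2.5))   #nearby points give a larger value (about 0.61)
rbf_kernel(c(1, 2), c(8, 9))       #distant points give a value near 0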
There are a few different kernel functions to choose from, and the choice depends on the distribution of the data. Some common kernel functions are:
- Gaussian RBF
- Sigmoid
- Polynomial
Example
The R code for Kernel SVM looks almost identical to SVM, with the slight adjustment of changing the kernel parameter. In this example I will be using the Gaussian RBF function, so the parameter is set to 'radial'.
‘
=== === ===
Training
=== === ===
‘
#Kernel SVM
#Fitting Kernel SVM to Training Set
library(e1071)

classifier = svm(formula = SMOKER ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'radial')

#Predict test set results
dependent_pred = predict(classifier, newdata = test_set[-5])
Using this nonlinear model, as shown in the graph, I was able to attain slightly less error, with nine incorrect predictions. This is not much better, however, because the data already fits a linear function. Looking at the data points, though, we do see two central areas of green points separated by some red points; selecting these two areas as landmark points rather than a single point, and trying a different kernel function, might improve the model further.
Naive Bayes
To begin, a very brief explanation of Bayes' Theorem: it gives the probability of an event based on prior knowledge of conditions that might be related to the event, or, in terms of the formula below, how often A happens given that B happens.

P(A|B) = P(B|A) * P(A) / P(B)

This classification model, based on Bayes' theorem, assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature (i.e. the features are conditionally independent given the target value). That assumption may not be true, but each of these features still contributes independently to the probability. The combination of the theorem and the naive assumption that the features do not depend on one another is what gives this model its name, and it is effective on very large data sets for both binary (two-class) and multi-class classification.
Method
The calculation steps for Naive Bayes require computing the Prior Probability, Marginal Likelihood, Likelihood, and Posterior Probability for each feature. These probabilities are then compared to determine where to place a new data point. This is fairly different from the previous models, which attempted to find a function for a line that splits the data.
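To make those four quantities concrete, here is a hand-worked sketch using made-up counts (not the actual data set), estimating the probability that a person is a smoker given that they are male.

#Toy counts, hypothetical and for illustration only
n_total <- 100; n_smoker <- 20
n_male  <- 50;  n_male_smoker <- 15

prior      <- n_smoker / n_total             #P(smoker) = 0.20
likelihood <- n_male_smoker / n_smoker       #P(male | smoker) = 0.75
marginal   <- n_male / n_total               #P(male) = 0.50
posterior  <- likelihood * prior / marginal  #P(smoker | male) = 0.30
posterior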
Example
To implement the Naive Bayes algorithm in R, I use the naiveBayes() function in the e1071 library and set the necessary parameters. The naiveBayes() function requires the dependent variable SMOKER to be encoded as a factor during the Data Preprocessing step, before splitting the data. Then the confusion matrix is outputted.
‘
=== === ===
Data Preprocessing
=== === ===
‘
#Follow template …
#Encoding the Target Feature as factor
MODIFIED_DATASET$SMOKER = factor(MODIFIED_DATASET$SMOKER, levels = c(0, 1))
'
=== === ===
Training
=== === ===
‘
#Naive Bayes
#Fitting Naive Bayes to Training Set
library(e1071)

classifier = naiveBayes(x = training_set[-5], y = training_set$SMOKER)

#Predict test set results
dependent_pred = predict(classifier, newdata = test_set[-5])
As can be seen from the confusion matrix, there are thirty incorrect predictions, which reveals that this nonlinear model is not a good fit for the data. This may be due to the small size of the data set and/or the spread of the data.
Decision Tree Classification
A decision tree classification model has the structure of a binary tree in which the data is broken down into smaller and smaller subsets. Each internal node in the tree represents a test on an input variable, and each leaf represents a predicted category. Because the tree is built by splitting the data points into sections, it is effective when dealing with relationships that are difficult to split with a single line.
Method
At the most basic level, the decision tree model splits the data up into several slices. Each split is chosen so that the resulting zones, or leaves, contain as many points of a single category as possible. To see the math behind the selection of each split, see Decision Tree — Classification.
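For a rough sense of how a split is scored, the short sketch below computes Gini impurity, the criterion rpart uses by default for classification: a pure leaf scores 0 and a perfectly mixed two-class leaf scores 0.5.

#Gini impurity: 1 minus the sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(c(1, 1, 1, 1))   #0   -> pure leaf
gini(c(1, 1, 0, 0))   #0.5 -> maximally mixed leaf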
Example
Using the template, the only change required is in the training section, where a new library, rpart, is used to create the classifier. Then I can output the confusion matrix.
‘
=== === ===
Training
=== === ===
‘
#Decision Tree
#Fitting Decision Tree to Training Set
install.packages('rpart')
library(rpart)

classifier = rpart(formula = SMOKER ~ ., data = training_set)

#Predict test set results
dependent_pred = predict(classifier, newdata = test_set[-5], type = 'class')
The model still predicts with about the same error as the previous models, with ten incorrect predictions. The next model may be able to improve on this by using many trees as opposed to just a single one.
The tree can also be plotted to see the way the data was split.
#Plot Tree
plot(classifier)
text(classifier)
Random Forest Classification
Ensemble Learning combines several supervised learning models that are trained individually, with their results combined to make the final prediction. Random Forest is such a method of classification: it leverages many decision trees, running many decision tree models to make a decision. For this reason, Random Forest generally produces more accurate results than a single Decision Tree, since it can smooth out the effect of certain errors and uncertainties. However, since it utilizes a whole team of trees, it is also slower at predicting.
Method
To build the decision trees that make up the random forest, subsets of the data are used, and the forest is constructed using the following steps.
- Randomly choose k data points from the Training Set
- Build a tree associated with these points
- Choose the number of N trees to build and then repeat 1 and 2
- For a new data point, have each tree predict a category and assign the point to the category that wins the majority vote (illustrated in the sketch below)
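Step 4 is simply a majority vote; the sketch below mimics it with hypothetical predictions from ten trees for a single new data point.

#Hypothetical votes from 10 trees for one new point
tree_votes <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)
names(which.max(table(tree_votes)))   #"1" -> the point is assigned to class 1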
Example
In R, the randomForest library is used to create the Random Forest classifier. Again, the only change made, aside from importing this library, is in the training section, where randomForest() creates the classifier. The number of trees selected is 25; this can be modified, but be aware of overfitting the data. Then the confusion matrix can be outputted.
‘
=== === ===
Training
=== === ===
‘
#Random Forest
#Fitting Random Forest to Training Set
install.packages('randomForest')
library(randomForest)

classifier = randomForest(x = training_set[-5],
                          y = training_set$SMOKER,
                          ntree = 25)

#Predict test set results
dependent_pred = predict(classifier, newdata = test_set[-5])
With eight incorrect predictions, the model is more accurate than any of the previous algorithms I have looked at, if only by a small margin. However, it should be noted that viewing the graph of the data with the model will show signs of any possible overfitting, which can result from the number of trees selected, and the error may change based on that number.
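One way to see how the choice of ntree affects the error, assuming the classifier fitted above, is to look at the out-of-bag error that randomForest records as each tree is added.

#Out-of-bag error after each additional tree (first column), plus per-class error
head(classifier$err.rate)
plot(classifier)   #error curves versus the number of trees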
Final Remarks
Though all of the models gave a similarly small number of incorrect predictions, with accuracy around 96%, this will not be the case for all data sets, which will have varying numbers of independent variables and may be better fit by different curves. This is why it is important to use a model that best fits the data set you are working with, validate it with k-fold cross-validation, and then improve it with parameter tuning.
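As one hedged sketch of that workflow, e1071's tune() runs k-fold cross-validation (10-fold by default) together with a grid search over parameters, here for the radial-kernel SVM; the cost and gamma grids are only example values, and SMOKER is assumed to be encoded as a factor as in the Naive Bayes section.

#Grid search with 10-fold cross-validation for the radial-kernel SVM
library(e1071)
tuned = tune(svm, SMOKER ~ ., data = training_set,
             kernel = 'radial',
             ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(tuned)                  #cross-validated error for each parameter combination
best_classifier = tuned$best.model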
Cumulative Accuracy Profile
To measure the accuracy of the models used, I relied on the confusion matrix of each model. However, to avoid the accuracy paradox, models can also be assessed with the CAP (Cumulative Accuracy Profile) method to determine model accuracy.
Resources
Classification Algorithms: Classification and Clustering Algorithms
Kernel Algorithms: The Kernel Cookbook
Kernel Algorithms: Kernel Functions for Machine Learning Applications
Naive Bayes: Naive Bayes for Machine Learning
Decision Trees: Decision Tree — Classification.
Calculating Accuracy with Confusion Matrix: Classification Accuracy