Published in 0xCODE

SVM Classification Algorithms In R

Support Vector Machines (SVMs), also known as support vector networks, are supervised learning algorithms used for classification on labeled training data. An SVM separates the observations in a training set into categories using either a linear or a non-linear decision boundary. Which one you get depends on the kernel function chosen for the model, e.g. linear, polynomial, radial basis function (RBF), or sigmoid. Non-linear classification is therefore possible in SVM by way of the kernel trick.
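To make the idea of a kernel function concrete, here is a minimal base R sketch of a linear kernel and an RBF kernel. The function names and the gamma value of 0.5 are illustrative assumptions, not part of the article's workflow.

```r
# Linear kernel: the ordinary dot product of two feature vectors.
linear_kernel <- function(x, y) {
  sum(x * y)
}

# RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2).
# gamma is a tuning parameter; 0.5 is an arbitrary illustrative value.
rbf_kernel <- function(x, y, gamma = 0.5) {
  exp(-gamma * sum((x - y)^2))
}

a <- c(1, 2)
b <- c(2, 4)

k_lin  <- linear_kernel(a, b)  # dot product of a and b
k_same <- rbf_kernel(a, a)     # identical points: distance 0
k_diff <- rbf_kernel(a, b)     # similarity decays with squared distance
```

The RBF kernel scores points by how close they are rather than by their dot product, which is what lets a kernel SVM draw curved boundaries.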

This example uses a sample dataset in RStudio. The dataset relates to people who were shown a social media ad for an SUV, with their age, their estimated salary, and whether they bought it. It consists of 400 entries. The original file is called ‘Social_Network_Ads.csv’ and contains 5 columns named User.ID, Gender, Age, EstimatedSalary and Purchased.

The objective is to classify people, by their age and salary, according to whether they purchased the SUV from the social media ad. The SVM will assign each observation to one of two classes: those who purchased the SUV and those who didn’t.

We will import the dataset first. We are only interested in 3 of those columns, which are Age, EstimatedSalary and Purchased.

dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
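As a small illustration of what the column selection does, the sketch below uses a made-up three-row data frame with the same column names as the CSV (the values are invented). It also shows selecting by name, which keeps working if the column order ever changes.

```r
# Made-up rows; only the column names mirror the article's CSV.
toy <- data.frame(
  User.ID = c(1, 2, 3),
  Gender = c('Male', 'Female', 'Female'),
  Age = c(19, 35, 47),
  EstimatedSalary = c(19000, 20000, 25000),
  Purchased = c(0, 0, 1)
)

# Selecting by position, as in the article: columns 3 to 5.
by_position <- toy[3:5]

# Selecting by name gives the same result.
by_name <- toy[c('Age', 'EstimatedSalary', 'Purchased')]
```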

The next step is encoding the target variable, Purchased, as a factor. This tells R to treat the 0 and 1 labels as categories rather than numbers, both for classification and for plotting.

dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
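A quick sketch of what the factor encoding does, using a short made-up vector of labels rather than the real column:

```r
# Hypothetical 0/1 labels standing in for the Purchased column.
purchased <- c(0, 1, 1, 0, 1)

# Encode as a factor with explicit levels, as in the article.
purchased_f <- factor(purchased, levels = c(0, 1))

lvls   <- levels(purchased_f)  # the category labels, stored as strings
counts <- table(purchased_f)   # how many observations fall in each class
```

Setting the levels explicitly guarantees both classes are present in the factor even if a subset of the data happens to contain only one of them.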

Now we must split the dataset into a training set and a test set. This requires loading the caTools library, which provides the sample.split function.

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Note that the split ratio is set to 0.75, which can be adjusted. The call set.seed(123) fixes the seed of R's random number generator so that the same random split is produced on every run. The sample.split function is applied to the Purchased column, flagging each row as TRUE or FALSE. The training_set takes rows that have a value of TRUE (300 of the 400) while the test_set takes rows that have a value of FALSE (the remaining 100).
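The effect of the seed, and of splitting within each class, can be sketched in base R. The helper stratified_split below is a hypothetical stand-in written for illustration; it is not how caTools implements sample.split, and the labels are made up.

```r
# Hypothetical labels: 8 zeros and 4 ones.
labels <- c(rep(0, 8), rep(1, 4))

# Flag roughly `ratio` of the rows in each class as TRUE (training).
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

# Resetting the same seed reproduces exactly the same split.
set.seed(123)
split_a <- stratified_split(labels, 0.75)
set.seed(123)
split_b <- stratified_split(labels, 0.75)
```

Without the set.seed call, each run would flag a different set of rows and the numbers later in the article would not be reproducible.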

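The next step is normalizing the features of the training and the test data. This requires feature scaling. The index -3 leaves out the Purchased column, so only Age and EstimatedSalary are scaled. Note that each set is scaled here with its own mean and standard deviation; a stricter approach scales the test set with the training set's values.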

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
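A common alternative to scaling each set on its own is to reuse the training set's mean and standard deviation on the test set, so both sets share one scale. The sketch below shows this with small made-up data frames; it is not what the article's code does, and the variable names are illustrative.

```r
# Made-up feature values standing in for the real training and test sets.
train <- data.frame(Age = c(20, 30, 40, 50),
                    EstimatedSalary = c(20000, 40000, 60000, 80000))
test <- data.frame(Age = c(25, 45),
                   EstimatedSalary = c(30000, 70000))

# Scale the training features and keep the statistics scale() used.
train_scaled <- scale(train)
center <- attr(train_scaled, 'scaled:center')  # column means
spread <- attr(train_scaled, 'scaled:scale')   # column standard deviations

# Apply the training statistics to the test features.
test_scaled <- scale(test, center = center, scale = spread)
```

Switching the article's code to this approach would change the scaled test values, so the confusion matrix for the test set could differ from the one reported.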

Scaling puts Age and EstimatedSalary on a comparable range, which helps the predictive accuracy of the algorithm. After scaling the features, proceed to fitting the SVM classifier to the training set.

library(e1071)
classifier = svm(formula = Purchased ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')

The ‘e1071’ library must be installed before it can be loaded with library(). The next lines run the classifier on the test set and on the training set so that predictions can be made.

y_pred = predict(classifier, newdata = test_set[-3])
y_train_pred = predict(classifier, newdata = training_set[-3])

The predictions are stored in the variables y_pred (test set) and y_train_pred (training set). To find out how accurate the predictions were, build a confusion matrix for each by tabulating the actual values against the predicted ones.

cm = table(test_set[, 3], y_pred)
cm2 = table(training_set[, 3], y_train_pred)

The result of the matrix for y_pred (test set) is:

   y_pred
     0  1
  0 57  7
  1 13 23

For the matrix of y_train_pred (training set):

   y_train_pred
      0   1
  0 183  10
  1  36  71

The confusion matrix or CM is a summary of the prediction results. Rows are the actual classes and columns are the predicted classes, so correct predictions sit on the diagonal. For y_pred, out of 100 test observations there were 80 correct predictions (57 and 23) and 20 incorrect predictions (13 and 7), an accuracy of 80%. For y_train_pred, out of 300 training observations there were 254 correct predictions (183 and 71) and 46 incorrect predictions (36 and 10), an accuracy of about 84.7%.

Note: The 254 in cm2 is the number of correct predictions, not the number of observations. All 300 training observations are in the matrix: 183 + 10 + 36 + 71 = 300.
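Accuracy can be read straight off a confusion matrix, since the correct predictions sit on the diagonal. The sketch below rebuilds the two matrices from the numbers reported above and computes accuracy for each; the helper name accuracy is an illustrative choice.

```r
# Rebuild the reported confusion matrices (rows = actual, columns = predicted).
cm <- matrix(c(57, 7, 13, 23), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c('0', '1'), predicted = c('0', '1')))
cm2 <- matrix(c(183, 10, 36, 71), nrow = 2, byrow = TRUE,
              dimnames = list(actual = c('0', '1'), predicted = c('0', '1')))

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy <- function(m) {
  sum(diag(m)) / sum(m)
}

test_accuracy  <- accuracy(cm)   # 80 correct out of 100
train_accuracy <- accuracy(cm2)  # 254 correct out of 300

# Total observations in each matrix.
n_test  <- sum(cm)
n_train <- sum(cm2)
```

The same helper works directly on the cm and cm2 objects produced by table() in the article, because diag() and sum() accept tables as well as matrices.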

At this point, the Environment -> Data section in our IDE should look like the following.

Visualizing the results is the next part. This will create a plot that shows the region the classifier assigns to each class, with the actual observations of the training and test sets drawn on top.

This is the training set.

# Base R graphics only; no extra plotting library is needed.
set = training_set
# Build a fine grid covering the range of both scaled features.
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# Predict a class for every grid point to color the two regions.
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'SVM (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
# Draw the decision boundary, the colored regions, then the actual observations.
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Here is the result of the plot.

The hyperplane is the boundary that separates the two classes. With a linear kernel and two features it is a straight line, and each observation is classified by the side of the line it falls on. In this case there are observable green dots in the red region and red dots in the green region. Those are incorrect predictions made on the training set. The red region covers those predicted not to purchase the SUV, while the green region covers those predicted to purchase the SUV based on the social media ads.
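For a linear boundary, classification comes down to the sign of a weighted sum of the features plus an intercept. The sketch below uses made-up weights and an intercept purely for illustration; they are not the values fitted by the article's classifier.

```r
# Hypothetical weights for (scaled Age, scaled EstimatedSalary) and intercept.
w <- c(1.5, 1.0)
b <- 0.5

# Signed score: positive on one side of the line w . x + b = 0, negative on the other.
decision_value <- function(x) {
  sum(w * x) + b
}

# Assign class 1 on the non-negative side of the line, class 0 otherwise.
classify <- function(x) {
  if (decision_value(x) >= 0) 1 else 0
}

older_higher_salary  <- classify(c(1, 1))    # score 3, so class 1
younger_lower_salary <- classify(c(-1, -1))  # score -2, so class 0
```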

Next, visualize the test set.

# Same plot as before, this time with the test set observations.
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# Predict a class for every grid point to color the two regions.
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
# Draw the decision boundary, the colored regions, then the actual observations.
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

The following is the resulting plot.

The linear model still leaves a number of incorrect predictions, so it can be refined further using Kernel SVM. In the example, the type of kernel used was linear. Since the boundary is a straight line, the green dots in the red region cannot be separated out unless a non-linear boundary is used. The task for further optimizing this model is therefore to get fewer errors in identifying those who bought the SUV (should be in the green region) and those who didn’t buy the SUV (should be in the red region).
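As a rough sketch of that next step, the code below fits the same svm() call with kernel = 'linear' and with kernel = 'radial' on synthetic data where the classes are separated by a circle. The data, variable names, and the circle radius are made up for illustration, and the model fitting only runs if e1071 is installed. On data shaped like this a radial kernel would be expected to fit better, but results on the SUV dataset would need to be checked separately.

```r
# Synthetic, made-up data: class 1 inside a circle, class 0 outside.
set.seed(123)
n <- 200
toy <- data.frame(x1 = runif(n, -2, 2), x2 = runif(n, -2, 2))
toy$label <- factor(as.integer(toy$x1^2 + toy$x2^2 < 1.5), levels = c(0, 1))

if (requireNamespace('e1071', quietly = TRUE)) {
  # Same call as in the article, once per kernel.
  linear_fit <- e1071::svm(label ~ ., data = toy,
                           type = 'C-classification', kernel = 'linear')
  radial_fit <- e1071::svm(label ~ ., data = toy,
                           type = 'C-classification', kernel = 'radial')

  # Share of correctly classified rows for each kernel.
  linear_acc <- mean(predict(linear_fit, newdata = toy[-3]) == toy$label)
  radial_acc <- mean(predict(radial_fit, newdata = toy[-3]) == toy$label)
  print(c(linear = linear_acc, radial = radial_acc))
}
```

In the article's own code, the equivalent change would be replacing kernel = 'linear' with kernel = 'radial' in the svm() call and rerunning the prediction and plotting steps.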

Note: The dataset is available from the SuperScience.com website.

Vincent Tabora

Editor HD-PRO, DevOps Trusterras (Cybersecurity, Blockchain, Software Development, Engineering, Photography, Technology)