Some easy R examples for support vector machines
In this post I will explain the principles of support vector machines (SVM). It may be a bit simplified, as SVM is a complex topic with a lot of theory behind it.
There is some hype nowadays about SVM, as there is with many concepts of machine learning. Basically, SVM is a binary classifier, i.e. it sorts data points into two buckets. We can extend SVM to more buckets by training one classifier for each pair of classes (one-against-one) and letting the classifiers vote on the final label, as sketched below.
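As a side note, the R-package e1071 that we will use below handles the multi-class case automatically via exactly this one-against-one voting scheme, so no extra work is needed. A minimal sketch on the built-in iris data set (three species):
library(e1071)
data(iris)
# svm() trains one classifier per pair of classes internally and combines them by voting
model <- svm(Species ~ ., data = iris)
table(pred = predict(model, iris), true = iris$Species)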
Nothing exciting so far. What is remarkable is how SVMs handle non-linearity: by projecting the data into a higher-dimensional space where the classification problem is actually linear.
So, how does SVM do the classification? Basically it identifies the hyperplane, i.e. a flat subspace with one dimension less than the data space, that separates the two classes best. This means maximizing the distance to the points closest to the separation border (these points are the eponymous support vectors). Now this works well for linearly separable classes, but that prerequisite is not very realistic. To overcome this problem, the data is transformed via some kernel function into a higher-dimensional space where linear separation is possible.
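To make this idea concrete, here is a minimal toy sketch (not the kernel trick proper, which never computes the mapping explicitly, but the same geometric idea): points inside and outside a circle are not linearly separable in (x, y), but after adding the extra feature x^2 + y^2 they are.
library(e1071)
set.seed(42)
# circle boundary: non-linear in (x, y)
toy <- data.frame(x = runif(200, -2, 2), y = runif(200, -2, 2))
toy$Class <- as.factor(ifelse(toy$x^2 + toy$y^2 > 1, "red", "blue"))
# explicit feature map into a third dimension makes the boundary a plane (r2 = 1)
toy$r2 <- toy$x^2 + toy$y^2
fit <- svm(Class ~ x + y + r2, data = toy, kernel = "linear")
mean(predict(fit, toy) == toy$Class)  # training accuracy, close to 1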
Enough theory, let the games begin. We will use the R-package e1071, which is an interface to libsvm, a C++ implementation of SVM. Let's create a three-dimensional sample data set: two numeric dimensions, the third one consisting of the two categories we want to classify.
Here is the R-code:
############################################
# Support vector machines: Examples
############################################
library(e1071)
library(rpart)
n <- 1000         # number of sample points
testSize <- 0.33  # fraction of data used as test set
#create data set
set.seed(1)
df <- data.frame(x=runif(n,-3,3),y=runif(n,-3,3))
#Example 1a: linear split in 1 dimension
df$Class <- as.factor(ifelse(df$x>1,"red","blue"))
#Example 1b: linear split in 2 dimensions
df$Class <- as.factor(ifelse(df$x+df$y>1,"red","blue"))
#Example 1c: polynomial split in 2 dimensions
df$Class <- as.factor(ifelse(df$x^2+df$y^2>1,"red","blue"))
## split data into a train and test set
index <- 1:nrow(df)
testIndex <- sample(index, trunc(n*testSize))
testSet <- df[testIndex,]
trainSet <- df[-testIndex,]
plot(df$x,df$y,col=as.character(df$Class))
# fit the SVM (radial kernel by default)
svm.model <- svm(Class ~ x + y, data = trainSet, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testSet[,-3]) # drop the Class column (column 3)
plot(testSet$x,testSet$y,col=as.character(svm.pred))
## compute svm confusion matrix
table(pred = svm.pred, true = testSet$Class)
sum(svm.pred==testSet$Class)/nrow(testSet) # overall accuracy
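The values cost = 100 and gamma = 1 above are just a guess. e1071 also ships a grid-search helper, tune.svm, that picks the best combination via cross-validation; a minimal sketch (the grid itself is an arbitrary choice):
# grid search over cost and gamma, evaluated by 10-fold cross-validation
tuned <- tune.svm(Class ~ x + y, data = trainSet,
                  gamma = 10^(-2:2), cost = 10^(0:3))
summary(tuned)
tuned$best.parameters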
Let’s look at the different examples:
Example 1a: linear split in 1 dimension
df$Class <- as.factor(ifelse(df$x>1,"red","blue"))
We see that the SVM handles this simple classification very well; the accuracy is ~99%.
Example 1b: linear split in 2 dimensions
df$Class <- as.factor(ifelse(df$x+df$y>1,"red","blue"))
Example 1c: non-linear split in 2 dimensions (circle)
df$Class <- as.factor(ifelse(df$x^2+df$y^2>1,"red","blue"))
Example 2:
Now let's get more complex and define a curved surface that splits the two groups:
df$z <- 3*df$x^3 - 2*df$y^2 - 1
df$Class <- as.factor(ifelse(df$z>0,"red","blue"))
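The labels are still a function of x and y only, so we can reuse the svm() call from above on the new Class column; a minimal sketch (reusing n, testSize and df from the script above):
# refit the SVM on the more complex class boundary
testIndex <- sample(1:nrow(df), trunc(n * testSize))
svm.model2 <- svm(Class ~ x + y, data = df[-testIndex, ], cost = 100, gamma = 1)
mean(predict(svm.model2, df[testIndex, ]) == df$Class[testIndex]) # accuracy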
Here is the R-code for the 3D plot:
library(plot3D)
scatter3D(x = df$x,y = df$y,z = df$z,phi=20,theta=20,bty=”b2")
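To see the two classes in the 3D plot, one can additionally color the points by class via the colvar argument; a small variation of the call above:
# color points by class: colvar maps the factor levels onto the two colors
scatter3D(x = df$x, y = df$y, z = df$z, phi = 20, theta = 20, bty = "b2",
          colvar = as.numeric(df$Class), col = c("blue", "red"), colkey = FALSE)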
So this gives a first impression of what SVMs are capable of. I hope to provide a more realistic setting soon.