Support Vector Machines | Machine Learning for High Schoolers

Ayana Bharadwaj
Nov 3, 2021


This blog post covers lectures 5, 6, and 7 of the Stanford machine learning course (CS229): http://cs229.stanford.edu/syllabus-spring2020.html.

The documents being used are the following:

http://cs229.stanford.edu/notes2020spring/cs229-notes2.pdf

http://cs229.stanford.edu/notes2019fall/cs229-notes3.pdf

Discriminative vs Generative Algorithms

Suppose you were building an algorithm to differentiate between toads and frogs. Is it better to try to find a distinguishing line between toads and frogs? Or is it better to see whether a new amphibian is closer to what we know frogs to be or what we know toads to be? Both these approaches are valid in machine learning.

Discriminative algorithms try to distinguish between different things, and generative algorithms try to match a new data point to known examples of each possible categorization.

In most cases, discriminative algorithms are better because they achieve a lower asymptotic error, but generative algorithms converge to their asymptotic error faster, so with smaller data sets they can be preferable. Generative algorithms can also be preferable when the data is not linearly separable, meaning a border cannot be easily drawn between the two classes.

The linear regression algorithms in the previous blog post are examples of discriminative algorithms; an example of a generative algorithm is the naïve Bayes algorithm, which we will look at today.

Gaussian Discriminant Analysis

Gaussian discriminant analysis is a generative algorithm. It assumes that the feature vectors (think of these as the data points or “measurements”) of frogs and of toads each follow a Gaussian (normal) distribution. Here’s an example (not based on real data) assuming there’s only one feature.

As with our other algorithms like LMS (least mean squares), we need to find the values of the parameters which maximize the log likelihood.
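To make this concrete, here is a minimal sketch of one-feature GDA in Python. All the numbers are made up for illustration: we estimate each class’s mean, a shared variance, and the class prior from the data (these are the maximum-likelihood parameter values), then classify a new point with Bayes’ rule.

```python
import numpy as np

# Made-up feature values for "frogs" (class 0) and "toads" (class 1).
x = np.array([2.1, 2.5, 3.0, 3.2, 5.8, 6.1, 6.5, 7.0])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1])

# Maximum-likelihood estimates of the GDA parameters:
phi = y.mean()                 # prior probability of class 1
mu0 = x[y == 0].mean()         # mean of class 0
mu1 = x[y == 1].mean()         # mean of class 1
# GDA uses one shared variance for both classes.
sigma2 = ((x - np.where(y == 1, mu1, mu0)) ** 2).mean()

def gaussian(v, mu, var):
    # Gaussian probability density at v.
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict(v):
    # Bayes' rule: pick the class with the larger p(x | y) * p(y).
    p0 = gaussian(v, mu0, sigma2) * (1 - phi)
    p1 = gaussian(v, mu1, sigma2) * phi
    return int(p1 > p0)

print(predict(2.8))  # -> 0 (closer to the class-0 distribution)
print(predict(6.3))  # -> 1
```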

Naïve Bayes Algorithm

The naïve Bayes algorithm is used in natural language processing, the field of teaching machines to work with human language. A common use is classifying emails as spam or not spam. The algorithm stores each email as a vector the length of the entire dictionary, with a 1 for each word present in the email and a 0 for each word that is not. Then, using Bayes’ rule with probabilities estimated from the training data, the algorithm can predict whether or not a new email is spam.
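Here is a minimal sketch of that idea in Python. The tiny “dictionary” and the emails are made up for illustration, and I’ve added Laplace smoothing, a standard tweak so that a word never seen in one class doesn’t force the whole probability to zero.

```python
import numpy as np

# Toy vocabulary; each email is a 0/1 vector over it (1 = word appears).
vocab = ["buy", "cheap", "now", "meeting", "project", "lunch"]

X = np.array([
    [1, 1, 1, 0, 0, 0],   # "buy cheap now"   -> spam
    [1, 0, 1, 0, 0, 0],   # "buy now"         -> spam
    [0, 0, 0, 1, 1, 0],   # "meeting project" -> not spam
    [0, 0, 0, 1, 0, 1],   # "meeting lunch"   -> not spam
])
y = np.array([1, 1, 0, 0])  # 1 = spam

def word_probs(cls):
    # Estimate p(word | class) with Laplace smoothing (+1 / +2).
    rows = X[y == cls]
    return (rows.sum(axis=0) + 1) / (len(rows) + 2)

p_word_spam, p_word_ham = word_probs(1), word_probs(0)
p_spam = y.mean()

def predict(email_vec):
    # The "naïve" assumption: words are independent given the class,
    # so the likelihood is a product over the vocabulary (log-sums here).
    log_spam = np.log(p_spam) + np.sum(
        np.where(email_vec == 1, np.log(p_word_spam), np.log(1 - p_word_spam)))
    log_ham = np.log(1 - p_spam) + np.sum(
        np.where(email_vec == 1, np.log(p_word_ham), np.log(1 - p_word_ham)))
    return "spam" if log_spam > log_ham else "not spam"

print(predict(np.array([1, 1, 0, 0, 0, 0])))  # "buy cheap" -> spam
print(predict(np.array([0, 0, 0, 1, 1, 1])))  # office words -> not spam
```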

A great video that explains the naïve Bayes algorithm is this one: https://www.youtube.com/watch?v=O2L2Uv9pdDA

The naïve Bayes algorithm makes a lot of assumptions but still manages to be surprisingly accurate, which makes me wonder if human language is as complicated as we think it is.

Support Vector Machines

The main idea of a support vector machine is to take relatively low-dimensional data, transform it into a higher-dimensional space, and then find a linear separation between the different classes of data. This is used when the original data is not linearly separable.

Here is an example to illustrate this:

Picture some one-dimensional data x, where the red points represent one class and the blue points the other, with one class sitting in the middle of the number line and the other out at the ends. There is currently no single cutoff that separates this data. Now plot that same data as the points (x, x²). In this two-dimensional view, there is a way to draw a line separating the red points and the blue points.

In drawing a line to separate the data, we need only look at the “edge” points of each class (i.e., -3.5, -2.5, 1.75, and 3) and draw the line that maximizes the margin between the classes based on these points. These edge points are called support vectors.
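Here is a short sketch of this example in Python using scikit-learn’s SVC. The data points are made up to match the spirit of the figure: one class in the middle of the number line, the other at the ends, which no single cutoff can separate until we add the x² feature.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: class 0 in the middle, class 1 at the ends.
x = np.array([-4.0, -3.5, -2.5, -1.0, 0.5, 1.75, 3.0, 4.0])
y = np.array([1, 1, 0, 0, 0, 0, 1, 1])

def features(v):
    # Transform each point x into the 2-D feature vector (x, x^2).
    v = np.atleast_1d(v)
    return np.column_stack([v, v ** 2])

# A linear SVM can now separate the classes with a straight line
# in the (x, x^2) plane. Large C approximates a hard margin.
clf = SVC(kernel="linear", C=1000.0).fit(features(x), y)

print(clf.predict(features([0.0, 5.0])))  # -> [0 1]
# The support vectors should come from the "edge" points of each class:
print(clf.support_vectors_[:, 0])
```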

How do we decide how to transform the data? What set of features do we choose? This is where kernels come in.

Kernels

Kernels are used to determine the feature transformation and to simplify the storage. My understanding of kernels is this: the SVM only ever needs the “distances” (inner products) between points in the transformed space, and a kernel function can calculate these directly from the original, low-dimensional values. Each result is a single number, so all the necessary calculations can be done without ever computing or storing the coordinates of the higher-dimensional data. When choosing a kernel function, it is important to try a simpler function first to avoid complications with overfitting and computation.
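Here is a tiny sketch of that trick in Python for the degree-2 polynomial kernel K(x, z) = (x · z)², with made-up numbers: the kernel gives exactly the same answer as explicitly transforming both points and taking a dot product, but without ever building the higher-dimensional coordinates.

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

def phi(a):
    # Explicit feature map whose dot products match (a . b)^2
    # for 2-D inputs: (a1^2, a2^2, sqrt(2)*a1*a2).
    return np.array([a[0] ** 2, a[1] ** 2, np.sqrt(2) * a[0] * a[1]])

explicit = phi(x) @ phi(z)   # build coordinates, then dot product
kernel = (x @ z) ** 2        # one number, no coordinates stored

print(explicit, kernel)      # both 16.0
```

With d original features and a degree-p polynomial, the explicit feature vector has on the order of d^p coordinates, while the kernel is still just one dot product and a power. That is what “simplifies the storage.”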

Here’s more on support vector machines and kernels:

https://towardsdatascience.com/svm-and-kernel-svm-fed02bef1200

Special thanks to Ishita Sharma for the feedback on this blog!
