Pattern Representation
The Kernel Method: Solving Linear Inseparability
One of the requirements for linear machine learning algorithms, such as the perceptron or linear SVM, is that the data be linearly separable. But what happens if our data violates this assumption? Do we have to forgo the benefits of a linear learner in favour of something more complex? Not necessarily: we can try the kernel method!
The idea behind the kernel method is to take data that is not linearly separable and map it onto a higher-dimensional space using a non-linear function. We then train the linear machine learning algorithm in that higher-dimensional space. To classify a new point, we map it into the higher-dimensional space and classify it with the higher-dimensional classifier. The process is much easier to visualize, so let's generate some data to see what's going on.
I have generated some mock data in Python that is clearly not linearly separable. However, there is an obvious pattern separating the two classes, which suggests we can find a function that splits the data in a higher dimension. So instead of trying to classify this data in two dimensions, let's map it to a three-dimensional space and try to classify it there.
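The original data-generation script isn't shown here, but mock data matching the description, one class inside a circle of radius 1 centred at (2, 2) and one class outside it, can be sketched like this (the radii and sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner class: 100 points inside the circle of radius 1 centred at (2, 2)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = rng.uniform(0, 0.8, 100)
inner = np.column_stack([2 + radii * np.cos(angles),
                         2 + radii * np.sin(angles)])

# Outer class: 100 points in a ring outside that circle
angles = rng.uniform(0, 2 * np.pi, 100)
radii = rng.uniform(1.2, 2.0, 100)
outer = np.column_stack([2 + radii * np.cos(angles),
                         2 + radii * np.sin(angles)])

X = np.vstack([inner, outer])          # shape (200, 2)
y = np.array([0] * 100 + [1] * 100)    # class labels
```

No straight line in the plane can separate a disc from the ring around it, which is exactly the situation the kernel method addresses.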
When choosing a mapping function, we should examine the feature space to help determine what the function should be. Things to look for include whether the data is discrete or non-linear, and whether there are any underlying patterns. In our case, the two classes are separated by a circular boundary, so we can define a function that assigns large z-values to points outside the circle of radius 1 centred at (2, 2), and small z-values to points inside it. The specific mapping function we will use is the following:
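The original formula isn't reproduced here, but one mapping consistent with the description, assigning each point a z-value equal to its squared distance from the circle's centre (2, 2), is:

```python
import numpy as np

def phi(X):
    """Map 2-D points (x, y) to 3-D points (x, y, z).

    z is the squared distance from the circle's centre (2, 2), so
    points inside the circle of radius 1 get z < 1 and points
    outside it get z > 1, making the classes plane-separable.
    """
    z = (X[:, 0] - 2) ** 2 + (X[:, 1] - 2) ** 2
    return np.column_stack([X[:, 0], X[:, 1], z])
```

This is a sketch of one suitable mapping, not necessarily the exact function used in the original post; any function that pushes the two classes to different z-ranges would work.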
When mapped, the data looks like this in three-dimensional space:
We can now apply the linear SVM algorithm to this data, which results in the following hyperplane.
Now if we were given a new point to classify, we would first map it to three dimensions via the mapping function, and then classify it based on which side of the hyperplane it lies on.
There are drawbacks to the kernel method, such as being more resource-intensive than inherently non-linear machine learning algorithms. Another downside is that a poor choice of mapping function can lead to overfitting and an inaccurate model. Lastly, the model can be difficult to interpret, since we have moved away from the original feature space.
However, the approach is worth trying if you are working with non-linear data but are limited to linear machine learning algorithms.
All code is available in this GitHub repo.
Zackary Nay is a full-time software developer working within the construction industry, where he implements artificial intelligence and other software to expedite repetitive tasks. In his free time, he likes to read and go backpacking.