Kernels

Arnab · Published in Analytics Vidhya · 4 min read · Jul 21, 2021

What might seem like an outdated idea is an important step toward understanding machine learning and the math that governs it.

Feature Maps:

Let y denote the class label, taking one of two classes (shown as red and blue blobs), and let our data have two attributes, X¹ & X².

Linearly separable

Now, we may have a case where our data is linearly separable and we can clearly draw a line separating both classes.

But, we might also encounter situations where that is just not possible.

Not possible to separate linearly

In this case, no line can be drawn that separates the two classes.

In such a case we map our input to a new ‘feature space’ so that we can use a higher dimensional hyperplane to separate our classes.

We define a function that takes our input points from R² -> R³.

After applying this transformation our data looks something like this.

Now we can draw a plane separating the data. Intuitively, we can see how this works: the only difference is in the 3rd dimension. Points on the outer circle have larger values of X¹ and X², leading to a lower value along the z-axis.
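The exact transformation used for the plot isn't given here, so the snippet below is only a minimal sketch of the idea, assuming a hypothetical map Φ(X¹, X²) = (X¹, X², −((X¹)² + (X²)²)), which pushes the outer circle down along the z-axis:

```python
import numpy as np
from sklearn.datasets import make_circles

# Two concentric circles: not linearly separable in R^2
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

def feature_map(points):
    """Hypothetical map from R^2 to R^3: keep (x1, x2), add z = -(x1^2 + x2^2).

    Points on the outer circle have a larger radius, so they get a lower z,
    and a horizontal plane z = const can separate the two classes.
    """
    x1, x2 = points[:, 0], points[:, 1]
    z = -(x1 ** 2 + x2 ** 2)
    return np.column_stack([x1, x2, z])

X_3d = feature_map(X)  # shape (200, 3); now separable by a plane
```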

Kernels:

Wherever an algorithm requires a dot product of two input vectors, and we have transformed the input space with a feature map, we need to compute the dot product of the feature maps instead. These feature maps are often computationally expensive to compute or to store, and this is where the “kernel trick” comes in.

For instance in SVM,

Optimization problem for SVM [Reference]
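For context, the dual form of the SVM problem (which the figure shows) depends on the training inputs only through pairwise dot products; it is usually written as:

```latex
\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i
\;-\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, x_i^{\top} x_j
\quad \text{s.t.} \quad \alpha_i \ge 0,\;\; \sum_{i=1}^{N} \alpha_i y_i = 0
```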

Once we have transformed Xi and Xj into Phi(Xi) & Phi(Xj), we need an efficient way of computing their dot product.

Optimization problem for SVM with the feature map [Reference]
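In other words, every occurrence of the dot product xᵢᵀxⱼ is replaced by the dot product of the feature maps, which is exactly what the kernel computes:

```latex
x_i^{\top} x_j \;\longrightarrow\; \phi(x_i)^{\top} \phi(x_j) \;=\; K(x_i, x_j)
```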

Let’s take an example of a very common kernel:

Polynomial Kernel:

2nd order polynomial kernel
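In symbols, the second-order polynomial kernel above is just the squared dot product of the two inputs:

```latex
K(X, Z) = \left( X^{\top} Z \right)^{2}, \qquad X, Z \in \mathbb{R}^{d}
```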

So all this kernel is doing is taking the dot product between two vectors of dimension d and squaring it. This takes O(d) time.

More concretely, let’s take an example where d=2.

What would K(X, Z) be?

Expanding,
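With X = (x₁, x₂) and Z = (z₁, z₂), the expansion works out to:

```latex
K(X, Z) = (x_1 z_1 + x_2 z_2)^2
        = x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
```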

Now if we have a feature map like below
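The standard feature map that matches this expansion (and is presumably the one in the figure) is:

```latex
\phi(X) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix},
\qquad
\phi(Z) = \begin{pmatrix} z_1^2 \\ \sqrt{2}\, z_1 z_2 \\ z_2^2 \end{pmatrix}
```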

From this we can clearly see that the dot product of the two feature maps is the same as the output of the kernel function.

Dot product of our feature maps equals our kernel
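Writing the check out explicitly:

```latex
\phi(X)^{\top} \phi(Z)
= x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
= (x_1 z_1 + x_2 z_2)^2
= K(X, Z)
```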

If we were to calculate Phi(X) & Phi(Z) explicitly, without the kernel, we would spend O(d²) time on the transformation and the subsequent dot product.

Transforming our original circles data with our polynomial kernel (an approximation in 3D)

Without the “kernel trick”, computing the result for 10 training examples with 1000-dimensional inputs takes ~5.5s.

With the kernel trick, the time comes out to ~0.05s for the same size of matrix.
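The benchmark code isn't reproduced here; the sketch below shows one way such a comparison could be set up with NumPy (exact timings will vary by machine, and the sizes match the ones quoted above):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 1000))  # 10 training examples, d = 1000

def explicit_gram(X):
    """Explicitly build phi(x) = all d^2 pairwise products, then take dot products."""
    Phi = np.array([np.outer(x, x).ravel() for x in X])  # shape (N, d^2)
    return Phi @ Phi.T

def kernel_gram(X):
    """Kernel trick: K(x, z) = (x . z)^2, no explicit feature map needed."""
    return (X @ X.T) ** 2

t0 = time.perf_counter()
G_explicit = explicit_gram(X)
t1 = time.perf_counter()
G_kernel = kernel_gram(X)
t2 = time.perf_counter()

print(f"explicit feature map: {t1 - t0:.3f}s, kernel trick: {t2 - t1:.3f}s")
print("max abs difference:", np.abs(G_explicit - G_kernel).max())  # ~0
```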

Main takeaway: Whenever we have a dot product of individual input attribute vectors we can use kernels to make our computation more efficient.

Gram Matrix:

Practically, if we have N training samples, we form a Gram matrix (a matrix holding the computed kernel value for every pair of samples) as follows:

N training samples

The ith row and the jth column of the matrix would be as follows:
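That is, each entry of the Gram matrix is the kernel evaluated on the corresponding pair of samples:

```latex
K_{ij} = K(x_i, x_j) = \phi(x_i)^{\top} \phi(x_j), \qquad i, j = 1, \dots, N
```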

Conditions for valid Kernels:

For a kernel to be valid, the constructed Gram matrix must be symmetric (this should be fairly obvious, since K[j, i] = K[i, j]; the dot product is the same in either order). Additionally, it must be positive semi-definite (its eigenvalues are non-negative). We can show this using the “energy test”, which requires the quadratic form xᵀKx to be non-negative for any vector x:
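Sketching the argument (writing the components of the arbitrary vector as cᵢ so they don't clash with the training samples xᵢ): substituting K_{ij} = φ(xᵢ)ᵀφ(xⱼ) turns the quadratic form into a squared norm, which cannot be negative.

```latex
c^{\top} K c
= \sum_{i} \sum_{j} c_i\, c_j\, K_{ij}
= \sum_{i} \sum_{j} c_i\, c_j\, \phi(x_i)^{\top} \phi(x_j)
= \left\lVert \sum_{i} c_i\, \phi(x_i) \right\rVert^{2} \;\ge\; 0
```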
