Understanding K-Means Clustering and Kernel Methods

Meltem Tutar
Udemy Tech Blog

--

Clustering is a machine learning technique for identifying groups of similar data points. It is a valuable tool when you are interested in how data points relate to one another or in discovering the latent factors underlying a data set. At Udemy, we’ve utilized clustering techniques to identify similar lectures across thousands of courses to create lecture recommendations. We lacked labeled data, so unsupervised clustering methods were a natural starting point for our problem.

Because there are many clustering methods, we researched which one would best fit our use case. K-Means is one of the most widely used and fundamental unsupervised algorithms, and it has connections to other clustering algorithms. For example, the vectorized K-Means objective function is used to show how non-negative matrix factorization (NMF) can be utilized for clustering. Additionally, Kernel K-Means is equivalent to spectral clustering for a particular set of parameter values. We believe it’s important to understand K-Means and its drawbacks as a starting point for our exploration.

Hence, this article explains more about K-Means and Kernel K-Means, an extension of K-Means that performs better when clusters have non-linear boundaries. Kernel methods are used to achieve linear separability of data in many different machine learning algorithms, the best-known example being support vector machines. The application of kernels to K-Means is less discussed, and we will go into great detail on this subject.

Whether you are aiming to apply Kernel K-Means directly or understand K-Means or Kernel methods as separate topics, this article will be helpful for you. We will be outlining the following:

  • The intuition of K-Means and some of its disadvantages
  • Vectorization of the K-Means objective to apply kernel methods
  • Origin of kernel methods and their utility
  • Descriptions of Python packages to apply Kernel K-Means and similar algorithms

We include key takeaways at the end.

Overview of K-Means

K-Means aims to partition N observations into K clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid). In other words, it minimizes within-cluster dissimilarity using the following objective function:

Figure 1: K-Means Objective Function, which partitions N observations into K clusters to minimize within-cluster dissimilarity. C represents each cluster, 1 through K, and x represents data points, each a vector of dimension M.
Figure 2: Definition of cluster centroids uᵢ, the average of all data points assigned to cluster i
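
In symbols (a reconstruction consistent with the two captions above, with Cᵢ denoting cluster i and uᵢ its centroid), the objective and the centroid definition can be written as:

    \min_{C_1, \dots, C_K} \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - u_i \rVert^2,
    \qquad
    u_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x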

The objective function is minimized using an iterative approach that converges to a locally optimal solution. We alternate between (a) treating the cluster centroids (uᵢ’s) as fixed and computing cluster assignments (Cᵢ’s) and (b) treating the cluster assignments as fixed and computing cluster centroids. This article talks about the implementation of K-Means in more depth.
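
For concreteness, here is a minimal NumPy sketch of that alternation (illustrative only: initialization and empty-cluster handling are simplified, and in practice you would reach for sklearn.cluster.KMeans):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """X: (N, M) array of data points; k: number of clusters."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
        for _ in range(n_iter):
            # (a) centroids fixed: assign each point to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # (b) assignments fixed: recompute each centroid as the mean of its points
            # (assumes no cluster ends up empty)
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):  # converged to a local optimum
                break
            centroids = new_centroids
        return labels, centroids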

Disadvantages of K-Means

There are a variety of conditions where K-Means may result in incorrect clustering.

K-Means may not perform well when:

  • Clusters are non-spherical
  • Clusters have different sizes
  • Data has outliers
  • Clusters are non-linearly separable
  • Clusters have overlap
  • Cluster centroids have poor initialization

In this article, we will focus on the condition of non-linearly separable clusters. However, Wikipedia has a great discussion of the other conditions as well.

Example of Non-Linearly Separable Clusters

Figure 3: Example clustering when data is non-linearly separable.

See this Google Colab for the data generation and K-Means fit used to produce this plot. Feel free to make a copy and play around with it.
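
If you prefer not to open the notebook, a plot like this can be reproduced along the following lines (a sketch; the exact data-generation settings in the Colab may differ):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons

    # Two interleaving half-circles: a classic non-linearly separable case
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=labels)  # K-Means cuts straight across the two moons
    plt.show()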

Vectorization of K-Means Objective Function

Vectorizing the K-Means objective will be necessary to see how kernel methods can be applied to K-Means. This re-formulation is more rigorously shown in this survey by Turkmen¹. The following is a brief example to demonstrate its use.

Variables and matrix definitions with examples

Let us define variables and matrices to be utilized to vectorize the objective function of K-Means.

Variables:

  • x is the input data point, a vector of dimension M
  • N is the number of data points
  • K is the number of clusters

Matrix Definitions:

  • Matrix X is the input data points arranged as columns, dimension MxN
  • Matrix B is the cluster assignments of each data point, dimension NxK
  • Matrix D is the number of data points assigned to each cluster, inverted and placed along the diagonal, dimension KxK

The following are examples of these matrices for a use case with N=100 data points and K=4 clusters.

Matrix B

B is a matrix where the rows are one-hot encoded cluster assignments for each data point of the X matrix. In hard clustering, each row can only have one column with a 1, indicating the data point is assigned to that cluster.

Figure 4: First few rows from example matrix B of cluster assignments, dimensions NxK

Matrix D

D is the number of data points assigned to each cluster, inverted and placed along the diagonal. Notice that the diagonal elements inverted and summed are equal to 100 (N, the number of data points), and the dimension of the matrix is 4x4 (KxK, the number of clusters). Also, notice that D can be derived completely from B, where D = (BᵀB)⁻¹.

Figure 5: Matrix D defined as the number of elements in each cluster inverted and placed on the diagonal
Figure 6: Example matrix D, dimensions KxK

Matrix X

Matrix X is the input data points arranged as columns. Each data point (xᵢ) is a vector of dimension M.

Figure 7: Example matrix X, dimension MxN

The matrix product XBD then represents the matrix of cluster centroids, with dimension M (vector dimension) x K (number of clusters).

With all the above formulations, you can see that multiplying X and B gives a new matrix where each column is the sum of the vectors in one cluster. Multiplying by D then divides each column by the number of points in that cluster, placing each cluster centroid on a column, where the cluster centroid is defined as in Figure 2 (sum of the data points in a cluster divided by the number of points in the cluster). In other words, XBD yields a matrix of dimension MxK and can be defined as,

Figure 8: Matrix with the centroid of each cluster on the columns, dimension MxK

Multiplying the XBD matrix by Bᵀ gives a matrix with the same dimension as X, where each column is the centroid of the cluster that the corresponding data point is assigned to.

Figure 9: Matrix where each column corresponds to the cluster centroid of the assigned cluster. For example, if data points 1 and 2 are assigned to clusters 1 and 3, then the first and second columns of the XBDBᵀ matrix will be the cluster centroids of cluster 1 and cluster 3, respectively. This pattern continues for all the data points, yielding a matrix with the same dimension as X (M by N).
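
To make the matrix algebra concrete, here is a small NumPy check (a sketch with made-up data and an arbitrary hard assignment):

    import numpy as np

    rng = np.random.default_rng(0)
    M, N, K = 2, 6, 2                       # feature dimension, data points, clusters
    X = rng.normal(size=(M, N))             # data points as columns (M x N)
    labels = np.array([0, 0, 1, 1, 1, 0])   # arbitrary cluster assignment of each point

    B = np.eye(K)[labels]                   # N x K one-hot assignment matrix
    D = np.linalg.inv(B.T @ B)              # K x K, 1 / (cluster size) on the diagonal

    centroids = X @ B @ D                   # M x K: column i is the centroid of cluster i
    per_point = centroids @ B.T             # M x N: column j is the centroid of point j's cluster

    # Sanity check against the Figure 2 definition of a centroid
    assert np.allclose(centroids[:, 0], X[:, labels == 0].mean(axis=1))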

Put it all together to get the vectorized objective of K-Means!

The matrix inside the norm in the following figure, X − XBDBᵀ, has on each column the difference between a data point and the centroid of the cluster that point is assigned to. Hence, minimizing its Frobenius norm is equivalent to minimizing the K-Means objective presented in Figure 1.

Figure 10: Vectorized objective function of K-Means

Notice that matrix D is derived entirely from matrix B (D = (BᵀB)⁻¹), and X is simply the input data. In other words, the only variable in the vectorized objective is the assignment matrix B.
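
Continuing the NumPy sketch above, you can verify numerically that the Frobenius-norm form matches the Figure 1 objective (illustrative only):

    # Vectorized objective: squared Frobenius norm of X - XBD(B^T)
    vectorized = np.linalg.norm(X - X @ B @ D @ B.T, ord="fro") ** 2

    # Original objective: sum of squared distances of points to their cluster centroid
    original = sum(
        np.sum((X[:, labels == i] - centroids[:, [i]]) ** 2) for i in range(K)
    )

    assert np.isclose(vectorized, original)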

You can use trace operators to show that minimizing the objective function in Figure 10 is equivalent to maximizing the function in Figure 11. Page 5 of the survey by Turkmen¹ shows the derivation of this equivalence.

Figure 11: Vectorized objective function of K-Means with trace operators applied
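
Roughly, the derivation expands the squared Frobenius norm, uses the cyclic property of the trace, and applies the identity BᵀB = D⁻¹ (a condensed sketch of the argument in the survey):

    \lVert X - XBDB^\top \rVert_F^2
      = \operatorname{tr}(X^\top X)
        - 2\,\operatorname{tr}(D B^\top X^\top X B)
        + \operatorname{tr}(D B^\top X^\top X B \, D B^\top B)
      = \operatorname{tr}(X^\top X) - \operatorname{tr}(D B^\top X^\top X B)

Since tr(XᵀX) does not depend on the assignments, minimizing the norm is equivalent to maximizing tr(DBᵀXᵀXB), which involves the data only through XᵀX.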

Kernel methods for K-Means

Kernel methods can be applied when the objective function can be written as a function of dot products. We can see this is the case after re-formulating the K-Means objective function to contain the term XᵀX, as in Figure 11. This allows us to map our current feature vectors into higher dimensional spaces in a more computationally efficient manner using the kernel trick.

Why do we need higher dimension feature vectors in K-Means?

As mentioned, K-Means performs best when clusters are spherical, dense, and linearly separable, like the image on the right in Figure 12. For non-linearly separable clusters, like the image on the left in Figure 12, we often would like to project our data into a higher dimensional space so that the resulting clusters become linearly separable. Projecting to a higher dimension directly and calculating the K-Means objective there can be computationally expensive, as it requires calculating XᵀX in that higher dimension. The kernel trick allows us to project our data into a higher dimensional space to achieve linear separability while solving the K-Means problem more efficiently.

Figure 12: Example data points for clustering, where the left is non-linearly separable and the right is linearly separable
See this visualization to understand how projecting to higher dimensions leads to linear separability.

The mechanics of kernel methods

A kernel function corresponds to an inner product of vectors in a certain space. It can be thought of as a similarity function over pairs of data points in this space. It is important to keep in mind that not all vector spaces have a corresponding kernel function.

Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the “kernel trick.”²

In other words, kernel methods allow us to get the result of XᵀX where X may be in a higher dimension without actually calculating X in that higher dimension. Hence, anytime X appears in an objective function only in the form of XᵀX (as in Figure 11), you can apply the kernel trick to easily expand the feature set to higher dimensions, even to infinity!

A kernel trick example

Let’s show a simple example to make this more concrete. Assume that we have vector xᵢ and we’d like to project xᵢ into a higher dimensional vector space using f(x) as shown below:

Figure 13: Mapping xᵢ into a higher dimensional vector space using the function f(x)

In our example let’s say xᵢ = [1, 2, 3]

Figure 14: Example for xᵢ and what the mapping into the given higher dimensional vector space looks like
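
In symbols, the feature map used in this example (consistent with the expanded vector in Option 1 below) is the degree-2 polynomial map that lists all pairwise products of coordinates:

    f(x) = (x_1 x_1,\; x_1 x_2,\; x_1 x_3,\; x_2 x_1,\; x_2 x_2,\; x_2 x_3,\; x_3 x_1,\; x_3 x_2,\; x_3 x_3)
    \quad\Rightarrow\quad
    f([1, 2, 3]) = [1, 2, 3, 2, 4, 6, 3, 6, 9]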

Let’s say we would like to calculate f(xᵢ)*f(xᵢ), the dot product in the higher dimensional vector space for xᵢ with itself. You have two options:

Option 1: calculate x in the higher dimension and then take the dot product

f(xᵢ) * f(xᵢ) = [1, 2, 3, 2, 4, 6, 3, 6, 9] [1, 2, 3, 2, 4, 6, 3, 6, 9]ᵀ = 1 + 4 + 9 + 4 + 16 + 36 + 9 + 36 + 81 = 196

Option 2: apply the kernel trick on x in the lower dimension

The kernel function for projecting into the f(x) vector space is K(x, y) = f(x) * f(y) = (xᵀy)², assuming the vectors are column vectors. Calculating f(xᵢ) * f(xᵢ) with it goes as follows:

f(xᵢ) * f(xᵢ) = K(xᵢ, xᵢ) = (xᵢᵀxᵢ)² = ([1, 2, 3][1, 2, 3]ᵀ)² = (1 + 4 + 9)² = (14)² = 196

To apply the kernel trick, you need f(x) * f(y), the dot product in the higher dimensional space, for all pairs of data points, because that is what appears in the objective function of K-Means (remember XᵀX). The matrix storing these dot products is called the kernel matrix and has dimension NxN.

This was a straightforward example where the higher dimensional space we projected x into wasn’t that large. But using the kernel trick, you can even formulate projections into infinite-dimensional spaces.
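
Both options are easy to check in NumPy (illustrative):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])

    f_x = np.outer(x, x).ravel()   # explicit map to the 9-dimensional space
    option_1 = f_x @ f_x           # dot product computed in the higher dimension
    option_2 = (x @ x) ** 2        # kernel trick: K(x, x) = (x^T x)^2

    print(option_1, option_2)      # both print 196.0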

Some extra things to note:

  • Not every projection to a higher dimensional space has an associated kernel function. You can only apply this method when a kernel function is available. Luckily, many have already been developed; see popular kernels.
  • While kernel methods are computationally cheaper than explicit feature mapping, you still have to compute the NxN kernel matrix, which may be computationally intensive if N (the number of data points) is large.
  • Using Kernel K-Means, it is not as easy to compute the cluster centroids as in regular K-Means: XBD yields the cluster centroid matrix, and this is not easy to compute if X is projected to a very high dimension (distances to centroids can still be computed from kernel entries alone, as shown below).
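
This last point is usually worked around by never materializing the centroids at all: the squared distance from a mapped point f(xⱼ) to the (implicit) mean u_c of cluster C_c can be written purely in terms of kernel entries, a standard identity used, for example, in Dhillon et al.⁴:

    \lVert f(x_j) - u_c \rVert^2
      = K(x_j, x_j)
        - \frac{2}{\lvert C_c \rvert} \sum_{i \in C_c} K(x_j, x_i)
        + \frac{1}{\lvert C_c \rvert^2} \sum_{i, l \in C_c} K(x_i, x_l)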

The kernel trick example was inspired by this article³.

Packages for Using Kernel K-Means

There is code developed for Kernel K-Means, but as of the writing of this article, it hasn’t been merged into sklearn. Discussions are ongoing, but you are free to use this code if you would like. The code also supports a weighted version of Kernel K-Means, which assigns weights to data points in the objective function.

Other packages reference this Kernel K-Means code, for example the tslearn package, a Python package for time series analysis. See here for an example of clustering time series data with Kernel K-Means via the tslearn package.

Figure 15: Example Kernel K-Means clustering using the tslearn package on time series data in Python. The plot is from here.
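
As a rough sketch of what that usage looks like (the class and parameters below come from tslearn’s clustering module; exact names and defaults may differ across versions, so check the linked example):

    from tslearn.clustering import KernelKMeans
    from tslearn.datasets import CachedDatasets

    # Small bundled time series dataset shipped with tslearn
    X_train, y_train, _, _ = CachedDatasets().load_dataset("Trace")

    # "gak" is the Global Alignment Kernel, designed for time series
    model = KernelKMeans(n_clusters=3, kernel="gak", random_state=0)
    labels = model.fit_predict(X_train)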

There is a theoretical connection between Kernel K-Means and spectral clustering, which Dhillon et al.⁴ examine. They show that spectral clustering is equivalent to a relaxed version of Kernel K-Means for a specific choice of kernel and weights. Hence, you may also opt to experiment with the spectral clustering implementation available in sklearn.
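
For example, on the two-moons data from earlier, a sketch using sklearn’s implementation (assuming a nearest-neighbors affinity; parameters would need tuning for real data):

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # A nearest-neighbors affinity graph handles the non-convex moon shapes well
    model = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0)
    labels = model.fit_predict(X)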

As mentioned before, while kernel methods are computationally cheaper than explicit feature mapping, they may still be computationally intensive as you need to compute the NxN kernel matrix. In scenarios with many data points, using graph-based methods, like spectral clustering, may be a better option.

Key Takeaways

  1. Kernel methods expand the feature set to higher dimensions and learn non-linear boundaries.
  2. The K-Means objective function can be vectorized to have the XᵀX term, allowing kernel methods to be applied.
  3. For cases where clusters are non-linearly separable, Kernel K-Means can allow for more accurate clusters.
  4. Kernel K-Means code can be found here, but it is not yet merged into sklearn. The tslearn package has a Kernel K-Means clustering option. Spectral clustering, equivalent to Kernel K-Means for a certain set of parameters, is available in sklearn.

That’s it for now! I hope this article was helpful as an introduction to K-Means and Kernel methods. Feel free to post any questions in the comment section.

I want to thank my wonderful colleagues at Udemy: Sam Cohan, Chuong Do, Akshay Kamath, Gulsen Kutluoglu, and Austin Wang, for reviewing and providing valuable feedback for this article.

References

  1. Ali Caner Turkmen, https://arxiv.org/pdf/1507.03194.pdf
  2. Wikipedia on kernel methods, https://en.wikipedia.org/wiki/Kernel_method
  3. Xavier Bourret Sicotte, https://xavierbourretsicotte.github.io/Kernel_feature_map.html
  4. Dhillon et al., https://www.cs.utexas.edu/~inderjit/public_papers/kdd_spectral_kernelkmeans.pdf
