K-means Clustering as a Matrix Factorization problem

Jayant
4 min readSep 29, 2019

--

Pic credits to Mohit Khera
Pic Credits to Mohit Khera

As you know that the optimization problem in k-means clustering is to minimize the equation given below. Here, “k” is the total number of clusters that we want to find and the problem is to minimize the intra cluster distance(squared). This equation says that we want to find the centroids “Ci’s” which minimizes the distance between the data point belonging to a cluster and its centroid.

Conventional k-means optimization problem
k-means optimization problem

Now, to convert it into matrix factorization problem, I have to define a matrix “Z”, such that in this a value will be equal to one if that data point belongs to a particular cluster otherwise zero. Here, each row represents a cluster and each column represents a data point. So,”Z[i][j]” will be equal to one if jth data point belongs to ith cluster.

Data Matrix “Z” values
Data Matrix “Z” values
Data Matrix “Z”
Data Matrix “Z”

Now we can modify our optimization problem as follows:

Modified optimization problem
Modified optimization problem

Here, “Z[i][j]” will be one only for those data points that belong to that particular cluster. For Example, when “K” is equal to one “Z[1][j]” will be one only for those data points that belongs to cluster one, rest it will be zero. So, we are minimizing the intra cluster distances similar to normal k-means optimization problem.

Now, in order to convert this formulation into matrix factorization form, I will write this equation into matrix form as follows:

Optimization problem in Matrix Factorization form
Optimization problem in Matrix Factorization form

This is the matrix representation of the previous vectored form. It is similar to the Non Negative Matrix Factorization(NMF/NNMF) equation where matrix “A” can be seen as a product of two matrices “B” and “C”. Here also, I can write matrix “X” as a multiplication of matrices “C” and “Z”.

Non-Negative Matrix Factorization Equation
Non-Negative Matrix Factorization Equation
Matrix Factorization form for clustering

Here, “X” is my data matrix which represents the data points in d-dimensions, where I have total “n” data points. Matrix “C” is the centroid matrix where each row represents a centroid. It is a “k x d” matrix as I have taken “k” centroids each of d-dimensions. Matrix “Z” is the same matrix that I defined earlier.

Input Data Matrix “X”
Input Data Matrix “X”
Centroid Matrix “C”
Centroid Matrix “C”
Relationship Matrix between data points and Centroids
Relationship Matrix between data points and Centroids

I can use Gradient Descent to find the optimal “C” and “Z” using the above equation(Optimization problem/Matrix Factorization form) as my loss function.

You can easily understand it from the example of one data point. Lets say I take first data point “X1” which will be first row of “X” data matrix of dimensions ‘1 x d’. I want my centroid to be very close to “X1” to reduce the intra cluster distance. We will randomly initialize both “C” and “Z” matrices. So, I will take first row of “Z.t” matrix(because it corresponds to first data point), it will be a one hot encoded vector(i.e., a vector with only one value to be ‘1’ and rest all zeros) as one data point can only belong to a single cluster(each row representing a data point and each column representing a cluster).

Optimization problem for one data point
Optimization problem for one data point

When I will do the dot product between “Z.t” and “C”, I will get a vector of size ‘1 x d’. Now, according to the optimization problem or loss function, I want the difference between data point “X1” and this centroid to be very small. So, gradient descent will learn the matrices “C” and “Z” by reducing the loss or intra cluster distances.

--

--

Jayant

Learning to become a Data Scientist. Try to get descent depth of the subject. Meditation is my daily routine. Key area of interest is Machine Learning with IoT.