As you know that the optimization problem in k-means clustering is to minimize the equation given below. Here, “k” is the total number of clusters that we want to find and the problem is to minimize the intra cluster distance(squared). This equation says that we want to find the centroids “Ci’s” which minimizes the distance between the data point belonging to a cluster and its centroid.
Now, to convert it into matrix factorization problem, I have to define a matrix “Z”, such that in this a value will be equal to one if that data point belongs to a particular cluster otherwise zero. Here, each row represents a cluster and each column represents a data point. So,”Z[i][j]” will be equal to one if jth data point belongs to ith cluster.
Now we can modify our optimization problem as follows:
Here, “Z[i][j]” will be one only for those data points that belong to that particular cluster. For Example, when “K” is equal to one “Z[1][j]” will be one only for those data points that belongs to cluster one, rest it will be zero. So, we are minimizing the intra cluster distances similar to normal k-means optimization problem.
Now, in order to convert this formulation into matrix factorization form, I will write this equation into matrix form as follows:
This is the matrix representation of the previous vectored form. It is similar to the Non Negative Matrix Factorization(NMF/NNMF) equation where matrix “A” can be seen as a product of two matrices “B” and “C”. Here also, I can write matrix “X” as a multiplication of matrices “C” and “Z”.
Here, “X” is my data matrix which represents the data points in d-dimensions, where I have total “n” data points. Matrix “C” is the centroid matrix where each row represents a centroid. It is a “k x d” matrix as I have taken “k” centroids each of d-dimensions. Matrix “Z” is the same matrix that I defined earlier.
I can use Gradient Descent to find the optimal “C” and “Z” using the above equation(Optimization problem/Matrix Factorization form) as my loss function.
You can easily understand it from the example of one data point. Lets say I take first data point “X1” which will be first row of “X” data matrix of dimensions ‘1 x d’. I want my centroid to be very close to “X1” to reduce the intra cluster distance. We will randomly initialize both “C” and “Z” matrices. So, I will take first row of “Z.t” matrix(because it corresponds to first data point), it will be a one hot encoded vector(i.e., a vector with only one value to be ‘1’ and rest all zeros) as one data point can only belong to a single cluster(each row representing a data point and each column representing a cluster).
When I will do the dot product between “Z.t” and “C”, I will get a vector of size ‘1 x d’. Now, according to the optimization problem or loss function, I want the difference between data point “X1” and this centroid to be very small. So, gradient descent will learn the matrices “C” and “Z” by reducing the loss or intra cluster distances.