Python K-Means Data Clustering and Finding the Best K
A very common task in data analysis is grouping a set of objects into subsets such that the elements within a group are more similar to one another than they are to elements of other groups. K-means is one of the most popular clustering algorithms. K-means stores k centroids that it uses to define clusters: a point belongs to a particular cluster if it is closer to that cluster's centroid than to any other centroid.
The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of X that assigns each point in X to a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.
The mathematical condition for the K clusters and the K centroids can be expressed as:
Minimize

    J = Σ_{i=1..N} ‖x(i) − μ_{c(i)}‖²

with respect to the cluster assignments c(1), …, c(N) and the centroids μ_1, …, μ_K.
The Algorithm
In the clustering problem, we are given a training set x(1), …, x(m) and want to group the data into a few cohesive "clusters." Here we are given a feature vector x(i) ∈ ℝn for each data point, as usual, but no labels y(i) (making this an unsupervised learning problem). Our goal is to predict k centroids and a label c(i) for each data point. The k-means clustering algorithm is as follows:

1. Initialize the k centroids μ_1, …, μ_k, for example by picking k data points at random.
2. Assignment step: set each label c(i) to the index of the centroid closest to x(i).
3. Update step: set each centroid μ_j to the mean of the points currently assigned to cluster j.
4. Repeat steps 2 and 3 until the assignments no longer change.
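The alternating assignment and update steps can be sketched in plain NumPy; this is a minimal illustration of the algorithm, not a production implementation (the function name and its parameters are our own):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points,
        # keeping the old centroid if a cluster ends up empty.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only decrease the cost J defined above, which is why the loop terminates once the centroids stop moving.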
Problem Statement:
Download data sets A and B. Each contains 200 data points in 6 dimensions, so each can be thought of as a data matrix in ℝ^(200×6). For each data set, run the k-means algorithm and diagnose how many clusters it should have by computing the solution for k = 1, 2, 3, …, 10.
Python Implementation
We use pandas and scikit-learn to implement our solution.
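A sketch of the implementation follows. The CSV file name is an assumption (substitute the path to the downloaded data), and a random 200 × 6 matrix stands in for the real data so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical file name -- replace with the path to data set A or B:
# X = pd.read_csv("dataA.csv", header=None).values
# Stand-in so the example is self-contained: a random 200 x 6 matrix.
X = np.random.default_rng(0).normal(size=(200, 6))

# Fit k-means with a chosen k; n_init restarts guard against bad local optima.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (3, 6): one centroid per cluster
print(km.labels_[:10])            # cluster assignment c(i) for each point
print(km.inertia_)                # cost(S, X): sum of squared distances
```

`inertia_` is exactly the cost J minimized by the algorithm, which is what we will plot to choose k below.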
What’s the best value of K?
So what is the right value of k? As with choosing the number of dimensions for the subspace in PCA, there is no perfect answer. When k is not given to you, you typically run the algorithm with many different values of k and then plot the cost cost(S, X) as a function of k.
This cost always decreases as k grows, but of course k = n (one cluster per data point) is of no use. At some point the cost stops decreasing much between consecutive values of k; this suggests that two centers are being used within the same natural grouping of the data, so the squared distance to either is similar. That flattening point is a good place to choose k.
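The sweep over k can be sketched as follows; three well-separated synthetic blobs stand in for the downloaded data (an assumption), so the cost should flatten out after k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in data: three well-separated Gaussian blobs in 6 dimensions.
X = np.vstack([rng.normal(c, 0.3, (70, 6)) for c in (0.0, 5.0, 10.0)])

# Sweep k = 1..10 and record the cost (scikit-learn calls it inertia_).
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in range(1, 11)]

# Plotting costs against k (e.g. with matplotlib) shows the elbow;
# printing the values makes the flattening visible too.
for k, c in zip(range(1, 11), costs):
    print(k, round(c, 1))
```

The cost drops sharply up to the true number of clusters and only marginally afterwards; reading the elbow off this curve is the diagnosis asked for in the problem statement.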