Python K-Means Data Clustering and finding of the best K

Scientific Programming School
3 min readJul 23, 2017

--

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar among them than they are to the others. K-Means is one of the most popular “clustering” algorithms. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than any other centroid.

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of Xthat assigns each of the points in Xto a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.

The mathematical condition for the K clusters and the K centroids can be expressed as:

Minimize

with respect to

The Algorithm

In the clustering problem, we are given a training set x(1),…,x(m), and want to group the data into a few cohesive “clusters.” Here, we are given feature vectors for each data point x(i)∈ℝn as usual; but no labels y(i) (making this an unsupervised learning problem). Our goal is to predict k centroids and a label c(i) for each data point. The k-means clustering algorithm is as follows:

Problem Statement:

Download data sets A and B. Both have 200 data points, each in 6 dimensions, can be thought of as data matrices in R 200 x 6 . For each, run some algorithm to construct the k-means clustering of them. Diagnose how many clusters you think each data set should have by finding the solution for k equal to 1, 2, 3, . . . , 10.

Python Implementation

We use Pandas and SKLearn (scikits-learn) to implement our solution.

What’s the best value of K?

So what is the right value of k? Like with PCA, there is no perfect answer towards choosing how many dimensions the subspace should be. When k is not given to you, typically, you would run with many different values of k. Then create a plot of cost(S, X) as a function of k.

Full code.

This cost will always decrease with larger k; but of course k = n is of no use. At some point, the cost will not decrease much between values (this implies that probably two centers are used in the same grouping of data, so the squared distance to either is similar). Then this is a good place to choose k.

Get our awesome Python REGEX course!

--

--