Discovering Latent Clusters with K-means

Tyler Walker
Codezillas
May 27, 2018


[Image: A really big cluster]

As humans, we like to think things in the world fall neatly into discrete categories. These are bananas, those are papayas, these are apples, those are oranges. But even among categories that we consider virtually immutable, there can be edge cases that violate their strictest definitions.

I think it is safe to say that the state of nature is truly non-categorical, and that our categories are artificially imposed upon something which is not inherently binary or discrete. But that does not mean categories are meaningless. Categories tell you something about the statistical tendencies of their members, and help to organize otherwise indiscernible blobs of shape and color into identifiable collections.

Also as humans, we don’t know everything, so sometimes there are patterns in the data that we have not noticed yet. K-means is about discovering those latent clusters: statistical groupings that suggest underlying forces at work.

K-means

The idea is really simple: start with some random categories. Then assign each point to whichever category is closest. Finally, recalculate each center to be the average of the points within its category. Repeat until either there is no change, or some other stopping criterion is met. The calculated center is called the centroid; in this example, it is just the average of the points.

K-means is really a form of Expectation Maximization: that is, like many Bayesian models, we are looking for the most likely model given the data. Which function has an expected value that most closely predicts the data?
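
Concretely, the "model" here is the set of centroids μ₁, …, μ_k, and maximizing the likelihood corresponds to minimizing the total within-cluster squared distance, J = Σᵢ minⱼ ‖xᵢ − μⱼ‖². Each assignment/update round can only decrease J, which is why the algorithm converges.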

The dataset I will use is one of drivers: the number of miles they drove, and the percentage of the time they spent over the speed limit. I use a Node.js library called csvtojson to load the CSV file, then map over the rows, extracting the distance and percent speeding:

const csv = require('csvtojson');
// The path is an assumption; point this at wherever the CSV actually lives.
const drivers = './drivers.csv';
// Note: top-level await requires an ES module or an async wrapper.
let data = await csv().fromFile(drivers);
data = data.map(row => {
  const x = Number(row['Distance_Feature']);
  const y = Number(row['Speeding_Feature']);
  return { x, y };
});

So data is an array of (x, y) pairs, with x representing the distance and y representing the percentage of the drive spent speeding. Given data in this form, I created some helper functions for equality and distance:

// Euclidean distance between two points.
function distance(a, b) {
  return Math.sqrt((b.x - a.x) ** 2 + (b.y - a.y) ** 2);
}

// Two points are equal when both coordinates match.
function equals(a, b) {
  return (a.x === b.x) && (a.y === b.y);
}

Whenever doing K-means, you need some sort of distance function for determining which centroid is closest. In this case we are using Euclidean distance, but for Boolean attributes we could use the Hamming distance, or the Manhattan distance when travel along grid lines, rather than straight-line distance, is the natural notion of closeness.
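
As a rough sketch, a Manhattan variant would be a drop-in replacement for the distance function above:

function manhattan(a, b) {
  // Sum of absolute coordinate differences, i.e. distance along a grid.
  return Math.abs(b.x - a.x) + Math.abs(b.y - a.y);
}

I also created this function for generating a random centroid, which takes into account the range for this particular dataset: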

function randomCentroid() {
  // Rough bounds of this dataset: distance runs up to about 250,
  // and the speeding feature is a percentage.
  const xLim = 250;
  const yLim = 100;
  return { x: Math.random() * xLim, y: Math.random() * yLim };
}

Using this function, I will generate two random initial centroids. The initialization does not necessarily have to be random: you can also take a guess as to where the categories will be:

let k = 2;
const centroids = [];
for (let i = 0; i < k; i += 1) {
  const centroid = randomCentroid();
  centroids.push(centroid);
}

Now we are going to iterate over the samples, see which centroid is closest, and then assign each sample to the corresponding category. This is sometimes called the Expectation step, although in K-means the "expectation" is degenerate: rather than weighting each point's membership probabilistically, we simply assign each data point to its nearest centroid in a deterministic way:

const clusters = [];
for (let i = 0; i < k; i += 1) {
  clusters[i] = []; // empty array to hold samples for the cluster
}

for (let sample of data) {
  // Distance from this sample to every centroid.
  const distances = [];
  for (let centroid of centroids) {
    const dist = distance(sample, centroid);
    distances.push(dist);
  }
  // Assign the sample to the cluster of the closest centroid.
  const min = Math.min(...distances);
  const idx = distances.indexOf(min);
  clusters[idx].push(sample);
}

Next is the Maximization step, where we choose a new centroid that is most likely to have generated the points in its category. In this case the new centroid is just the average of the points in the category: for a fixed assignment, the mean is the center that minimizes the squared distance to its points, and so maximizes the likelihood of seeing the given data. The following function averages the x and y values in the cluster:

function computeCentroid(cluster) {
  const x = cluster.reduce((acc, sample) => acc + sample.x, 0) / cluster.length;
  const y = cluster.reduce((acc, sample) => acc + sample.y, 0) / cluster.length;
  return { x, y };
}

let stop = true; // assume convergence until some centroid moves
for (let i = 0; i < clusters.length; i += 1) {
  const cluster = clusters[i];
  const newCentroid = computeCentroid(cluster);
  if (!equals(newCentroid, centroids[i])) {
    stop = false; // a centroid moved, so keep iterating
  }
  centroids[i] = newCentroid; // assign the new centroid
}

One last check reinitializes a centroid if no points were assigned to it:

for (let i = 0; i < clusters.length; i += 1) {
  const cluster = clusters[i];
  if (cluster.length === 0) {
    console.log('None assigned to cluster: ', centroids[i]);
    centroids[i] = randomCentroid();
    console.log('New centroid: ', centroids[i]);
  }
}
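
For orientation, here is roughly how the pieces compose into the full algorithm. The helper names (assignClusters, updateCentroids, reseedEmptyClusters) are hypothetical wrappers around the three snippets above, not functions from the original code:

let stop = false;
while (!stop) {
  // Expectation: assign each sample to its nearest centroid.
  const clusters = assignClusters(data, centroids);
  // Maximization: move each centroid to the mean of its cluster;
  // returns true when no centroid moved.
  stop = updateCentroids(clusters, centroids);
  // Re-seed any centroid that received no points.
  reseedEmptyClusters(clusters, centroids);
}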

Finally we’re ready to see some graphs. Here are the results of running the algorithm with k = 2. I graphed everything using Plotly:

[Figure: K-means algorithm with k=2]

I drew one plot for the expectation step and one plot for the maximization step. You can see how the centroids are each immediately drawn to the nearest center of mass. If there are many samples around a point, K-means assumes that the source is likely nearby. It took only 4 iterations for the algorithm to settle into a stable position. It is possible that, given other random starting points, some of the in-between points could have been captured by the other group. So, initialization matters; one common remedy is to seed the centroids more carefully, as in the sketch below.
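
As a rough sketch of one such remedy (not part of the original code), k-means++-style seeding picks each new centroid from the data itself, weighted by its squared distance to the nearest centroid already chosen:

function kmeansPlusPlusInit(data, k) {
  // First centroid: a uniformly random data point.
  const centroids = [{ ...data[Math.floor(Math.random() * data.length)] }];
  while (centroids.length < k) {
    // Weight each point by squared distance to its nearest chosen centroid.
    const weights = data.map(p =>
      Math.min(...centroids.map(c => distance(p, c))) ** 2
    );
    const total = weights.reduce((acc, w) => acc + w, 0);
    // Sample an index with probability proportional to its weight.
    let r = Math.random() * total;
    let idx = 0;
    while (r > weights[idx] && idx < data.length - 1) {
      r -= weights[idx];
      idx += 1;
    }
    centroids.push({ ...data[idx] });
  }
  return centroids;
}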

There do in fact seem to be two major categories: drivers who drove a relatively short distance, and, after a discontinuous leap, drivers who drove relatively longer distances. But we can also intuit additional clustering within these categories in the vertical direction. In both groups, there is a highly concentrated set of drivers who spent most of the time under the speed limit, while a sparser group spent a high percentage of the time over it. Given these observations, let's try k = 4 and see what we get:

[Figure: K-means EM with k=4]

I think it is interesting how the red centroid gradually captures more of the points from the yellow centroid. It is as if the clusters have electromagnetic charges that repel each other. It took 7 rounds for the centroids to settle into a stable equilibrium.

Conclusions and Observations

Like constellations, the groupings produced by K-means are not always inherently meaningful, outside of the psychological parameters that may make them intuitive. There are, however, ways to choose better groupings. One way is to iterate over values of k, and choose the one which minimizes the average distance between cluster members and their centroid. A caveat is that increasing k will always reduce this metric, so the trick is to find the elbow: the point past which increasing k only trivially reduces the average distance.
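
A rough sketch of that elbow search, assuming a hypothetical runKMeans(data, k) helper that loops the steps above to convergence and returns { clusters, centroids }:

// averageWithinClusterDistance scores one clustering result.
function averageWithinClusterDistance({ clusters, centroids }) {
  let total = 0;
  let count = 0;
  clusters.forEach((cluster, i) => {
    cluster.forEach(sample => {
      total += distance(sample, centroids[i]);
      count += 1;
    });
  });
  return total / count;
}

for (let k = 1; k <= 8; k += 1) {
  const result = runKMeans(data, k); // hypothetical helper
  console.log(k, averageWithinClusterDistance(result));
  // Look for the elbow: the k past which the metric barely improves.
}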

It is also worth noting that the above method is easily generalizable to higher-dimensional spaces: just compute distances and averages over all the dimensions to find each new centroid.
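
For instance, a minimal sketch with points represented as arrays of numbers rather than { x, y } pairs:

// Euclidean distance over an arbitrary number of dimensions.
function distanceND(a, b) {
  return Math.sqrt(a.reduce((acc, ai, d) => acc + (b[d] - ai) ** 2, 0));
}

// Centroid as the per-dimension average of the cluster's points.
function computeCentroidND(cluster) {
  const dims = cluster[0].length;
  return Array.from({ length: dims }, (_, d) =>
    cluster.reduce((acc, p) => acc + p[d], 0) / cluster.length
  );
}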

You can also try an ensemble of K-means runs, taking the top three hypotheses and applying majority rule for category membership. Again, K-means is used when we do not know whether there are underlying causal categories or not. This is akin to feeling around in the mud hoping to find a diamond. Sometimes the result will correspond with what we can see or understand intuitively. Other times it can highlight unexpected character in the data, or reveal structure in dimensions too complex to understand. K-means shines a light in places too dark or too deep for the ordinary senses to reach.
