Unsupervised Learning

Clustering. A heuristic on K and how to deal with noise

Julián Gutiérrez Ostrovsky
Hexacta Engineering
7 min read · Mar 5, 2020


Clustering is a well-known unsupervised learning technique for finding implicit structure in a data set without being explicitly told what to look for.

There are many algorithms that perform this task, but they are based on different techniques and therefore do not produce the same results.

Fruits in a store grouped by their type

The goal, then, is to generate K groups or clusters of points (every item from our data set) that are close to each other, where “close” depends on which features we use to represent the data and how we measure error.

In this post we will use a simple data set of geolocated points and discuss some ways and ideas to generate clusters based on how physically close those points are in space, compare and use different algorithms, measure results and deal with possible outliers.

Let’s take a look at our data set.

Points that we want to group

If we look at the big picture (left) we may think that we only have a few groups. In some contexts that could be useful, but taking a closer look, we find that we could have some very dense areas within a region. So... what value of K do we choose? That is, how many groups should we create?

Density Based Clustering. Finding K.

The answer to those questions is obvious: it depends.

What is not that obvious is that clustering algorithms fall into several categories, and we should find the one that best fits our model. Some of them are centroid-based, density-based, distribution-based and hierarchical clustering. The most widely used and famous clustering methods are centroid-based, but to use them we need to define a value for K.

A first approach could be to apply the elbow method and be done with it: we get not only the K value but also the clustering that minimizes the error. The problem with a blind search like that is that it is really expensive.
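
To make that cost concrete, a blind elbow search looks roughly like the sketch below, assuming our points live in a NumPy array `X`; one full KMeans fit per candidate K is exactly the kind of work we would like to avoid on a large data set (the K range is illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans

# Elbow method sketch: fit KMeans for a range of K values and record the
# inertia (sum of squared distances of points to their closest centroid).
k_values = range(2, 50)            # illustrative search range
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plot k_values vs. inertias and pick the K where the curve flattens out.
```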

Let's take advantage of the fact that we have a data set with many dense clouds of points and try a smarter method: density-based clustering.

In density clustering, a cluster in a data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. The data points in the separating regions of low point density are typically considered noise/outliers.

One disadvantage of clustering algorithms is their high complexity (the clustering problem is NP-hard), so we have to be careful if we have a large (or high-dimensional) data set. In this case we decided to take a representative sample from our data set to test different algorithms and tune hyperparameters, and then run the chosen algorithm on the whole data set.
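
That sampling step can be as simple as the sketch below (the sample size is an illustrative value, not the one we actually used).

```python
import numpy as np

# Take a representative random sample to experiment with algorithms and
# hyperparameters before running the chosen one on the full data set.
rng = np.random.default_rng(42)
sample_size = 20_000               # illustrative value
idx = rng.choice(len(X), size=sample_size, replace=False)
X_sample = X[idx]
```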

Another disadvantage is that density clustering produces outliers (noise) in zones that lie between other regions (frontiers), or in zones that are hard to reach. Here we can work by tuning the minimum cluster size (in our case we don't mind having small clusters) and the epsilon value, which is basically a measure of reachability from one point within a cluster, used to decide whether its neighbors belong to that cluster or not. Epsilon does not behave exactly the same across different density algorithms, but its main idea remains the same.

The most popular density-based algorithms are DBSCAN and OPTICS. We chose an improvement over DBSCAN, HDBSCAN, and compared its clustering results against OPTICS with similar parameter values. The results show that HDBSCAN is a faster implementation and produces much less noise with a similar number of centroids retrieved (see graph below). For small epsilon values, OPTICS has a slightly lower error and creates almost the same number of clusters in every iteration. Nice try OPTICS, but HDBSCAN converges much faster to a solution; maybe next time.

HDBSCAN vs OPTICS comparison
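
For reference, the kind of side-by-side run behind that comparison looks roughly like the sketch below, using the `hdbscan` package and scikit-learn's OPTICS on the sample from before; the parameter values are illustrative, not the exact ones from our experiments.

```python
import numpy as np
import hdbscan                      # pip install hdbscan
from sklearn.cluster import OPTICS

eps = 0.01                          # illustrative epsilon

hdb = hdbscan.HDBSCAN(min_cluster_size=5,
                      cluster_selection_epsilon=eps).fit(X_sample)
opt = OPTICS(min_samples=5, cluster_method='dbscan', eps=eps).fit(X_sample)

# Label -1 marks noise in both implementations.
for name, labels in [('HDBSCAN', hdb.labels_), ('OPTICS', opt.labels_)]:
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f'{name}: {n_clusters} clusters, {np.mean(labels == -1):.1%} noise')
```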

It's also fair to say that HDBSCAN tends not to work well with high-dimensional data; for now, that's not really our case.

Looks like we have a winner…

Epsilon vs Error vs Outliers. Choose one, you can't have them all.

OK, we found our method. Now let's find the best clustering. In clustering, “best” depends on our model, our context and what we want to do. It depends, again.

The main concept of this section is how to choose epsilon values that fit our model. Since this value indicates the distance between one point and another used to decide whether they belong to the same cluster, let's find out where all of our points' neighbors are. For each point, we will calculate the distance to its nearest neighbor. We used the sklearn implementation for that. Be careful: it's a really expensive method that can take a lot of time. Consider running it on a subset of your data to get at least a clue about epsilon values.

Applying 1-nearest-neighbor for every point
Setting a threshold of 0.1 we cover more than 99% of cases

In this case, more than 99% of the points are less than 0.1 away from their nearest neighbor. Distance between points is given by the L2 norm, so we are not assuming anything specific about our data set (like the fact that the points are geolocated).
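
That nearest-neighbor computation can be sketched with scikit-learn as follows, keeping the same L2 distance (run it on `X_sample` if the full data set is too large).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# n_neighbors=2 because each point's closest neighbor is itself (distance 0).
nn = NearestNeighbors(n_neighbors=2).fit(X_sample)
distances, _ = nn.kneighbors(X_sample)
nearest = distances[:, 1]          # distance to the actual nearest neighbor

print(np.quantile(nearest, 0.99))  # in our case, this falls below 0.1
```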

Let’s run HDBSCAN.

We have two aspects to take care of: the error and the number of outliers produced. So we will visualize those values as a function of epsilon.
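
A rough sketch of that sweep could look like the code below; the error definition (mean squared distance of each clustered point to its cluster centroid) is an assumption for illustration, and `min_cluster_size` and the epsilon grid are illustrative as well.

```python
import numpy as np
import hdbscan

def error_and_noise(X, eps):
    """Fit HDBSCAN for one epsilon and return (error, noise ratio)."""
    labels = hdbscan.HDBSCAN(min_cluster_size=5,
                             cluster_selection_epsilon=eps).fit_predict(X)
    noise_ratio = np.mean(labels == -1)
    sq_dists = []
    for k in set(labels) - {-1}:
        members = X[labels == k]
        centroid = members.mean(axis=0)
        sq_dists.append(((members - centroid) ** 2).sum(axis=1))
    error = np.concatenate(sq_dists).mean() if sq_dists else 0.0
    return error, noise_ratio

# Sweep epsilon and collect (epsilon, error, noise) to plot afterwards.
results = [(eps, *error_and_noise(X_sample, eps))
           for eps in np.linspace(0.001, 0.1, 20)]
```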

Comparison between outliers produced and error of clustering

With these graphs we have good grounds to decide which epsilon value to take. We can see that if we keep the 0.1 threshold value we will have very few outliers but a high error. If we're not willing to pay that error price, we will have to choose smaller epsilon values and deal with noise. Let's go with that…

Don’t worry, we’re almost done…

Minimizing error and removing noise with KMeans

If we want all points to be close to their cluster's centroid, we need to choose a small epsilon value. Let's fit our model with an epsilon of 0.001 and run HDBSCAN.
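
As a sketch (again with an illustrative `min_cluster_size`), the fit and the centroid extraction we will need later could look like this.

```python
import numpy as np
import hdbscan

# Fit HDBSCAN with a small epsilon; label -1 marks noise.
labels = hdbscan.HDBSCAN(min_cluster_size=5,
                         cluster_selection_epsilon=0.001).fit_predict(X)

cluster_ids = sorted(set(labels) - {-1})
centroids = np.array([X[labels == k].mean(axis=0) for k in cluster_ids])
outliers = X[labels == -1]

print(len(cluster_ids), 'clusters,', len(outliers), 'outliers')
```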

The right image corresponds to the squared region on the left. Black points represent outliers; colored points are our clusters

We got a really small error value (you can check it at the top of the graphics), we created over a thousand clusters (centroids are marked with an x), but we also got almost 10% outliers. So what do we do with those points?

Well, now we do have centroids. So why don't we run a centroid-based clustering algorithm? We know that those algorithms generate K partitions of the space, so they don't produce noise at all.

We will use a well-known algorithm to do it: KMeans. Of course, you could try another one (Mean Shift, Spectral Clustering, etc.) based on your needs. For our tests, KMeans was the one that did the trick.

We used the sklearn implementation of KMeans. We fit our model passing the HDBSCAN centroids as the initialization and setting a small tolerance to converge, based on the idea that the job is almost done and it should be an easy run for KMeans.
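
A minimal sketch of that setup, assuming `centroids` is the array extracted from the HDBSCAN result above (the tolerance value is illustrative):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=len(centroids),
            init=centroids,          # seed with the HDBSCAN centroids
            n_init=1,                # one run is enough with a good initialization
            tol=1e-4)                # convergence tolerance (illustrative value)
final_labels = km.fit_predict(X)     # every point gets a cluster: no noise left

print(km.n_iter_, 'iterations to converge')
```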

Of course, the noise will be gone but the error will necessarily increase. Also, as we are running KMeans, the centroids will tend to move a little from their original positions to minimize this error increase. Let's see what happens.

Final clustering for that epsilon threshold using HDBSCAN and KMeans

We did it! KMeans only took 10 iterations to converge to a solution. Our centroids moved a little to minimize the error, which of course increased but kept the same order of magnitude we had, and now we have our full, noiseless classification.

Our centroids effectively moved

This is not the only solution. We've already learned that we could pass a different epsilon value and go through the whole process again to find the optimal solution for our needs.

If you want some more…

Here we present some alternatives that may work for you depending on your needs. We tried all of these and got the smallest error value with the steps described above. You may also have other great ideas to explore.

  • Run KMeans with the K given by the HDBSCAN result (or a value around it) but without passing the centroids array. Of course, we would be ignoring the information we collected, and doing this could take much longer to converge to a solution.
  • Don't run KMeans at all; assign each noise point to its nearest cluster (see the sketch after this list). If we have just a few noise points, this is a pretty good solution because we avoid setting up a centroid-based algorithm, and it is fast.
  • If our context allows it, create a new cluster for each outlier point. Similar to the option above, but we will end up with singleton clusters. This could be useful if new points may arrive in the future.
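
For the second alternative, a minimal sketch (reusing `labels` and `centroids` from the HDBSCAN step above) could be:

```python
from sklearn.metrics import pairwise_distances_argmin

# Assign each noise point (label -1) to the cluster with the closest centroid.
noise_mask = labels == -1
labels = labels.copy()
labels[noise_mask] = pairwise_distances_argmin(X[noise_mask], centroids)
```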

[EXTRA] HDBSCAN with no outliers

If you reached this part, then you really are into clustering. Kudos!

As you can see from the outlier graphs, there's an epsilon value that produces no noise at all running just HDBSCAN. This value is pretty high, which means that many points will belong to the same cluster and, therefore, we will end up with a small number of clusters. Let's take a look at the result.

9 clusters found with centroids in black x

We got just 9 clusters. This approach is good for detecting the densest regions of our data, and maybe we could even define a different strategy for each cluster found.

