Machine Learning Algorithms For Every Occasion: Take 2

Clustering with Unsupervised Learning

In my last post, we talked about classification with supervised learning. Today, let’s visit an entirely different family of machine learning problems that we can tackle with unsupervised learning.

Unsupervised learning is about making sense of your dataset without having explicit labels on the data points.

Borrowing from our Oscars example, if we just had a big list of past Oscars winners, instead of trying to decide Win or Not Win for some particular nominee, we might instead be interested in discovering groups of similar winners, or finding unusual winners. In practice, applications of unsupervised learning include outlier detection, ranking (finding the most “important” data points), association rules (finding data points that go together), and clustering (finding the different types of data points).

Let’s first focus on clustering. Clustering is useful in many situations: not only can we identify the natural groupings in a data set (who are the viewers of this TV show?), we can also assign new data points to the groups (what species is this animal?), or find points similar to a new example (what do similar cancer cells look like?).

A clustering algorithm takes as input a set of data, with features, and produces an assignment of each data point to a group. We don’t know what the groups are: that is what the algorithm must decide. Now we also don’t know what the “best” clustering is: there’s no test data set that we can get a test error for. Intuitively, we know that “similar” points should be in the same group, and “different” points should be in different groups.

K-Means and K-Medians Clustering

K-Means is the most popular clustering method. All we need to do is provide the number of clusters (groups) k, and k initial guesses for the centers of each group (means).

The algorithm goes like this: for each point in our data, assign it to the closest mean. (We must decide what measure of closeness we will use). Then we need to update the means since our old means may not be the centers of their clusters after a new assignment. Rinse and repeat until we are out of data points and none of them change clusters (convergence). If we use Euclidean distance (L2-norm) as our similarity measure, K-Means is guaranteed to converge.

What are we really doing here? A good clustering will have small distances between each point and its assigned mean. So we can interpret K-Means as minimizing the distances from each point to its center (the squared distances if we’re using Euclidian distance).

If our data has lots of outliers, we can consider using K-Medians and the L1-norm distance function instead, and this will be a bit more robust to outliers since we’re no longer squaring distances a la Euclidean distance.

How much does all of this cost? We are going to calculate the distances between each point and each mean. This is O(ndk) for n objects, d features and k means. To update the means, we have to add up all the points in each cluster and average them, which costs O(nd). That’s pretty fast!

Drawbacks of K-Means:

  • Every object must be assigned to a cluster. This means we could end up with a bad model if we had lots of points that weren’t close to other points but our algorithm forced them join a cluster.
  • We don’t know how to choose k.
  • It is very sensitive to initialization. If we chose our initial means poorly, we may never be able to produce a satisfying clustering.

An obvious way to deal with the last issue is to try a bunch of different initial points and pick the best one, and this is the traditional approach (random restarts). But now we’ll see something a bit slicker…


Instead of starting with k initial means, we’ll just have one data point x as our first mean. Then we’ll compute the distance from every other point to this mean.

Now for each point we want to find the smallest distance between it and a mean. On the first iteration, we only have one mean. These distances are going to help us pick the next mean — we will choose a data point proportional to the smallest distance squared, as our next mean. Repeat all of that until we’ve chosen k means.

This algorithm basically guarantees us means that are far apart from each other, which addresses the initialization problem we had with standard K-means.

Density-Based Clustering (DBSCAN)

This type of clustering finds dense groups of points, and objects in sparse regions do not get assigned to a cluster. One DBSCAN application that is easy to visualize is population geography. This is the first non-parametric model we’ve seen, and it’s non-parametric because we don’t need to provide k, the number of clusters. But this also means that our model will be more complicated the more data we have (more clusters for example).

Although we don’t have to provide k, we do have new hyperparameters: radius and minPoints. The radius tells us the maximum distance between two points considered “close”, and minPoints is the minimum number of reachable points required for that region to be a cluster. A point with at least that number of close points is called a core point.

Our clusters are defined by these core points, and we just need to consolidate core points that are reachable from each other to form a supercluster from the core points and all their reachable points.

So not all points will necessarily be assigned to a cluster, which might be what we want sometimes. DBSCAN, unlike K-Means is not sensitive to initialization, however, we do need to decide how to pick radius and minPoints. A returning issue is the curse of dimensionality: we need a lot of points for the algorithm to do well. Clustering a new example is also pretty expensive since we need to compute its distance to every other point to find the reachable points.

Density-Based Hierarchical Clustering

But what if our data doesn’t have the same density everywhere? We might not be able to pick a radius that captures the clusters we want.

In that case, it’s hierarchical clustering to the rescue!

We need different radii. So, we’ll keep minPoints fixed, but we’ll try different values for radius and record the clusters we get as we do that. The final results will be a tree of different clusterings for different radius.

One common method for doing this is agglomerative clustering. We start with each point as its own cluster (n clusters), and then merge the two “closest” clusters at each step (decreasing the number of clusters by 1), and repeat until we have one giant cluster.


Slides from