Illustration by Lilian Zhao

Scalable Clustering for Exploratory Data Analysis

By Aurick Qiao, Jacob Jackson

Data is quickly becoming a precious resource for companies across all sectors, bringing with it promises to solve the most complex problems faced by businesses. However, data requires complete, end-to-end processing before much value can be derived from it. Although it’s only a subset of the data analysis and modeling techniques used in AI research, deep learning has drawn the attention of researchers and system builders around the world. This has left other important machine learning techniques, such as density-based clustering, decision trees, and linear models, underserved as the amount of data collected by modern enterprises grows rapidly.

At Petuum, we are interested in exploring all aspects of artificial intelligence, machine learning, and data analysis, including but not limited to deep learning. In this post, we’ll discuss how we built and scaled up a distributed implementation of HDBSCAN, a popular density-based clustering algorithm useful for initial exploration of real-world data. Using Petuum’s distributed systems, we are able to achieve near-optimal scalability for large datasets and performance of up to 150x greater than other popular implementations.

Introduction to Clustering and HDBSCAN

Clustering is the task of organizing a dataset of objects into groups (or clusters) of similar objects. It is an example of unsupervised learning, and does not require each data object to be manually labeled with a category first. As a result, clustering is useful for exploratory data analysis, exploring datasets that are not yet well-understood.

There are many methods that can be used to cluster a dataset, each with their own advantages and disadvantages. One popular method, k-means, is frequently used when exploring very large datasets because its computation can scale well to distributed systems containing many machines. However, it is based on strong assumptions, which limits its usefulness for analyzing real-world data. k-means assumes that

  1. The dataset has exactly k clusters. k is a number that must be specified before running the algorithm. Since clustering is often used for exploratory data analysis when the characteristics of the dataset are unknown, it is frequently not possible to choose a good value for k.
  2. Every data point must be in a cluster. Real-world datasets often contain many data points that are just noise. k-means will assign noise points to clusters, which makes it sensitive to outliers and distorts the final result.

Furthermore, k-means assumes that all clusters are roughly globular in shape and similar to each other in size. It can fail when used on real-world datasets in which cluster shapes and sizes differ and do not conform to those assumptions.

Example clustering using k-means vs. example clustering using HDBSCAN

HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise. It’s a clustering algorithm that overcomes many of the limitations of k-means. For example, it does not require a difficult-to-determine parameter to be set before it can be used. We can explain the other advantages of HDBSCAN by breaking down each term:

  • Hierarchical: arranges points into hierarchies of clusters within clusters, allowing clusters of different sizes to be identified.
  • Density-Based: uses density of neighboring points to construct clusters, allowing clusters of any shape to be identified.
  • Applications with Noise: expects that real-world data is filled with noise, allowing noise points to be identified and excluded from clusters.

To read about how HDBSCAN compares with other clustering algorithms in more detail, we recommend this blog post.

Scaling Up the HDBSCAN Algorithm

The traditional implementation of HDBSCAN consists of two main phases:

  1. Construct the k-nearest-neighbors (k-NN) graph, with an undirected edge connecting each point p to the k most similar points to p. Use the k-NN information to define a new metric called the mutual reachability distance between all pairs of points in the data.
  2. Find the minimum spanning tree (MST) connecting all points in the data according to the mutual reachability distance. This tree represents a hierarchy of clusters, and the individual clusters can be extracted using a simple heuristic.

For more detail on these phases, we recommend this blog post or the original paper.

Constructing a k-NN graph and finding the minimum spanning tree are both computationally expensive and impractical for large datasets — they both scale quadratically with the number of points in the dataset. Our solution is to replace both phases with approximations which are more efficient, have better scalability, and easier to implement in a distributed system. Of course, this solution only approximates the exact result of HDBSCAN, but we find that it performs well in practice.

Approximating the k-NN graph:

To approximate step 1, we use the recently published NN-Descent algorithm. It is based on the principle that the neighbor of a neighbor is likely to be a neighbor. Each point p keeps a list of k other points, which are the k closest points to p that have been found so far. In each update step, two randomly chosen a and b in the neighbor list of a point are compared. If b is closer to a than the farthest point in a’s neighbor list, then the farthest point in a’s neighbor list is replaced by b. ŒRepeating this update improves the accuracy of the k-NN graph approximation with each iteration.

NN-Descent possesses several desirable qualities which make it well-suited for our application. It has been shown to perform well on a variety of datasets and distance metrics, it has a sub-quadratic empirical runtime and can scale to larger datasets, and it can be easily distributed to multiple cores and machines.

Approximating the MST:

To approximate step 2, we fi€nd the minimum spanning tree of the graph produced by NN-Descent rather than the complete graph connecting all data points. The only important edges of the MST are those which connect points in the same cluster, and those edges should mostly be present in the k-NN graph. Doing this drastically reduces the size of the graph input to the MST algorithm, and allows the computation to scale to larger datasets.

Clustering Results

To assess the quality of the approximation, we evaluated the clustering on a 300-dimensional word embedding dataset and a 98,304-dimensional image embedding dataset. We found that the results were meaningful in both cases. Here are example clusters from both datasets:

Example clusters of related words resulting from running our HDBSCAN on a text dataset, and an example cluster of owls from running our HDBSCAN on an image dataset.

We also performed a quantitative assessment of the approximation quality using the Fowlkes-Mallows index. We generated synthetic clusters with 100,000 points, 20% of which were noise, and assigned the remaining points at random among 1000 clusters. We then compared the clustering result of our implementation with the exact clusters produced by the implementation of HDBSCAN in scikit-learn-contrib. In ten dimensions, the Fowlkes-Mallows index is 0.961, indicating a high degree of similarity between the exact and approximate clusterings (the maximum possible value is 1). A table of results for dimensions 1–10 is below:

Performance Results

The fastest HDBSCAN library known to us is the implementation in scikit-learn-contrib mentioned above. We compared this library with our own implementation on the 300-dimensional word embedding dataset. When restricted to a single thread, our implementation finished in 10 minutes, while scikit-learn-contrib/hdbscan took over 24 hours on the same hardware. In addition, we ran a comparison using a benchmark designed by the authors of scikit-learn-contrib/hdbscan. Both were run with their default settings, and both succeeded in finding the 10 synthetic clusters. Here is a graph comparing their runtimes using a range of dataset sizes:

Graph of runtime compared to Scikit HDBSCAN.

Although scikit-learn-contrib/hdbscan is faster at small dataset sizes, its asymptotic behavior quickly catches up to it. From the graph, it is clear that its runtime scales roughly quadratically with the dataset size, while the performance of our implementation scales near-linearly.

Traditional, non-deep machine learning is often used in practice, but is under-represented in today’s research. At Petuum, we believe in the importance of all machine learning methods, and their impacts on businesses around the world. In future posts, we will discuss more examples of machine learning algorithm development happening at Petuum, including both deep and non-deep methods.

This HDBSCAN application was created over the course of two months as part of an internship project. If this is the sort of project you’d like to work on, Petuum is hiring!