Combining Speed & Scale to Accelerate K-Means in RAPIDS cuML

Corey Nolet
Published in RAPIDS AI
Sep 26, 2019 · 7 min read

Albert Einstein once said “Imagination is more important than knowledge. Knowledge is limited; imagination encircles the world.” While extracting knowledge is essential for obtaining wisdom in any scientific discipline, it’s gleaning valuable insights from our data that drives us towards understanding.

As data sets have grown larger and larger, the imagination and intuition of data scientists have been hampered by the realities of compute time. With RAPIDS, a data scientist has the tools to act on their ideas quickly, at the speed of imagination, enabling them to maintain their train of thought and flow of inspiration.

RAPIDS Release 0.9 introduced two new multi-node, multi-GPU (MNMG) algorithms to cuML, the RAPIDS machine learning library. Random forests and k-means have been given the ability to scale up & out with the Dask distributed computing library. In this blog, I will focus on how we scaled our k-means algorithm, the first algorithm to fully utilize our new scalable architecture (which I will be explaining in detail in an upcoming blog). By the end of this blog, I hope you are as excited about the future of scalable and performant machine learning as we are!

What is k-means?

In 1963, Sokal & Sneath sparked a general interest in clustering methods with their monograph “Numerical Taxonomy”. This monograph catalyzed the adoption of clustering applications in the broader science, statistics, & data analysis communities¹.

Early forms of the algorithm were known to the statistics community simply as the “sum of squares” criterion, after the measure we now call inertia. K-means is an unsupervised learning algorithm, so it doesn’t require labels for training. It accepts, as input, a set of data points and a number of clusters, k. K-means picks a set of cluster centers, called centroids, that minimize the distance from the points to their nearest centroid. The inertia is the sum of the squared distances from each point to its closest centroid.
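
Written as a formula, for data points x_1, …, x_n and centroids μ_1, …, μ_k, the inertia that k-means tries to minimize is:

```latex
\text{inertia} = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2
```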

Visualizing the iterative training of k-means centroids

The k-means algorithm contains two basic stages, with an optional third stage.

  1. Choose an initial set of k cluster centroids
  2. Iteratively update the centroids until convergence or some max number of iterations is reached.
  3. Optional: Use the centroids to predict on unseen data points
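
To make the iterative stage concrete, here is a minimal NumPy sketch of the classic update loop (Lloyd’s algorithm). It is illustrative only; cuML performs the same assignment and update steps with CUDA kernels on the GPU.

```python
import numpy as np

def lloyd_kmeans(X, centroids, max_iter=300, tol=1e-4):
    """Minimal Lloyd's algorithm: assign points, then recompute centroids."""
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])

        # Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```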

K-means is sensitive to the initial choice of centroids. A bad initialization may never produce good results. Since finding the optimal initialization is very costly and hard to verify, we generally use heuristics. Aside from random choice, cuML’s k-means provides the scalable k-means++ (or k-means||) initialization method, an efficient parallel version of the inherently sequential k-means++ algorithm. It combines random sampling with the distribution of each point’s distance to the centroids chosen so far, in order to scatter the starting centroids across the space where actual points exist.
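
In cuML’s single-GPU API, the initialization strategy is chosen with the init parameter. A hedged sketch (exact option strings may differ between releases):

```python
from cuml.cluster import KMeans

# "scalable-k-means++" is the k-means|| initialization described above;
# "random" simply samples k points as the starting centroids.
kmeans = KMeans(n_clusters=8, init="scalable-k-means++", random_state=42)
```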

K-means assigns each data point to a single ball-shaped cluster, and it tends to produce clusters of similar size.² These shape and size constraints can often be overcome by transforming the input data or by using more complicated variants like kernel k-means and spectral clustering, which map the points to a new space, based on some measure of similarity, before clustering them. These variants add complexity, however, and can be very sensitive to the choice of mapping function. Plain old k-means with a few simple preprocessing tricks is still what many data scientists use in practice, because it is simple, fast, and effective.

Stuart Lloyd used k-means to quantize digital signals in 1957³, and k-means is still used for quantizing signals today, with image and speech processing being modern examples. Quantizing is also useful in machine learning as a pre-processing technique, reducing a feature to a fixed number of distinct values (k, in the case of k-means).
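
As a toy example of quantization, each sample below is replaced by its nearest trained centroid, so the column ends up taking only k distinct values (scikit-learn is used here for brevity; cuML’s KMeans mirrors the same interface):

```python
import numpy as np
from sklearn.cluster import KMeans  # cuml.cluster.KMeans exposes the same calls

signal = np.random.randn(10_000, 1)               # a noisy 1-D signal
km = KMeans(n_clusters=4, random_state=0).fit(signal)

# Replace every sample with its closest centroid: only 4 distinct values remain
quantized = km.cluster_centers_[km.predict(signal)]
```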

K-means is used by businesses to segment users and/or customers based on various attributes and behavioral patterns. It also finds use in anomaly detection, fraud detection, and document clustering.

The cuML project contains a C++ library with a growing collection of both dense and sparse CUDA primitives for machine learning. These GPU-accelerated primitives provide building blocks, such as linear algebra and statistics, for computations on feature matrices. Many algorithms in cuML, such as k-means, are constructed from these primitives.

Benchmarking cuML’s k-means against scikit-learn on a single GPU

We benchmarked the single-GPU k-means implementation against scikit-learn so we could determine the impact of cuML’s GPU acceleration. While scikit-learn is able to parallelize k-means using multiple CPU cores (by setting the n_jobs argument to -1), the GPU k-means implementation continues to demonstrate better performance as the data sizes are increased.

I ran this benchmark on a single NVIDIA GPU (a 32GB GV100) in an NVIDIA DGX1. The scikit-learn benchmark was run on the same DGX1, using all 40 cores (80 threads) of its 2.20GHz Intel Xeon E5-2698 v4 CPUs. We see speedups of more than 100x as the number of data samples reaches into the millions.
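
For reference, a minimal sketch of how such a comparison can be run (the synthetic data here is smaller than the benchmark above, and timings will vary with hardware and library versions):

```python
import time
import cupy as cp
from sklearn.cluster import KMeans as skKMeans
from sklearn.datasets import make_blobs
from cuml.cluster import KMeans as cuKMeans

X, _ = make_blobs(n_samples=1_000_000, n_features=20, centers=5, random_state=0)

start = time.time()
skKMeans(n_clusters=5, n_init=1).fit(X)   # single initialization, one training run per library
print(f"scikit-learn: {time.time() - start:.2f}s")

X_gpu = cp.asarray(X)                     # move the data to the GPU
start = time.time()
cuKMeans(n_clusters=5).fit(X_gpu)
print(f"cuML:         {time.time() - start:.2f}s")
```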

Scaling up & out — Giving users the power to start small and grow with problem sizes

Our multi-node, multi-GPU algorithms run inside the Dask environment, making it simple to load massive datasets into a distributed Pandas or cuDF dataframe and use machine learning models on GPUs. cuML’s multi-node, multi-GPU architecture uses the one-process-per-GPU (OPG) paradigm, meaning each Dask worker process runs CUDA kernels on a single GPU. The open-source dask-cuda project makes it simple to start a set of OPG Dask workers.
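
For example, dask-cuda’s LocalCUDACluster starts one worker per GPU on the local machine (a minimal sketch; multi-node clusters are typically launched with the dask-cuda-worker command on each node instead):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()   # one Dask worker process per visible GPU (OPG)
client = Client(cluster)       # cuML's Dask estimators pick up this client
```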

Once trained, the centroids are kept on a single Dask worker. Prediction is done in an embarrassingly parallel fashion, by broadcasting the centroids to the workers containing data. By moving the centroids, rather than the data, cuML can scale to large numbers of GPUs and nodes. This general design also enables cuML to store and load multi-node multi-GPU models much the same way as the single-GPU model.

To get a feel for cuML’s performance against a CPU-based algorithm in the Dask environment, I benchmarked training times for cuML’s k-means directly against Dask-ML’s k-means. Both algorithms were benchmarked on one or two DGX1 machines, connected by Infiniband, using 8 workers per node.⁴ As mentioned previously, the DGX1 contains Intel Xeon E5-2698 v4 CPUs @ 2.20GHz.

Both algorithms used the scalable k-means++ initialization method. The maximum-iterations hyperparameter was kept at its default of 300, and the convergence tolerance at its default of 0.0001.

Comparison of training times between cuML and Dask-ML

Benchmarking multi-node multi-GPU cuML & Dask-ML k-means on DGX1s

To see how cuML k-means compares to Spark-based algorithms, we trained k-means centroids with HiBench, an open-source big-data benchmark suite that includes Spark implementations. We ran HiBench on the Google Cloud Platform with an input matrix containing 1.2 billion samples and 50 features, using 5 clusters. Preliminary results show that cuML on 4 DGX1 machines with Infiniband (32 total GPUs) is nearly 200x faster than Spark executing on 20 nodes, and 22x faster than Spark on 50 nodes.

Demonstrating how cuML’s k-means follows familiar APIs
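
Fitting and predicting with the distributed estimator looks nearly identical to the single-GPU and scikit-learn versions. A hedged sketch, assuming the Dask client from above and illustrative file paths (the cuml.dask module layout reflects the 0.9-era API and may change):

```python
import dask_cudf
from cuml.dask.cluster import KMeans

# A distributed cuDF dataframe, partitioned across the GPU workers
ddf = dask_cudf.read_csv("data/points-*.csv")

# Same estimator interface as the single-GPU and scikit-learn versions
kmeans = KMeans(n_clusters=5, init="scalable-k-means++", max_iter=300, tol=1e-4)
kmeans.fit(ddf)
labels = kmeans.predict(ddf)   # centroids are broadcast to the workers holding data
```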

What’s Next?

In the near-term we will be revamping our existing Dask-based linear regression, TSVD, and nearest neighbors implementations to follow cuML’s new one-process-per-GPU design. PCA and logistic regression are also on this list, and should be released around the same time. These will follow a very similar multi-node multi-GPU design to k-means.

I am also writing a more detailed blog on the underlying architecture of our multi-node, multi-GPU algorithms and how our thinking has evolved as we have developed them.

You can try out our MNMG K-means on many cloud providers, and RAPIDS will run on any Pascal or newer NVIDIA GPU architecture. Join us, and do data science at the speed of imagination. Einstein would approve.

¹ The K-means algorithm has been around, in some form, for a very long time. While often credited to James MacQueen, from a paper he wrote in 1967, there is evidence that it was used even earlier. For example, it bears a striking resemblance to the compression method used by Bell Labs for signal processing in 1957, which was developed by Stuart P. Lloyd.

² This is unlike Gaussian mixture models, which can assign data points to many clusters of varying elliptical shapes and sizes.

³ Quantizing restricts the space of possible values each column of a vector can take on. This is done by replacing the input signals with their nearest centroids.

⁴ One thing to note is that the cuML k-means benchmarks were able to make use of Infiniband during training, while Dask-ML’s k-means was using only the TCP connection between the Dask workers to share data. Once UCX is fully integrated into Dask, Dask-ML will also be able to take advantage of Infiniband, which should provide an increase in performance.


Corey Nolet is a Data Scientist & Senior Software Engineer at NVIDIA, where he leads the development of distributed machine learning algorithms on RAPIDS cuML.