Scaling kNN to New Heights Using RAPIDS cuML and Dask

A distributed multi-node, multi-GPU implementation now available

Victor Lafargue
RAPIDS AI
12 min read · Dec 9, 2020


When trying to predict an outcome or feature of a particular observation in a dataset, one of the most natural and intuitive options is to consider “similar” observations and check their outcomes. k-Nearest Neighbors (or “kNN”) formalizes this intuition into a simple machine learning method. For each new observation, it finds the k most “similar” observations with known outcomes, where similarity is defined by a distance metric.

kNN can be used to build classification or regression models, or it can be the backbone of a search algorithm, like the content-based image search example in this blog.

For the purpose of the search, observations are gathered into a data structure known as the index. New, unlabelled observations upon which the search is performed are called the query. The complexity of the kNN search is dependent upon the size of the index, query, and number of features. Fortunately, the kNN algorithm can greatly benefit from the parallelization of some of its operations. That’s where GPU-accelerated kNN comes into play. Indeed, the RAPIDS cuML implementation of kNN search on GPU, based on Facebook’s FAISS library, provides impressive speedups (see blog post).
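To make the single-GPU building block concrete, here is a minimal sketch of what such a search looks like with cuML's scikit-learn-style API (the array sizes are illustrative, not the benchmark settings):

```python
# Minimal single-GPU kNN search sketch with cuML (illustrative sizes).
import cupy as cp
from cuml.neighbors import NearestNeighbors

n_features = 256
index = cp.random.random((100_000, n_features)).astype(cp.float32)  # reference observations
query = cp.random.random((800, n_features)).astype(cp.float32)      # observations to search for

nn = NearestNeighbors(n_neighbors=8)        # k = 8, Euclidean distance by default
nn.fit(index)                               # build the index on the GPU
distances, indices = nn.kneighbors(query)   # both results have shape (800, 8)
```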

However, this implementation is limited by the amount of data that can fit on a single GPU, which can be insufficient for extremely large datasets. The solution is to scale the algorithm by splitting the workload across multiple GPUs. This is now possible thanks to the multi-node, multi-GPU (MNMG) kNN algorithm in cuML. This article takes a deep dive into the design of the distributed operations and their scaling behavior.

This article aims to show how the distributed kNN algorithm works internally and how it can be used in your projects. More specifically, it presents how operations are distributed across a cluster of workers, including how the workers communicate to build an index and collaborate on queries. It then discusses the effect of this design on the scaling of the algorithm. Finally, it provides a demonstration of a realistic use case for the distributed algorithm.

How does distributed kNN work?

Topology

cuML’s implementation of distributed k-Nearest Neighbors uses a fully connected network topology: multiple workers, each attached to a single GPU, interact directly with one another. For more information on cuML’s distributed models, an introductory description is available: ML in Python.

The scheduler first lists the workers available in the cluster. The client then initiates the loading of the data, which is partitioned across the workers, and triggers the distributed operations that perform the search. All of the workers run the same set of instructions on their local data partitions.
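As a point of reference, here is a minimal sketch of how such a cluster and client might be created on a single machine with Dask-CUDA (the worker count simply follows the GPUs visible in your environment):

```python
# Minimal sketch: one-node GPU cluster with one Dask-CUDA worker per visible GPU.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()   # spawns a scheduler plus one worker per GPU
client = Client(cluster)       # the client loads data and triggers distributed operations

# The scheduler keeps track of the available workers.
print(list(client.scheduler_info()["workers"].keys()))
```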

Partitions

The data needed for the distributed k-Nearest Neighbors search is stored in partitions spread among the workers, forming the following distributed arrays or dataframes:

  1. The index: contains the reference observations, size (n_references x n_features)
  2. The query: contains the observations for which to find the nearest neighbors, size (n_queries x n_features)
  3. The indices: follows the distribution of the query; filled at the end of the distributed operations with the positions, within the index, of the nearest neighbors of each query point; size (n_queries x n_neighbors)
  4. The distances: follows the distribution of the query; filled at the end of the distributed operations with the distances from each query point to its nearest neighbors; size (n_queries x n_neighbors)

Workers can have multiple partitions of different sorts (index and query) simultaneously.
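The sketch below shows one way these distributed inputs might be constructed as GPU-backed Dask arrays. The shapes and chunk sizes are placeholders; in a real workload the data would typically be loaded from storage with dask_cudf or dask.array:

```python
# Sketch: building the distributed index and query as Dask arrays with CuPy-backed chunks.
import cupy as cp
import dask.array as da

n_references, n_queries, n_features = 2_000_000, 800, 256

# Index partitions of 500K rows each, spread across the workers.
index = da.random.random((n_references, n_features),
                         chunks=(500_000, n_features)).map_blocks(cp.asarray)

# A single query partition (see the performance discussion further below).
query = da.random.random((n_queries, n_features),
                         chunks=(n_queries, n_features)).map_blocks(cp.asarray)

# After the search, the indices and distances outputs follow the query partitioning
# and both have shape (n_queries, n_neighbors).
```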

Distributed Operations

Queries are processed sequentially as a series of batches. For each batch of queries:

a) Query broadcasting: The sole owner of the batch of queries currently being processed broadcasts it to all of the other workers.

b) Local kNN: Every worker holding at least one index partition runs a local kNN, searching its locally stored index partitions for the nearest neighbors of the current batch of queries. The resulting indices and distances are the local nearest neighbors of these query points.

c) Local kNN results gathering: The workers that performed the previous step send their local kNN results back to the owner of the current batch, which collects all of this data. They are then free to start processing the next batch of queries.

d) Reduce: The owner of the current batch then performs a reduce operation on the data it received, merging the results of the local kNN searches (a sketch of this merge appears below). This produces, for each query, a set of global nearest-neighbor indices along with the corresponding distances between the query and its newly found nearest neighbors.

At the end of the distributed operations, once all of the queries have been processed, the user has one distributed array holding the indices of the nearest neighbors of each query and a second one holding the distances between each query point and its nearest neighbors.
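To illustrate the reduce in step (d), here is a conceptual CuPy sketch of the merge. The real cuML implementation performs this with optimized CUDA kernels and takes care of translating partition-local indices into global ones; this snippet assumes the per-worker indices already refer to global positions in the index:

```python
# Conceptual sketch of merging local kNN results into global nearest neighbors.
import cupy as cp

def merge_local_knn(local_distances, local_indices, k):
    """local_distances / local_indices: one (n_queries, k) array per worker."""
    dist = cp.concatenate(local_distances, axis=1)   # (n_queries, k * n_workers)
    idx = cp.concatenate(local_indices, axis=1)
    order = cp.argsort(dist, axis=1)[:, :k]          # keep the k smallest distances per query
    rows = cp.arange(dist.shape[0])[:, None]
    return dist[rows, order], idx[rows, order]       # global (n_queries, k) results
```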

The distributed algorithm is designed to optimize GPU usage in the cluster and to minimize communications and idle time of workers.

Technologies and Performance

Like all distributed algorithms in cuML, MNMG kNN builds on the open source Dask library. Dask is a Python library for distributed computing that allows a user to create a cluster with a scheduler and a set of workers. Each Dask-CUDA worker is attached to a dedicated GPU. The Dask client can then connect to the cluster and distribute data on it: the index, query, and outputs (indices and distances) are all stored as distributed Dask arrays or dataframes.

(Dask-CUDA is an addition to Dask that allows the creation of GPU clusters while optionally configuring networking via the high-performance UCX-Py library.)

Once the data has been distributed to the workers, the workers form a clique that can start the distributed operations. The communications required during that work do not go through Dask, but rather can leverage the low-level RAPIDS RAFT comms library, which is an abstraction over the UCX and NCCL GPU-aware networking layers.
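Putting the pieces together, here is a minimal sketch of the distributed API, reusing the client, index, and query objects from the earlier sketches. The scikit-learn-style fit/kneighbors workflow mirrors the single-GPU example:

```python
# Sketch: running the MNMG kNN search on the Dask-CUDA cluster created earlier.
from cuml.dask.neighbors import NearestNeighbors

model = NearestNeighbors(n_neighbors=8, client=client)
model.fit(index)                               # registers the distributed index partitions
distances, indices = model.kneighbors(query)   # distributed, query-aligned outputs

# Materialize the results locally; each has shape (n_queries, n_neighbors).
distances, indices = distances.compute(), indices.compute()
```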

The performance of distributed kNN depends on the partitioning of index and query distributed arrays at the start of the algorithm and on the performance of the underlying networking interconnect used.
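When the hardware supports it, these interconnects can be enabled at cluster creation time. The following sketch shows the relevant Dask-CUDA options; whether NVLink and InfiniBand are actually usable depends on your machines:

```python
# Sketch: a Dask-CUDA cluster configured to use UCX with NVLink and InfiniBand.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(protocol="ucx",        # UCX-Py transport instead of TCP
                           enable_nvlink=True,    # intra-node GPU-to-GPU transfers
                           enable_infiniband=True)
client = Client(cluster)
```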

Benchmark and Scaling Behavior

This benchmark first aims to showcase the speedups achievable with GPU-accelerated distributed kNN in high-performance computing settings. It should also give a better understanding of when it is appropriate to distribute your kNN work. To this end, the scaling behavior is studied with regard to different parameters: space dimensionality, number of samples in the index, number of samples in the query, and number of nearest neighbors returned.

Many developers are already using the Scikit-Learn kNN API, so we want to start by comparing cuML’s GPU-based implementation with that baseline.

This chart showcases the speedups achievable with cuML’s GPU-accelerated and distributed kNN. It was produced on an NVIDIA DGX server featuring 8 A100 GPUs with 40GB of memory each. The client and the workers both had appropriately sized RMM pools allocated (RMM is the RAPIDS CUDA memory management library; it aims to make the best use of GPU memory). The cuML implementation was distributed over 2 and 4 GPUs and compared to the Scikit-Learn implementation used as a reference, which runs on all CPU cores (n_jobs=-1). The search was performed in a 256-dimensional space with 800 queries and was configured to return the 8 closest neighbors, while the number of samples in the index was progressively increased.

As shown in the chart, the larger the index gets, the higher the speedup compared to the reference implementation. The 4-GPU setup starts off slightly slower than the 2-GPU one because it requires a longer initialization and synchronization phase, but it progressively becomes the best option as the problem size grows. Note that this benchmark was produced with index partitions of a convenient size of 500K samples each; even higher speedups are sometimes achievable with a smaller number of large partitions.

This figure compares cuML’s single-GPU and distributed implementations to Scikit-Learn’s implementation that is used as a reference:

These tests were performed on an HPC cluster featuring two NVIDIA DGX servers, each with 8 Tesla V100 GPUs with 16 GB of memory. For these measurements, Dask-CUDA had InfiniBand and NVLink enabled. The client and the workers both had appropriately sized RMM pools allocated.

These charts compare the runtime of kNN searches of different levels of complexity for the three implementations. By default, a search had an index filled with 50,000 samples and a query made up of 800 samples; the index and query had 256 features, and the 8 closest neighbors were returned. On each chart, one of these parameters was varied across different values. The cuML search was distributed over 2 GPUs. These relatively small values already make it possible to visualize how cuML’s distributed implementation behaves in comparison to a reference CPU implementation.

The runtime of the CPU implementation increases linearly with the space dimensionality and grows with the product of the index and query sizes (O(n_samples * n_queries)). On the other hand, the runtime of distributed kNN remains roughly stable for these small values; its scaling behavior is showcased below for larger values.

Most of the speedup in these charts is not attributable to the distribution of the algorithm but to its parallelization on the GPU. In fact, the single-GPU implementation is often the faster solution for smaller data sizes, because it does not pay the cost of setting up the distributed work or of the communications between workers during the search itself.

This figure shows the runtime of the distributed kNN algorithm for different numbers of workers as the number of samples in the index increases. It illustrates strong scaling by comparing, for several problem sizes, the runtime across different numbers of workers. It also makes it possible to visualize how many samples can be processed in a given time by each cluster size, thus displaying weak scaling.

As you can see, in any configuration, the runtime appears to increase linearly with the number of samples. Note that in these measurements, all of the queries were part of a single query partition. Query partitions are processed sequentially and each additional query partition requires its share of communications.

More workers normally means more synchronizations and communications. Indeed, additional startup time is visible for the larger clusters at the start of the chart. It is especially noticeable when using more than one node, as inter-node communications can be slightly slower (see the 16-GPU case). However, as the number of samples increases, the proportion of the runtime spent on communications shrinks. For large datasets, communications add only minimal overhead, which is largely compensated for by the distribution of the work.

This chart displays the runtime of different cluster configurations for different numbers of queries. The queries were all part of a single query partition on one of the workers. The batch_size parameter, which sets the maximum number of queries processed at once within a given query partition, was set to 16384.

We can observe that below that limit the runtime is almost unaffected by increases in the number of queries. Above that limit, each query batch is processed sequentially cluster-wide (see the distributed operations section above); with additional measurements, the chart would almost take the form of a step function. As before, a larger number of workers scales better. However, unlike the chart with increasing index size, the incentive to add more workers is weaker. This is explained by the sequential processing of query batches, the additional communications, and the multiple worker synchronizations required by the distributed operations.

Indeed, the number of communications (batching mechanism aside) can be computed as: n_communications = 2 * n_query_partitions * n_workers. To have more queries processed in parallel, it is recommended to limit the number of query partitions whenever possible and to increase the batch_size model parameter, as in the sketch below.
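Here is a short sketch of those two recommendations, continuing with the objects from the earlier sketches. The batch_size keyword is assumed to be forwarded to the model as described above, and the value shown is only a placeholder to tune against available GPU memory:

```python
# Sketch: one query partition, larger batch_size, and the resulting communication count.
from cuml.dask.neighbors import NearestNeighbors

query = query.rechunk((query.shape[0], query.shape[1]))   # keep all queries in one partition

model = NearestNeighbors(n_neighbors=8, batch_size=32768, client=client)
model.fit(index)
distances, indices = model.kneighbors(query)

# Communication count, batching mechanism aside:
n_workers = len(client.scheduler_info()["workers"])
n_query_partitions = query.numblocks[0]
n_communications = 2 * n_query_partitions * n_workers
```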

Finally, running a distributed kNN search for a larger number of nearest neighbors or in a larger space dimensionality doesn’t seem to affect the runtime much. Increasing these two parameters results in larger exchanges between workers. However, larger exchanges account for very minimal overhead. The additional work on the workers is largely sped up thanks to the very performant GPU acceleration of the algorithm. For these two parameters, the scaling behavior is not related to the distribution of the algorithm but largely more to the parallelization on GPU.

Additional note on performance: when working with GPUs with limited memory, and since the RMM pool allocation may already take up considerable space, you may run into out-of-memory errors. These are caused by large index partitions being processed at once on the workers and can be prevented by creating a larger number of smaller index partitions. Be aware, however, that this may negatively impact performance, as the several index partitions held by a given worker are processed sequentially.
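The sketch below shows both knobs mentioned here, reusing the index array from the earlier sketch: a bounded RMM pool per worker and smaller index partitions. The pool and chunk sizes are placeholders to adapt to your GPUs:

```python
# Sketch: limit the per-worker RMM pool and use smaller index partitions.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(rmm_pool_size="12GB")   # leave headroom on a 16 GB GPU
client = Client(cluster)

# Re-partition the index into smaller chunks so each local kNN step needs less memory.
index = index.rechunk((250_000, index.shape[1]))
```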

Demonstrating kNN for large-scale image search

Google’s 2020 Landmark Recognition Challenge was a Kaggle contest to build the best model for determining which famous landmark is contained in a given picture (e.g., is this a picture of the Eiffel Tower or the Sears Tower?). To solve this problem, competitors typically used methods that converted images to embeddings: lower-dimensional, numeric representations of an image’s contents. The winning solution was designed by Christof Henkel from NVIDIA and Philipp Singer from H2O.ai. To read more about it, you can take a look at their publication.

This demonstration shows how to quickly perform an exact search on very large datasets using GPU-accelerated distributed kNN. The embeddings of a random selection of the images are used to form the query of a distributed kNN search. For each query embedding, the algorithm looks for its set of closest embeddings in the index. The images corresponding to this set should show the same landmark or otherwise be somewhat visually similar to the query image.

In this demonstration, we used the winning model to generate embeddings for a curated selection of approximately 1.5 million images. Each embedding has 512 components relevant to the recognition task, and the index containing all of the embeddings amounts to approximately 3 gigabytes.
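Under those assumptions, the retrieval step is just another distributed kNN search. The sketch below supposes the embeddings have already been loaded into a GPU-backed Dask array named embeddings, with the selected query embeddings in query_embeddings (both names are placeholders):

```python
# Sketch: image retrieval as a distributed kNN search over embeddings.
from cuml.dask.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=8, client=client)
knn.fit(embeddings)                                        # index: ~1.5M x 512 embeddings
distances, neighbor_ids = knn.kneighbors(query_embeddings)

# Row i of neighbor_ids gives the positions of the 8 images whose embeddings are
# closest to query image i; mapping those positions back to files yields the results.
```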

Here is a visualization of the results. You can see the query image at the top of the animation and the 8 closest-neighbor images below it.


Wrapping Up

Single-GPU acceleration already provides a great boost to kNN searches. However, for some applications, another level of scalability is required to run searches on larger datasets. Distributing k-Nearest Neighbors across multiple GPUs answers this need. cuML’s distributed implementation of kNN offers a scalable design, making the best use of the technologies offered by a modern HPC GPU cluster.

The kNN model provides you with the observations most similar to a given query. It is then possible to use their outcomes to form a prediction. That’s where the kNN Classifier and Regressor models come into play! These are already part of cuML (in single-GPU and distributed versions) and will be the topic of a forthcoming blog post!

Stay tuned by reading our blog posts! If you run into any issues, please do let us know via the cuML issue tracker. You can also get in touch with us via Slack, Google Groups, or Twitter. We always appreciate user feedback.
