Fast Support Vector Classification with RAPIDS cuML

Tamás Fehér
Published in RAPIDS AI · 9 min read · May 4, 2020

Support vector machines (SVMs) are supervised learning methods that can be used for classification and regression. In this post, we will discuss how you can use the SVM package in RAPIDS cuML to perform fast support vector classification on a GPU. Some data scientists shy away from trying SVMs because their training time grows quadratically with the number of samples and can become prohibitive on a CPU. cuML SVM can provide a 500x speedup relative to scikit-learn SVM, and it is up to 50x faster than the multi-threaded ThunderSVM library running on a CPU. With a GPU-accelerated version, large-scale SVMs become quite practical, and training on a million examples is possible within a few minutes.

We will start with a quick overview of SVM classification, followed by a demonstration of how to use cuML SVM. Afterward, we present benchmarks of its performance. Finally, we explain how the training is implemented in cuML.

Support Vector Classification

SVMs are flexible classifiers that combine the idea of large-margin classifiers with the kernel trick to create a sparse ML model. While it is a beautiful topic in itself, we will not go into the theory of SVMs here. The interested reader should refer to the textbook by Bishop or the original paper by Cortes and Vapnik. There are also several blog posts that provide an introduction to SVMs.

It is easy to get an intuition about what an SVM does from a simple example. The following figure presents a 2D dataset with two classes, labeled +1 and -1 (blue and orange dots). A linear SVM separates the two classes with a straight line, chosen to have maximal distance from any of the data points. Points to the left of the separator are classified as +1, while points to the right are classified as -1. It is important to note that only a few samples from the dataset are needed to define the separating line: the samples closest to the separator (x₁, x₂, x₃). These are the support vectors. The normal vector (w) of the separator can be expressed as a linear combination of these vectors.

Mathematically, our aim is to find a decision function F(x) that tells us, for any input vector x, which class it belongs to. The predicted class is the sign of the decision function (the separator in the figure represents F(x) = 0). In general, the decision function has the form
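
F(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b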

Here xᵢ and yᵢ are the training vectors and labels, the sum runs over all training vectors, and the coefficients αᵢ and the bias b are the unknown values that we need to find when we train the SVM. The coefficient αᵢ is non-zero for only a subset of the training vectors, which are called the support vectors. In this way, SVMs belong to the family of sparse models. This is visible in the figures above, where the support vectors are highlighted with black circles.

K is the kernel function. There are great visualizations of how kernel functions let an SVM access a high-dimensional feature space where the data is more easily separable, without actually mapping the training vectors into that higher dimensional space. This allows us to have a non-linear decision boundary in the original feature space.

Classification Using cuML SVM

Using cuML SVM is simple. The SVC classifier in cuML is designed to be a drop-in replacement for scikit-learn’s SVM module. To illustrate how to use it we first generate a dataset:

from cuml.svm import SVC
from sklearn.datasets import make_moons

# Generate a 2D two-class dataset (two interleaving half circles)
X, y = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=137)

The created dataset is illustrated above in the introduction. We create the classifier and set its parameters:

svc = SVC(kernel='rbf', C=10, gamma=1, cache_size=2000)

We have to define the kernel function and its parameters here. Popular choices for the kernel functions are:

  • Linear ('linear'): K(x₁, x₂) = x₁ · x₂, i.e. the regular dot product
  • Polynomial ('poly'): K(x₁, x₂) = (γ x₁ · x₂ + coef0)ᵈ
  • Radial basis function ('rbf'): K(x₁, x₂) = exp(−γ |x₁ − x₂|²)
  • Sigmoid ('sigmoid'): K(x₁, x₂) = tanh(γ x₁ · x₂ + coef0)
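
The kernel parameters above map to the classifier's degree, gamma, and coef0 arguments. For example, a cubic polynomial kernel with γ = 1 and coef0 = 1 would be configured as:

# d = 3, γ = 1 and coef0 = 1 in K(x₁, x₂) = (γ x₁ · x₂ + coef0)ᵈ
svc_poly = SVC(kernel='poly', degree=3, gamma=1, coef0=1)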

Parameter C is a penalty parameter that controls overfitting: smaller values of C decrease overfitting. One would typically use cross-validation to select C. The difficulty is that C interacts with the other parameters. For example, in the case of an RBF kernel, it is recommended to search a 2D grid of C and gamma, as examples from scikit-learn and Towards Data Science demonstrate.
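
As a minimal sketch of such a grid search (assuming cuML's SVC exposes enough of the scikit-learn estimator interface for GridSearchCV in your version), one could write:

from sklearn.model_selection import GridSearchCV

# Search C and gamma jointly on a 2D grid; the two parameters interact,
# so tuning them together works better than tuning each in isolation.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)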

Calculating the kernel function can be time-consuming, so we use a software-defined LRU cache in GPU memory to store kernel matrix values. The cache_size parameter defines its size; a larger cache usually leads to better performance. In this example, we use 2000 MiB.

After the parameters of the classifier are set, we are ready to fit it to the training data:

svc.fit(X, y)

After the model is fitted, we can make predictions the usual way:

y_pred = svc.predict(X)

Until cuML 0.13, the predicted array was stored on the GPU as a cuDF Series object, and you could use y_pred.to_array() to get a copy on the host. Recently, we have made the input and output formats more configurable. See the cuML notebook for a complete example.
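
For example, a sketch of requesting host-side NumPy output globally (assuming a cuML version that provides set_global_output_type):

import cuml

cuml.set_global_output_type('numpy')  # return results as NumPy arrays on the host
y_pred = svc.predict(X)               # now a numpy.ndarray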

Benchmarks

To illustrate the performance benefits of our SVM package, we compare cuML SVC to scikit-learn's support vector classifier and to ThunderSVM. Under the hood, scikit-learn uses LIBSVM, a popular CPU-based SVM package, and it only interfaces with the serial version of LIBSVM. For a more challenging comparison, we also measure against ThunderSVM, which has both a parallel CPU and a GPU implementation.

We use the forest cover type dataset for our benchmark. This is originally a multi-class classification problem; here, it was transformed into a binary problem by checking whether the cover type is "lodgepole pine" (class 2) or not. We scaled the first ten columns of the data. The remaining columns are binary and were left unscaled to preserve their sparsity pattern. We used the RBF kernel with C=1 and gamma=1 for the benchmarks.
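
The preprocessing can be sketched as follows (an illustrative sketch, not the exact benchmark script):

import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.preprocessing import StandardScaler

data = fetch_covtype()
X = data.data.astype(np.float32)
y = (data.target == 2).astype(np.float32)  # binary label: lodgepole pine or not

# Scale only the first ten (continuous) columns; the remaining binary
# columns are left untouched to preserve their sparsity pattern.
X[:, :10] = StandardScaler().fit_transform(X[:, :10])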

For the GPU measurements, an NVIDIA V100 GPU was used. For the CPU measurements, we used a server with two Intel(R) Xeon(R) Gold 6148 CPUs (40 cores in total).

Time to Train

Training a support vector classifier is resource-intensive: for this dataset and parameter range, the execution time scales with O(n_samples²). We measured the time to fit an SVC over a range of input sizes (randomly chosen subsets of the dataset).

The CPU implementations have very long execution times. The largest training set on which we fitted scikit-learn contained 100k samples. At that point, scikit-learn takes 1126 seconds to fit, while cuML finishes in 2.1 seconds, a 500x speedup! For larger sample sizes, we only tested the parallel SVM packages. cuML's single-GPU SVM package is 50x faster than ThunderSVM-CPU on 40 CPU cores, because GPUs excel at the time-consuming kernel function calculations. The middle figure zooms in on the curves showing GPU training time for cuML and ThunderSVM. Training with cuML is 22% faster than with ThunderSVM-GPU for this dataset.

We should note that scikit-learn (LIBSVM) uses double-precision floating-point numbers, while cuML SVC supports both single (fp32) and double (fp64) precision. Single precision delivers 2x higher throughput while maintaining high accuracy. For the tests above, ThunderSVM was used in its default configuration, which is single precision.

Time to Predict

While the training time strongly depends on the dataset and the model parameters, the prediction time can be described by a simple formula: O(n · n_sv · m), where n is the number of samples to predict, n_sv is the number of support vectors, and m is the number of features.

For each of the classifiers trained in the previous section, we measured the prediction time, using n = n_train vectors for the prediction. It is worth noting that nonlinear SVMs can still have a large number of support vectors, up to O(n_train). Since both n and n_sv grow with the training set size, the prediction time shows a quadratic trend in our measurements. The prediction speedup of cuML SVM is even more impressive than its training speedup: it is more than 4x faster than ThunderSVM on GPU. Compared to ThunderSVM-CPU it is 88x faster, and compared to scikit-learn at 100k samples, it is 1000x faster.

You can find more details about these measurements in our benchmark notebook.

Implementation Details

How do we select the support vectors and their coefficients? Our goal is to maximize the margin (the distance of the separating line from the dataset). This leads to a quadratic optimization problem for the alpha parameters. There are several efficient parallel solvers for SVMs [Tavara]; cuML SVM is a fresh addition to this list. The cuML solver integrates into the RAPIDS ecosystem, handles in-memory GPU data, and improves performance compared to existing implementations.
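
For reference, the standard dual form of this quadratic problem is

maximize over α:  Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
subject to:  0 ≤ αᵢ ≤ C for all i,  and  Σᵢ αᵢ yᵢ = 0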

To solve the optimization problem, we use the general decomposition idea of Osuna (see [Joachims] for a concise description). The original optimization problem aims to find the alpha parameters for all the training vectors. This is decomposed into a smaller problem: we select n_ws elements from the training vectors, which we call the working set, and optimize alpha for the working set while keeping the coefficients of vectors outside the working set constant. Afterward, we choose a new working set and repeat this process until convergence. A popular variant of this algorithm is Platt's sequential minimal optimization (SMO), which has a working set of only two elements. In this case, the optimization problem can be solved analytically [Platt, Keerthi].

The solver in cuML SVM is based on the optimal hierarchical decomposition idea by Vanek et al. (OHD-SVM). It also employs the general decomposition idea but applies it at two levels. The following figure illustrates how our solver works.

The first decomposition uses a working set with 1024 elements. To identify the best 1024 elements, we check the gradient of our optimization target and use a heuristic to select the best candidates. We precalculate all the kernel matrix elements for the working set.

To solve the optimization problem for the 1024 elements, we perform another decomposition: this time we select only two vectors and use SMO to solve the resulting subproblem. We run several iterations of the SMO solver (inner iterations) and calculate the change in the alpha parameters. This is implemented in a single GPU kernel. After the SMO solver finishes, we update the gradient, which is used in the next outer iteration to select a new working set.
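
To make the procedure concrete, here is a simplified, CPU-only NumPy sketch of this two-level scheme. It is illustrative only: it uses a basic first-order working-set heuristic, omits the bias computation, and leaves out all of the caching and GPU machinery of the real solver.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel K(a, b) = exp(-gamma * |a - b|^2)
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0))

def fit_svc(X, y, C=1.0, gamma=1.0, n_ws=64, max_outer=1000, n_inner=1000, tol=1e-3):
    # y must contain labels in {-1, +1}
    n = len(y)
    alpha = np.zeros(n)
    grad = -np.ones(n)  # gradient of the dual objective at alpha = 0
    for _ in range(max_outer):
        # Outer level: select the working set from the largest KKT violators
        up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
        low = ((y > 0) & (alpha > 0)) | ((y < 0) & (alpha < C))
        if not up.any() or not low.any():
            break
        score = -y * grad
        if score[up].max() - score[low].min() < tol:
            break  # KKT conditions satisfied up to tol
        half = n_ws // 2
        ws = np.unique(np.concatenate([
            np.where(up)[0][np.argsort(-score[up])][:half],
            np.where(low)[0][np.argsort(score[low])][:half]]))
        K_ws = rbf_kernel(X[ws], X, gamma)  # precompute kernel rows for the ws
        K_loc, y_loc = K_ws[:, ws], y[ws]
        a_loc, g_loc = alpha[ws].copy(), grad[ws].copy()
        # Inner level: SMO on the working set (a single GPU kernel in cuML)
        for _ in range(n_inner):
            up_l = ((y_loc > 0) & (a_loc < C)) | ((y_loc < 0) & (a_loc > 0))
            low_l = ((y_loc > 0) & (a_loc > 0)) | ((y_loc < 0) & (a_loc < C))
            if not up_l.any() or not low_l.any():
                break
            s = -y_loc * g_loc
            i = np.where(up_l)[0][np.argmax(s[up_l])]
            j = np.where(low_l)[0][np.argmin(s[low_l])]
            if s[i] - s[j] < tol:
                break
            # Analytic update of the pair (i, j), clipped to the box [0, C]
            quad = max(K_loc[i, i] + K_loc[j, j] - 2 * K_loc[i, j], 1e-12)
            delta = min((s[i] - s[j]) / quad,
                        C - a_loc[i] if y_loc[i] > 0 else a_loc[i],
                        a_loc[j] if y_loc[j] > 0 else C - a_loc[j])
            a_loc[i] += y_loc[i] * delta
            a_loc[j] -= y_loc[j] * delta
            g_loc += y_loc * delta * (K_loc[:, i] - K_loc[:, j])
        # Propagate the change in alpha on the working set to the full gradient
        d_alpha = a_loc - alpha[ws]
        grad += y * (K_ws.T @ (y_loc * d_alpha))
        alpha[ws] = a_loc
    return alpha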

There are a few control parameters for the training procedure:

  • tol: convergence is checked up to this tolerance. It defines the allowed deviation from the Karush–Kuhn–Tucker optimality conditions (see [Keerthi] for the definition).
  • max_iter: the maximum number of outer iterations.
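
Both are exposed as constructor arguments, for example:

svc = SVC(kernel='rbf', C=10, gamma=1, tol=1e-3, max_iter=500)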

All three GPU solvers mentioned here (cuML, ThunderSVM, and OHD-SVM) use the same optimal hierarchical decomposition idea, but they differ in the heuristics for selecting the working set and in how the kernel values are calculated and cached. Additionally, cuML SVM contains a few improvements over the other implementations mentioned above. To ensure the best performance, we use efficient block and device reductions from the CUB library. For the RBF kernel, we merge the epilogue calculation with the distance calculation using CUTLASS to save memory bandwidth.

Known Limitations and Future Work

The cuML SVM package is still in development, and it does not yet offer as wide a range of functionality as LIBSVM or ThunderSVM. Most notably, it only implements binary classification and regression; it does not have nu-SVM or one-class SVM. The cuML team is working to address these limitations. Additionally, cuML SVM does not handle sparse input data. This is not a problem for the example presented here, where the training data has 22% non-zeros, but it is not ideal for very sparse datasets.

Conclusion

Support vector machines are popular classification models with a beautiful mathematical theory. RAPIDS cuML provides a fast support vector classifier to train and predict efficiently on GPUs.


Tamás is an AI developer technology engineer at NVIDIA. His work is focused on accelerating deep learning and machine learning workloads on GPUs.