Scaling Seldon’s Alibi with Ray — Making Model Explainability Easy and Scalable
By Bill Chambers, Anyscale and Alexandru Coca, Seldon
Serving models at scale is only part of the battle. In general, you’re going to have to be able to explain predictions as well. The team at Seldon knows the challenges of working with a number of models and serving them in production. However, making sure that users understand why those predictions are made is another challenge entirely.
For this reason, Seldon started the Alibi library. Alibi is an open-source Python library for ML model inspection and interpretation. It allows users torun popular model explainability algorithms such as Kernel SHAP on their data.
However, while Alibi has a number of algorithms, it’s often that users seek better performance and scaling. The Seldon team came to Ray, an open-source framework that provides a simple universal API for building distributed applications, in search of a solution for scaling Alibi and wanted to test out how Ray performed using the Kernel SHAP algorithm.
A preview of the results can be seen below. In short, the Seldon team was able to achieve linear scaling of Kernel SHAPm by leveraging Ray Core, in just a few hundred lines of code.
This post will review these results as well as discuss how the team went about implementing the Distributed Kernel SHAP implementation. Be sure to check out the GitHub repository for the full code and results as well.
Let’s dive a bit deeper into Alibi, Kernel SHAP, how Seldon approached distributing Kernel SHAP, and the results achieved.
What is Alibi?
Alibi is a Python package designed to help explain the predictions of machine learning models and gauge the confidence of predictions. The focus of the library is to support the widest range of models using black-box methods where possible. Alibi currently ships with 8 different algorithms for model explanations including popular algorithms like anchors, counterfactuals, integrated gradients, Kernel SHAP, and Tree SHAP.
Recently, the Seldon team worked on a project where they needed to be able to scale the explainability computations over a large number of CPUs. This is necessary since black-box methods model the prediction function post-hoc and are slow, making the task of explaining large numbers of instances prohibitively expensive on a single CPU.
The Seldon team figured they could explore utilizing a larger machine, but many of their customers operate in distributed environments and therefore need consistently scalable solutions — beyond a single machine.
They came to the Ray team (join us on Slack!) asking about how they might go about scaling this workload. With pretty minimal effort, the Seldon team was able to scale Kernel SHAP in just a few short days.
Background on Kernel SHAP
As mentioned in the documentation, Kernel SHAP provides model-agnostic (black box), human interpretable explanations suitable for regression and classification models applied to tabular data. This method is a member of the additive feature attribution methods class; feature attribution refers to the fact that the change of an outcome to be explained (e.g., a class probability in a classification problem) with respect to a baseline (e.g., average prediction probability for that class in the training set) can be attributed in different proportions to the model input features.
A simple illustration of the explanation process is shown in Figure 1. Here we see depicted a model which takes as an input features such as Age, BMI or Sex and outputs a continuous value. We know that the average value of that output in a dataset of interest is 0.1. Using the Kernel SHAP algorithm, we attribute the 0.3 difference to the input features. Because the sum of the attribute values equals output — base rate, this method is additive.
We can see, for example, that the Sex feature contributes negatively to this prediction whereas the remainder of the features have a positive contribution. For explaining this particular data point, the Age feature seems to be the most important. See our examples on how to perform explanations with this algorithm and visualize the results using the SHAP library visualizations here, here and here.
The challenge: Global Explanations with Kernel SHAP
As explained above, Kernel SHAP is a method for explaining black box models on tabular. However, computing these explanations is very expensive, so explaining many model predictions to obtain a global view of the model behavior on a single CPU or machine is challenging since it entails explaining a large number of predictions. However, since the predictions are independent, you can parallelize the computation of the explanation for each prediction, which itself could result in a significant time saving given an efficient distributed computation framework. This is where Ray comes in.
In a few short days, the team was able to go from concept to prototype.
There are two architectural variations of the solution that the Seldon team built. The core architecture is simple, they created a pool of worker processes and then passed computation function to them to be parallelized and subsequently executed. However, there are two ways to create this pool.
One method uses a pool of Ray Actors, which consume small subsets of the 2560 model predictions to be explained. The other method uses Ray Serve instead of the parallel pool and we submit work to Ray Serve as a batch processing task. The code for both methods are available in the GitHub repository and both had similar performance results.
Single Node Results
The experiments were run on a compute-optimized dedicated machine in Digital Ocean with 32vCPUs. This explains the performance gains attenuation below.
The results obtained running the task using the ray parallel pool are below:
Distributing using Ray Serve yields similar results:
The experiments were run on a cluster consisting of two compute-optimized dedicated machines in Digital Ocean with 32vCPUs each. This explains the performance gains attenuation below.
The results obtained running the task using the ray parallel pool over a two-node cluster are shown below:
Distributing using Ray Serve yields similar results:
The team experimented with different batch sizes. The batch size controls how many predictions are sent to one worker simultaneously. The primary purpose in doing so was to ensure that the runtime of each explanation task is greater than the overhead of distributing the task; the explainer does not benefit from processing batches since each instance is explained independently of other instances in the batch. While the team did not conduct a detailed analysis of why larger batch sizes incur a slight runtime penalty, we suspect that distributing fewer explanation tasks that take longer to run could lead to inefficiencies in resource utilization (e.g., towards the end there are a few tasks left to execute but some workers are idle).
What to look for going forward with Seldon and Ray
Seldon and Ray make for a great combination. This post demonstrates how easy it was for Alibi to parallelize a given workload on Ray and in the future we look forward to sharing how these toolkits can be used together for everything from model training to serving.
You can also check us both out at Ray Summit. Ray Summit is a FREE Virtual Summit for all things Ray related!