Clipper: A High-Throughput, Low-Latency, Real-Time Inference Engine


Jan 8 · 8 min read

Classical ML applications generally consist of two phases. The first phase uses a large amount of data to train models, and the second draws inferences from previously unseen data. Much previous work in the ML infrastructure space has been about scaling the first phase, model training. This is where a lot of compute is spent, and previously discussed papers such as Parameter Server focus on using distributed systems to scale it. There hasn’t been a similar focus on scaling inference (prediction), where a flood of requests might overwhelm the serving system. Clipper focuses on this second phase of the ML workload. In addition, Clipper copes with the ever-growing list of ML frameworks such as TensorFlow, Caffe, Spark, and Scikit-learn by abstracting them behind a common layer. Since the intended use case is real-time applications, Clipper bounds latency while maximizing throughput. Clipper also provides novel ways to select a particular model (or use an ensemble) to improve the accuracy and reliability of predictions.

Challenges with real-time inference systems

  1. There are a large number of existing frameworks to deal with, and more are being developed. This poses a substantial challenge for application developers. In addition, machine learning applications go through modify-deploy-modify iterations all the time. A developer may, for example, need to move to a much bigger dataset and therefore to a distributed training framework that was not necessary when model development started. Sometimes developers have to deal with several frameworks at once, since some frameworks are optimized only for model training. Clipper addresses these issues by adding a model abstraction layer and providing a common interface between that layer and the various frameworks.
  2. For real-time applications, bounded latency on the order of a few milliseconds is important. Most ML frameworks are optimized for offline batch training rather than for low-latency inference. Batched inference queries can benefit from BLAS or GPU/SIMD parallelism, but the prediction paths of current ML frameworks are typically optimized for point queries. Some complex ML applications combine many models to improve accuracy, but this comes at the cost of increased tail latencies and straggler models that are slow to respond. Clipper improves throughput by adaptively batching queries while maintaining bounds on latency. Clipper also implements straggler mitigation, which reduces latency at the cost of some accuracy.
  3. Many applications test multiple models and try to choose the best one, either through offline testing on stale data or through online A/B testing. Offline analysis can be inaccurate because it relies on stale data, while online A/B testing ends with the selection of a static model that may not perform well in the future because of feature corruption or concept drift. In addition, some models may be better suited to certain regions or certain users (consider speech recognition). Clipper therefore provides an online model selection algorithm. It also improves accuracy by combining predictions from an ensemble of models.

System architecture

Clipper architecture abstracting multiple ML frameworks and model selection dealing with end user applications

To trace a request through Clipper, let’s start with the top layer. The front-end application issues an inference query to Clipper. The model selection layer then selects one or more models based on the feedback available for this application or query and dispatches the query to the model abstraction layer. The model abstraction layer first looks for a cached result and returns it if one exists. If not, the inference query is added to the queue for the given model container. This queue adaptively batches queries based on the model and framework before sending them to the model. Once results come back, the cache is populated with them. The model selection layer then combines the results into a final prediction and, if more than one model was used, also provides a confidence estimate. The front-end application uses this final result and provides feedback to the model selection layer, which improves future model selection.

Model abstraction layer:

The model abstraction layer provides two important functions: caching and batching.

Caching is done using two APIs. The first one is “Request”. When inference is needed, a query is registered with the cache using this non-blocking “Request” call. If the result is in the cache, it is returned. Otherwise, the API indicates that the result is not available and the cache issues the query to the model. The second call, “Fetch”, checks whether a result has become available, much like polling. Clipper uses LRU for cache eviction. Caching is especially useful for popular content, since it returns those answers directly and offloads work from the models.
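
To make the two calls concrete, here is a minimal sketch of such a cache. It is not Clipper’s actual implementation: the class name `PredictionCache`, the `put` method, and the `enqueue` callback are hypothetical names chosen for illustration, and eviction is plain LRU via an ordered dictionary.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Optional


class PredictionCache:
    """Toy LRU prediction cache with non-blocking request/fetch calls."""

    def __init__(self, capacity: int, enqueue: Callable[[str, Hashable], None]):
        self.capacity = capacity
        self.enqueue = enqueue          # hypothetical hook: hands a miss to the model's queue
        self.results = OrderedDict()    # (model, query) -> prediction, kept in LRU order
        self.pending = set()            # misses that were already sent to a model

    def request(self, model: str, query: Hashable) -> bool:
        """Return True on a hit; on a miss, enqueue the query once and return False."""
        key = (model, query)
        if key in self.results:
            self.results.move_to_end(key)   # mark as most recently used
            return True
        if key not in self.pending:
            self.pending.add(key)
            self.enqueue(model, query)
        return False

    def fetch(self, model: str, query: Hashable) -> Optional[object]:
        """Poll for a result; None means it is not available (yet)."""
        key = (model, query)
        if key not in self.results:
            return None
        self.results.move_to_end(key)
        return self.results[key]

    def put(self, model: str, query: Hashable, prediction: object) -> None:
        """Store a model's answer and evict the least recently used entry if full."""
        key = (model, query)
        self.pending.discard(key)
        self.results[key] = prediction
        self.results.move_to_end(key)
        if len(self.results) > self.capacity:
            self.results.popitem(last=False)
```

Keeping a `pending` set means that repeated requests for the same popular query are sent to the model only once while the first one is still in flight.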

Batching improves the throughput of machine learning frameworks by amortizing the cost of RPCs across multiple queries, by enabling data-parallel optimizations, and by making better use of system resources, for example when copying data to GPU memory. The batching layer knows which ML framework it is dealing with and batches queries accordingly. Throughput in this case comes at the expense of latency, since the batching layer now has to wait for the results of the whole batch. This is addressed with latency SLOs: applications define the SLO, and the batching layer maximizes throughput with that SLO as a constraint. Batching is adaptive, in that the batching layer tunes a maximum batch size so that the latency SLO is still met.

Adaptive batching tries to find the right maximum size for a batch. One way Clipper does this is with an “additive increase, multiplicative decrease” (AIMD) policy. That is, Clipper keeps increasing the batch size in constant increments while the SLO is being met. As soon as the latency SLO is missed, the batch size is decreased multiplicatively. This strategy is simple, works very effectively, and is the default that Clipper provides. Another approach the authors explored was to fit a quantile regression on the 99th percentile of tail latencies and set the maximum batch size accordingly. It performs similarly to AIMD but is harder to implement because of the computational complexity of quantile regression.
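
The AIMD policy fits in a few lines. The sketch below is illustrative only: the class name `AimdBatchSizer` and the default `step` and `backoff` values are assumptions, not values taken from Clipper.

```python
class AimdBatchSizer:
    """Additive-increase, multiplicative-decrease controller for the max batch size."""

    def __init__(self, slo_ms: float, step: int = 2, backoff: float = 0.9, initial: int = 1):
        self.slo_ms = slo_ms        # latency objective supplied by the application
        self.step = step            # additive increase (illustrative default)
        self.backoff = backoff      # multiplicative decrease factor (illustrative default)
        self.max_batch = initial

    def update(self, batch_latency_ms: float) -> int:
        """Feed in the latency of the last batch and get the new maximum batch size."""
        if batch_latency_ms <= self.slo_ms:
            self.max_batch += self.step                       # SLO met: probe a larger batch
        else:
            # SLO missed: shrink multiplicatively, but never below one query
            self.max_batch = max(1, int(self.max_batch * self.backoff))
        return self.max_batch
```

Calling `update` after every batch lets the limit climb while the model keeps up and back off as soon as a batch overshoots the SLO.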

Delayed batching is useful when the workload is bursty or moderate. In such cases, the optimal batch size chosen for the given framework may not be reached, so batches can be delayed by a few milliseconds to accumulate more queries. This approach benefits frameworks such as scikit-learn, which uses BLAS for optimization. Data from the paper shows a 3x improvement for scikit-learn when batches are delayed by up to 2 ms.
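
As a rough illustration of delayed batching, the function below waits for a first query and then gives the queue a short grace period to fill up. The function name `collect_batch` and its parameters are hypothetical, and a real serving system would run this in a dedicated dispatch loop per model.

```python
import queue
import time
from typing import List


def collect_batch(q: "queue.Queue", max_batch: int, max_delay_s: float) -> List[object]:
    """Block for the first query, then wait up to max_delay_s for more to arrive."""
    batch = [q.get()]                              # wait until at least one query exists
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                  # grace period is over, send what we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                                  # nothing else arrived in time
    return batch
```

With `max_delay_s=0.002` this corresponds to the 2 ms delay mentioned above: a small, bounded latency cost in exchange for fuller batches.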

Model Selection Layer:

The model selection layer dynamically selects the models that yield good results. The quality of results is established using feedback from the client applications. In addition, combining multiple models makes it possible to disregard failing ones. Clipper continuously learns from feedback on how the models perform, which removes the need for manual intervention in model selection.

Clipper provides four pieces for this: a state data structure S that contextualizes the model selection policy, a model selection API that depends on S, a way to combine the results of a query into a final prediction (optionally with a confidence score), and an “observe” API that updates S to reflect feedback. For the selection policy itself, Clipper provides two algorithms out of the box. One is Exp3, which is computationally cheaper because it selects a single model per query. The other is an ensemble approach that combines results from multiple models at the cost of more computation. Application developers can pick the policy that fits their needs.

Single model selection

In this approach, only one model is chosen at a time for a given query. Which model is selected depends on how the models have performed before. As in the k-armed bandit problem in reinforcement learning, there is a tradeoff between exploration and exploitation. Exp3 chooses a model and, based on the error feedback, updates the probability of selecting the same model again. Exp3 is much more efficient than manual experimentation or A/B testing. Some details on Exp3 are as follows:

Let $s_i$ be the confidence with which model $i$ is chosen and $L(y, \hat{y})$ the loss between the expected output $y$ and the obtained output $\hat{y}$. The probability of selecting model $i$ is $p_i = s_i / \sum_j s_j$, where the sum runs over all models. After feedback arrives, the confidence of the chosen model is updated as $s_i \leftarrow s_i \exp(-\eta L(y, \hat{y}) / p_i)$. Here $\eta$ can be thought of as a learning rate, that is, how quickly $s_i$ converges to an optimal solution.
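
A minimal sketch of this update rule in Python follows. The class name `Exp3Selector`, the method names, the default learning rate, and the small floor on the selection probability are assumptions made for illustration and are not part of Clipper’s API.

```python
import math
import random
from typing import List


class Exp3Selector:
    """Single-model selection using the multiplicative update described above."""

    def __init__(self, n_models: int, eta: float = 0.1, seed: int = 0):
        self.s = [1.0] * n_models            # per-model confidence, initially uniform
        self.eta = eta                       # learning rate (illustrative default)
        self.rng = random.Random(seed)

    def probabilities(self) -> List[float]:
        total = sum(self.s)
        return [s_i / total for s_i in self.s]

    def select(self) -> int:
        """Sample one model index with probability proportional to its confidence."""
        return self.rng.choices(range(len(self.s)), weights=self.probabilities(), k=1)[0]

    def observe(self, i: int, loss: float) -> None:
        """Apply s_i <- s_i * exp(-eta * loss / p_i) for the model that was used."""
        p_i = max(self.probabilities()[i], 1e-12)   # assumed floor to avoid dividing by zero
        self.s[i] *= math.exp(-self.eta * loss / p_i)
```

Dividing the loss by the selection probability means a rarely chosen model is penalized more heavily when it does get picked and fails, which compensates for how seldom it receives feedback.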

Ensemble model selection

Ensembles (multiple models) have been quite effective at providing accurate results, and many Kaggle competitions are won by ensembles. The overall idea is that, instead of a single model, many models trained on different subsets of the data can generalize and predict more effectively. Clipper uses linear ensembles that take a weighted average of the base model predictions. The weights are maintained with the Exp4 algorithm, whose details are out of scope for this discussion. The ensemble approach comes at the cost of extra computation, because multiple base models have to be evaluated for each query.
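
For classification, a weighted average of base model predictions can be read as a weighted vote. The sketch below shows only that combination step and deliberately leaves out how Exp4 maintains the weights. The function name `combine`, the example model names, and the use of the winning label’s weight share as the confidence score are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def combine(predictions: Dict[str, Hashable], weights: Dict[str, float]) -> Tuple[Hashable, float]:
    """Weighted vote over base-model labels; returns (label, confidence in [0, 1])."""
    if not predictions:
        raise ValueError("need at least one base-model prediction")
    votes = defaultdict(float)
    for model, label in predictions.items():
        votes[label] += weights[model]       # each model votes with its current weight
    total = sum(votes.values())
    best = max(votes, key=votes.get)
    return best, votes[best] / total         # share of the total weight behind the winner


# Two of three (hypothetical) models agree, so "cat" wins with the combined weight share.
label, confidence = combine(
    {"resnet": "cat", "svm": "dog", "forest": "cat"},
    {"resnet": 0.5, "svm": 0.3, "forest": 0.2},
)
```

When the models disagree, the confidence drops, which gives the application a signal it can act on.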

Having looked at both Exp3 and Exp4, model selection comes in very handy when a model degrades for some reason. Thanks to the constant feedback loop in Clipper, the selection policy shifts to other models and reduces the error rate. Below are the results of an experiment in which the authors simulated errors and Clipper responded by keeping the overall error rate low. In this case, errors were introduced in model 5 between queries 5K and 10K. While the error rate of model 5 suffers drastically, both Exp3 and Exp4 divert queries to other models and maintain a relatively low overall error rate. Once model 5 recovers, the error rate improves further under both Exp3 and Exp4.

Model 5 error rate spikes between 5K and 10K queries. Both Exp3 and Exp4 still keep the overall error rate low

Handling stragglers:

With ensemble methods, stragglers (models that take a long time to respond) are introduced, and they can hurt tail latencies. To address this, Clipper makes a design choice to return a less accurate prediction on time rather than an accurate prediction late. It does this by maintaining an SLO on latency: if a model’s result has not come back by the deadline, the prediction is made without it and lower confidence is conveyed back to the issuer. This best-effort straggler mitigation strategy effectively reduces the size of the ensemble for that query.
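
One way to picture this best-effort strategy is a majority vote over whichever models answer before the deadline. This is a sketch under assumptions, not Clipper’s mechanism: the function name `predict_with_deadline`, the use of a thread pool, and the confidence formula (the share of all models that agreed with the winner) are illustrative choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, Hashable, Optional, Tuple


def predict_with_deadline(models: Dict[str, Callable[[object], Hashable]],
                          query: object, slo_s: float) -> Tuple[Optional[Hashable], float]:
    """Majority vote over the models that answer within slo_s seconds.

    Confidence is the share of ALL models that agreed with the winner, so a
    missing straggler lowers the confidence instead of delaying the answer.
    """
    pool = ThreadPoolExecutor(max_workers=len(models))
    futures = [pool.submit(fn, query) for fn in models.values()]
    done, _late = wait(futures, timeout=slo_s)
    pool.shutdown(wait=False)                # do not block on stragglers
    votes: Dict[Hashable, int] = {}
    for fut in done:
        if fut.exception() is None:          # ignore models that failed outright
            label = fut.result()
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None, 0.0                     # nothing arrived in time
    best = max(votes, key=votes.get)
    return best, votes[best] / len(models)


def fast_model(query: object) -> str:
    return "cat"


def slow_model(query: object) -> str:
    time.sleep(1.0)                          # simulated straggler
    return "dog"


label, confidence = predict_with_deadline(
    {"a": fast_model, "b": fast_model, "c": slow_model}, "image", slo_s=0.2)
```

Here the straggler’s vote is simply missing, so the answer comes back on time with a confidence of two thirds rather than one.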


Context can improve the results of learning quite a bit. A model trained on a certain dialect, for example, would perform better for certain users or regions. The model selection state S introduced earlier helps with that: it can be instantiated per user, per session, or per some other context. This context-specific state is stored externally in a Redis database.


Clipper provides an ML-framework-agnostic approach to inference. This in itself can be a great blessing in the evolving landscape of ML frameworks. In addition, the built-in model selection is useful both for tolerating faults in individual models and for improving accuracy. Adaptive batching under latency SLOs keeps throughput high while bounding latency at inference time.
