Real-Time Prediction Serving, Simplified
Prediction serving infrastructure today is difficult to use and difficult to manage. Cloudflow is a dataflow DSL built on top of the Cloudburst stateful FaaS platform that enables users to easily construct, deploy, and scale prediction serving pipelines. These pipelines are constructed from familiar dataflow operators like filter. Cloudflow automatically optimizes pipelines using techniques like operator fusion and competitive execution. Combined with the serverless infrastructure, this allows Cloudflow to significantly outperform state-of-the-art prediction serving systems while simplifying the process of building pipelines. You can find the full paper here and the code here.
The spread of machine learning over the last decade has been fueled by tools that make it easier, faster, and cheaper to construct models at large scales: TensorFlow, PyTorch, scikit-learn, and so on. Indeed, the simplicity and power of these systems are partly responsible for the recent explosion in AI and the rapid improvements in model accuracy.
However, trained models have little value unless they are deployed to render the predictions that drive content recommendation, churn prediction, fraud detection, automated support, user sales, and so on. With the exception of a few specialized tasks (e.g., click-through rate prediction, ads), this process of serving predictions has received relatively little attention in both industry and academia. We believe that innovation in prediction serving technologies will enable the next inflection point in AI, where the composition of models, feature stores, and custom logic in prediction pipelines will enable new skills and functionalities, often without the need to train new models.
In this article, we give an overview of recent work on designing a new prediction serving system built on top of a serverless framework that is designed to support the next generation of cloud technologies.
Why is Prediction Serving Harder Than Training?
Prediction serving is challenging because it combines the high-variance workload burstiness of online applications with extremely compute-intensive tasks. Online applications have notoriously bursty workloads, illustrated by Twitter’s usage graph from the 2010 World Cup. The challenge of bursty workloads is amplified by the fact that ML models require expensive infrastructure — most commonly, GPUs with potentially large memory requirements to run models. Running a cluster of just 10 GPUs can cost over $3,500 a month on AWS.
Worse yet, as a part of interactive applications, predictions often need to be made within tight latency bounds (~100ms). But predictions don’t come from individual models — like most modern microservice-style applications, prediction serving pipelines pass requests through multiple stages. For example, an image uploaded to a message board might first be preprocessed and normalized, passed through an NSFW detector, then passed to a classifier that generates alt-text for the image before being posted to the forum.
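To make the multi-stage structure concrete, here is a minimal sketch of the message-board example as three sequential stages with a latency check. Every function here is a hypothetical stand-in (simple arithmetic, not a real model), and the 100 ms budget is just the rough bound mentioned above.

```python
import time
from typing import List, Optional, Tuple

LATENCY_BUDGET_MS = 100.0  # assumption: the rough ~100ms bound from the text


def preprocess(pixels: List[float]) -> List[float]:
    """Stand-in for preprocessing: scale pixel values into [0, 1]."""
    peak = max(pixels) or 1.0  # avoid dividing by zero for an all-black image
    return [p / peak for p in pixels]


def looks_nsfw(pixels: List[float]) -> bool:
    """Stand-in for an NSFW detector (placeholder rule, not a real model)."""
    return sum(pixels) / len(pixels) > 0.9


def alt_text(pixels: List[float]) -> str:
    """Stand-in for an alt-text classifier."""
    return "bright image" if sum(pixels) / len(pixels) > 0.5 else "dark image"


def serve(pixels: List[float]) -> Tuple[Optional[str], float]:
    """Run all stages in order; return (caption or None, elapsed milliseconds)."""
    start = time.perf_counter()
    normalized = preprocess(pixels)
    # A flagged image is rejected and never reaches the classifier.
    caption = None if looks_nsfw(normalized) else alt_text(normalized)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return caption, elapsed_ms


caption, elapsed_ms = serve([10, 200, 30, 40])
within_budget = elapsed_ms <= LATENCY_BUDGET_MS
print(caption, within_budget)
```

Because the stages run back to back, the end-to-end latency is the sum of the stage latencies, which is why a single slow stage can blow the whole budget.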
Industrial systems like AWS SageMaker and AzureML have basic solutions for deploying prediction pipelines, but they have some significant gaps:
- Building pipelines is cumbersome and unfriendly to open source: if a pipeline’s models weren’t trained on AWS SageMaker, the user has to manually construct a Docker container for each model, along with a webserver that wraps the model and handles requests.
- Deploying pipelines often requires hacking around the systems’ supported APIs. None of these systems supports pipelines with parallelism, which is an important requirement for real-world pipelines executing multiple models. For a pipeline with, say, 3-way parallelism, the user would have to deploy three different SageMaker pipelines and build a separate proxy service to route requests correctly.
- Managing pipelines is difficult as the infrastructure for these systems is fixed-deployment, meaning resource allocation must be managed manually. This is particularly difficult for data scientists, many of whom might not be interested in or skilled at deploying and operating scalable online services.
To enable data scientists to be as effective as possible, prediction pipelines should be easy to build, easy to deploy, and easy to manage. Layering prediction pipelines on top of a serverless system like Cloudburst simplifies resource management — something we hinted at in our original post about Cloudburst. But there are still open questions on the other fronts.
Live Prediction Serving with Familiar Data Pipeline APIs
The systems community has long embraced dataflow as an effective way to construct multi-stage pipelines of operations, dating at least as far back as transaction processing systems in the 1980s and SEDA at the turn of the century. Better yet, these pipelines can be optimized with well-known techniques like operator fusion (merging multiple logical operators into a single physical one to reduce data movement). This makes dataflow a natural fit for prediction pipelines.
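As a rough illustration of operator fusion, the sketch below merges a chain of logical operators into one physical operator and counts how many times data has to be shipped between operators. The helper names (fuse, run_with_hops) are hypothetical, and pickle round-trips are only an assumed stand-in for moving data between machines; this is not how Cloudflow implements fusion.

```python
import pickle
from typing import Any, Callable, List, Tuple


def fuse(*stages: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Merge several single-input operators into one physical operator."""
    def fused(value: Any) -> Any:
        for stage in stages:
            value = stage(value)  # intermediate results stay in local memory
        return value
    fused.__name__ = "fused_" + "_".join(s.__name__ for s in stages)
    return fused


def run_with_hops(stages: List[Callable[[Any], Any]], value: Any) -> Tuple[Any, int]:
    """Run each operator as a separate 'hop', serializing its input first."""
    hops = 0
    for stage in stages:
        payload = pickle.dumps(value)  # simulate shipping data to the operator
        value = stage(pickle.loads(payload))
        hops += 1
    return value, hops


def scale(xs): return [x * 2 for x in xs]
def shift(xs): return [x + 1 for x in xs]
def total(xs): return sum(xs)


stages = [scale, shift, total]
unfused_result, unfused_hops = run_with_hops(stages, [1, 2, 3])
fused_result, fused_hops = run_with_hops([fuse(*stages)], [1, 2, 3])
print(unfused_result, unfused_hops, fused_result, fused_hops)
```

Both plans compute the same answer; the fused plan simply pays the data-movement cost once instead of once per operator.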
Cloudflow is a dataflow DSL for prediction serving built on top of Cloudburst. Users construct pipelines by chaining together familiar dataflow operators like filter.
The DSL treats user functions as black boxes — meaning it’s flexible enough to support arbitrary libraries and algorithms — while using the dataflow model to gain insight into the structure of the compute pipelines. Cloudflow automatically rewrites these pipelines to implement optimizations like operator fusion and competitive execution (executing multiple replicas of a model in parallel to reduce tail latency).
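Competitive execution can be sketched with nothing more than Python's standard concurrent.futures module: send the same request to several replicas and return whichever answer arrives first. The replica functions and their fixed delays below are hypothetical stand-ins for model replicas with variable latency; this is an illustration of the idea, not Cloudflow's or Cloudburst's implementation.

```python
import concurrent.futures as cf
import time
from typing import Any, Callable, Sequence


def competitive_execute(replicas: Sequence[Callable[[Any], Any]], request: Any) -> Any:
    """Send the request to every replica and return the first result to arrive."""
    executor = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [executor.submit(replica, request) for replica in replicas]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        # If the fastest replica raised, its exception propagates from result().
        return next(iter(done)).result()
    finally:
        # Do not block on the slower replicas. Running threads cannot be
        # interrupted, so they finish in the background and are ignored.
        # (cancel_futures requires Python 3.9 or newer.)
        executor.shutdown(wait=False, cancel_futures=True)


def make_replica(name: str, delay_s: float) -> Callable[[str], str]:
    """Hypothetical model replica whose latency is simulated with sleep()."""
    def replica(text: str) -> str:
        time.sleep(delay_s)
        return f"{name}:{text.upper()}"
    return replica


# Assumed delays: one fast replica and two stragglers.
replicas = [make_replica("r0", 0.5), make_replica("r1", 0.05), make_replica("r2", 0.4)]

start = time.perf_counter()
result = competitive_execute(replicas, "hello")
elapsed_s = time.perf_counter() - start
print(result, round(elapsed_s, 2))
```

The caller observes roughly the latency of the fastest replica, at the cost of doing the work several times. That trade is what makes the technique attractive for tail latency and expensive in compute.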
When the user calls deploy, the dataflow is automatically optimized and “compiled” into a Cloudburst DAG. From here on, requests can be made with calls to execute, and Cloudburst manages scaling and resource allocation. Our initial results are exciting: we’re able to outperform a system like AWS SageMaker by 2x on a real-world image processing pipeline and can meet latency requirements for extremely compute-intensive tasks like video-stream processing. Optimizations like competitive execution can yield noticeable improvements even for relatively simple pipelines, such as a neural machine translation example. A summary of our initial evaluation is shown below:
Meanwhile, deployment and management of a pipeline is far, far easier with Cloudflow than it is with other systems—upload a script and you’re ready to go, with the underlying serverless infrastructure taking care of autoscaling. On the other hand, running a pipeline on SageMaker requires building a Docker container, manually configuring instance types and cluster sizes, and setting up a proxy to access the service externally.
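The build, deploy, and execute lifecycle described in this post can be mimicked in a few lines of plain Python. The ToyFlow class below is entirely hypothetical: its method names borrow the words filter, deploy, and execute from this post, map is an added assumption, and none of it is Cloudflow's actual API. It only shows how chained operators over black-box functions can be "compiled" into a single plan before any request is served.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


class ToyFlow:
    """Hypothetical miniature dataflow; NOT Cloudflow's actual API."""

    def __init__(self) -> None:
        self._ops: List[Tuple[str, Callable[[Any], Any]]] = []
        self._plan: Optional[Callable[[Iterable[Any]], List[Any]]] = None

    def map(self, fn: Callable[[Any], Any]) -> "ToyFlow":
        self._ops.append(("map", fn))
        return self  # returning self is what allows operators to be chained

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyFlow":
        self._ops.append(("filter", predicate))
        return self

    def deploy(self) -> "ToyFlow":
        """'Compile' the logical operators into one callable plan."""
        ops = list(self._ops)  # snapshot: later edits do not change the plan

        def plan(rows: Iterable[Any]) -> List[Any]:
            out = []
            for row in rows:
                keep = True
                for kind, fn in ops:
                    if kind == "map":
                        row = fn(row)
                    elif not fn(row):  # a filter predicate rejected the row
                        keep = False
                        break
                if keep:
                    out.append(row)
            return out

        self._plan = plan
        return self

    def execute(self, rows: Iterable[Any]) -> List[Any]:
        if self._plan is None:
            raise RuntimeError("call deploy() before execute()")
        return self._plan(rows)


# User functions are black boxes: any callable works.
flow = (
    ToyFlow()
    .filter(lambda r: r["nsfw_score"] < 0.5)
    .map(lambda r: {"id": r["id"], "alt_text": f"image #{r['id']}"})
    .deploy()
)

requests: List[Dict[str, Any]] = [
    {"id": 1, "nsfw_score": 0.1},
    {"id": 2, "nsfw_score": 0.9},
]
results = flow.execute(requests)
print(results)
```

In a real system the compiled plan would be a DAG scheduled across machines rather than a local loop, but the user-facing shape (declare, deploy once, execute many times) is the same.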
Supporting prediction pipelines also required a number of extensions to Cloudburst, which we briefly list below. If you’re interested in learning more, check out our full paper. The changes to Cloudburst are all in the main repository, and you can find the Cloudflow code here.
- Added GPU support.
- Added execute-many-pick-one semantics for competitive execution.
- Added continuations, so pipelines with dynamic lookups could leverage Cloudburst’s locality-aware scheduling.
- Added batching support, especially for GPUs.
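To illustrate why batching matters, especially on GPUs where each model invocation carries fixed overhead, here is a minimal sketch that groups pending requests into fixed-size batches and calls the model once per batch. The function names, the batch size, and the toy model are all assumptions made for illustration; Cloudburst's actual batching logic is described in the paper and is not reproduced here.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def batched(requests: Sequence[Any], max_batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most max_batch_size requests."""
    for start in range(0, len(requests), max_batch_size):
        yield requests[start:start + max_batch_size]


def serve_in_batches(
    model: Callable[[Sequence[Any]], List[Any]],
    requests: Sequence[Any],
    max_batch_size: int = 4,
) -> Tuple[List[Any], int]:
    """Call the model once per batch; return (outputs in order, number of calls)."""
    outputs: List[Any] = []
    calls = 0
    for batch in batched(requests, max_batch_size):
        outputs.extend(model(batch))  # one invocation covers the whole batch
        calls += 1
    return outputs, calls


def toy_model(batch: Sequence[int]) -> List[int]:
    """Hypothetical batch-capable model: squares every input."""
    return [x * x for x in batch]


pending = list(range(10))
outputs, calls = serve_in_batches(toy_model, pending, max_batch_size=4)
print(outputs, calls)
```

Ten requests are served with three model calls instead of ten. A production batcher would also bound how long a request may wait for its batch to fill, since waiting trades latency for throughput.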
We’re really excited about the initial progress we’ve made with these dataflow-based prediction pipelines. Moving forward, we’d like to get a better sense of how data scientists interact with their existing infrastructure, and how serverless pipelines can help simplify their lives. We’re also interested in how folks are dealing with challenges like type checking and monitoring, and how those solutions integrate with compute infrastructure. If you have opinions or thoughts and would like to chat more, please reach out by email or on Twitter!
In addition to NSF CISE Expeditions Award CCF-1730628, this research is supported by gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk and VMware.