Real-Time Prediction Serving, Simplified

Prediction serving infrastructure today is difficult to use and difficult to manage. Cloudflow is a dataflow DSL built on top of the Cloudburst stateful FaaS platform that enables users to easily construct, deploy, and scale prediction serving pipelines. These pipelines are constructed from familiar dataflow operators like map and filter. Cloudflow automatically optimizes pipelines using techniques like operator fusion and competitive execution — combined with the serverless infrastructure, this allows Cloudflow to significantly outperform state-of-the-art prediction serving systems while simplifying the process of building pipelines. You can find the full paper here and the code here.
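To make this concrete, here is a self-contained toy sketch of the dataflow style described above. The Dataflow class, its chaining API, and the dummy "model" are our own stand-ins, not Cloudflow’s actual API:

```python
from typing import Any, Callable, List

class Dataflow:
    """Toy stand-in for a Cloudflow-style pipeline: chain operators,
    then execute them in order on each request."""
    def __init__(self) -> None:
        self.ops: List[Callable[[Any], Any]] = []

    def map(self, fn: Callable[[Any], Any]) -> "Dataflow":
        self.ops.append(fn)
        return self  # enable fluent chaining

    def run(self, request: Any) -> Any:
        result = request
        for op in self.ops:
            result = op(result)
        return result

# A two-stage pipeline: normalize the input, then apply a dummy "model".
flow = (Dataflow()
        .map(lambda pixels: [p / 255.0 for p in pixels])             # preprocess
        .map(lambda pixels: "cat" if sum(pixels) > 1.5 else "dog"))  # predict

print(flow.run([120, 200, 255]))  # -> "cat"
```

In the real system, a flow like this is deployed onto Cloudburst, which takes care of scheduling and scaling the operators.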

Background

The spread of machine learning over the last decade has been fueled by tools that make it easier, faster, and cheaper to construct models at large scales: TensorFlow, PyTorch, scikit-learn, and so on. Indeed, the simplicity and power of these systems are partly responsible for the recent explosion in AI and the rapid improvements in model accuracy.

Why Is Prediction Serving Harder Than Training?

Prediction serving is challenging because it combines the bursty, high-variance workloads of online applications with extremely compute-intensive tasks. Online applications have notoriously bursty traffic; a famous illustration is the spike in Twitter usage during the 2010 World Cup. That burstiness is amplified by the fact that ML models require expensive infrastructure: most commonly GPUs, often with large memory requirements. Running a cluster of just 10 GPUs can cost over $3,500 a month on AWS.
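For a rough sense of where that figure comes from, here is the arithmetic, assuming roughly $0.50 per GPU instance-hour (our assumption, in the range of AWS’s cheaper GPU instances; actual prices vary by instance type and region):

```python
# Back-of-the-envelope GPU cluster cost under an assumed hourly rate.
gpus = 10
hourly_rate = 0.50          # USD per GPU instance-hour (assumed)
hours_per_month = 24 * 30   # ~720 hours
monthly_cost = gpus * hourly_rate * hours_per_month
print(f"${monthly_cost:,.0f}/month")  # -> $3,600/month
```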

Existing prediction serving systems like AWS Sagemaker make it easy to deploy a single model, but real pipelines run into two problems:

  • Deploying pipelines often requires hacking around the systems’ supported APIs: none of these systems supports pipelines with parallelism, an important requirement for real-world pipelines that execute multiple models. A pipeline with, say, 3-way parallelism would have to be deployed as three separate Sagemaker pipelines, plus a hand-built proxy service to route requests correctly (see the sketch after this list).
  • Managing pipelines is difficult because these systems are fixed-deployment: resource allocation must be managed manually. This is particularly hard for data scientists, many of whom are neither interested in nor practiced at deploying and operating scalable online services.
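For concreteness, here is a local toy of the fan-out/fan-in pattern that such a hand-built proxy has to implement; the three "models" and the max-score combiner are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# Three placeholder "models"; in a real pipeline these are model invocations.
def model_a(x): return x * 2
def model_b(x): return x + 10
def model_c(x): return x ** 2

def ensemble(request):
    # Fan out: send the request to all three models in parallel.
    with ThreadPoolExecutor(max_workers=3) as pool:
        scores = list(pool.map(lambda m: m(request), (model_a, model_b, model_c)))
    # Fan in: combine the results (here, simply take the max score).
    return max(scores)

print(ensemble(5))  # -> 25
```

A dataflow API can express this fan-out and fan-in inside the pipeline itself, which is exactly the gap these fixed APIs leave to the user.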

Live Prediction Serving with Familiar Data Pipeline APIs

The systems community has long embraced dataflow as an effective way to construct multi-stage pipelines of operations, dating at least as far back as transaction processing systems in the 1980s and SEDA at the turn of the century. Better yet, these pipelines can be optimized with well-known techniques like operator fusion (merging multiple logical operators into a single physical one to reduce data movement). This makes dataflow a natural fit for prediction pipelines.
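As a toy illustration of operator fusion (our own example, not Cloudflow code), two logical map operators can be compiled into a single physical one, so the intermediate value never crosses an operator boundary:

```python
from typing import Any, Callable

def fuse(*fns: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose a chain of map functions into one physical operator,
    so intermediate results stay in local memory instead of being
    serialized and shipped between operators."""
    def fused(x: Any) -> Any:
        for fn in fns:
            x = fn(x)
        return x
    return fused

normalize = lambda x: x / 255.0   # logical operator 1
threshold = lambda x: x > 0.5     # logical operator 2

# Logically: .map(normalize).map(threshold)
# Physically, after fusion: a single operator that does both.
op = fuse(normalize, threshold)
print(op(200))  # -> True (200 / 255 ≈ 0.78 > 0.5)
```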

To make these pipelines fast in practice, we extended the dataflow layer and Cloudburst in a few key ways:

  • Execute-many-pick-one semantics for competitive execution, where replicas of a variable-latency operator run in parallel and the first result wins (sketched below).
  • Continuations, so pipelines with dynamic lookups can leverage Cloudburst’s locality-aware scheduling.
  • Batching support, which is especially important for making efficient use of GPUs.
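Here is a local toy of execute-many-pick-one semantics (not Cloudburst’s implementation; it just shows the idea of letting the fastest of several replicas win):

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def flaky_model(x):
    # Simulate a replica with variable latency (e.g., a straggling worker).
    time.sleep(random.uniform(0.01, 0.5))
    return x * 2

def execute_many_pick_one(fn, x, replicas=3):
    # Launch several replicas of the same operator; return whichever
    # result arrives first and let the stragglers finish in the background.
    pool = ThreadPoolExecutor(max_workers=replicas)
    futures = [pool.submit(fn, x) for _ in range(replicas)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # don't block on the slower replicas
    return done.pop().result()

print(execute_many_pick_one(flaky_model, 21))  # -> 42, at the latency of the fastest replica
```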

Looking Forward

We’re really excited about the initial progress we’ve made with these dataflow-based prediction pipelines. Moving forward, we’d like to get a better sense of how data scientists interact with their existing infrastructure, and how serverless pipelines can help simplify their lives. Beyond what we’ve talked about here, we’re also interested in how folks are dealing with challenges like type checking and monitoring, and how those solutions integrate with compute infrastructure. If you have opinions or thoughts and would like to chat more, please reach out by email or on Twitter!
