Machine Learning Serving is Broken

And How Ray Serve Can Fix it

Simon Mo
Distributed Computing with Ray
Jul 20, 2020


NOTE: As of Ray 1.4, Ray Serve has a new and improved API focused on the concept of "Deployments" (replacing "Backends" in this post). This change was made to make use cases like the one in this blog post easier than ever! Please see the latest documentation for details.

TL;DR: The go-to solution for ML serving is to wrap your model inside a Flask microservice. When that’s not enough, people turn to specialized solutions like TensorFlow Serving or TorchServe. However, both of these approaches are inadequate:

  • Flask is simple but becomes unmanageable when scaling with multiple models.
  • Specialized solutions like TensorFlow Serving and TorchServe can scale but add significant complexity because they must be used in conjunction with traditional web servers.

To address these issues, we released Ray Serve. Ray Serve is a new model serving framework we are building to simplify the process of serving machine learning in production.

Where Serving Comes In

You are a data scientist or ML engineer who just trained a state-of-the-art model for your company. You show the accuracy and ROC curve to your boss and she is absolutely thrilled about the new model. “Great!” she says, “now can we put this in production?”

ML serving is the last piece of the ML lifecycle. Models have no value unless you deploy them.

You know your technical stack well: scikit-learn, pandas, NumPy, and other libraries, but none of these cover the process of deploying a machine learning model in production.

“Well, can our client query the model in real time? Why don’t you figure that out!”

Challenge accepted. You search for how to enable users to interact with the model in real time. If you are using Python, the most popular result screams “just wrap your model in Flask.”

Wrap Your Model in Flask

On the surface, serving a machine learning model should be exactly like serving some web function that queries a database. After all, a web request comes in, it gets sent to some black box (the machine learning model or the database), the server gets the result from the black box, and then sends out a web response. It should be simple!

Web serving and ML serving look alike on the surface.

Unfortunately, under the surface they are drastically different. ML models are notoriously compute intensive. They often have response times in the tens of milliseconds or higher. A single query can use 100% of a CPU or GPU and a lot of memory. While a single web server accessing a database can handle 10,000+ queries per second, a web server serving machine learning models may achieve only 10–100 queries per second. Furthermore, models are generally loaded once per serving process. If the state is corrupted, the entire server process goes down.
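As a rough back-of-the-envelope sketch of what these numbers imply, the snippet below estimates how many model replicas a target load would need. It assumes each replica handles one query at a time and ignores batching and queueing; the function name `replicas_needed` and the example figures are illustrative assumptions, not measurements.

```python
import math


def replicas_needed(target_qps: float, latency_ms: float) -> int:
    """Estimate replicas needed, assuming one in-flight query per replica."""
    # A replica that takes latency_ms per query serves 1000 / latency_ms queries/s.
    per_replica_qps = 1000.0 / latency_ms
    return math.ceil(target_qps / per_replica_qps)


# Hypothetical example: a model with 50 ms latency serves about 20 queries/s
# per replica, so 1,000 queries/s needs 50 replicas.
print(replicas_needed(target_qps=1000, latency_ms=50))
```

The same load that one database-backed web server could absorb ends up spread over dozens of model processes, which is where the scaling questions come from.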

Web serving and ML are drastically different under the surface.

Because ML models have low throughput compared to web servers, they require scaling out in real deployments. But this quickly poses its own challenges. How many instances of each model do we need? How do we route requests to the correct instance?

Compared to web servers, model servers have challenges in service discovery and load balancing.

You might ask two questions:

  • Why can’t we just initialize all the models in each serving process? Unfortunately, ML models are memory intensive and they typically need to load 50–2000 MB of numeric weights into memory before serving. Putting all the models inside a single serving process and replicating that process is not an option.
  • What about using an existing tool to determine which servers to talk to (like Consul, load balancers, service meshes)? This approach works when operating several models but quickly becomes unmanageable for a single team with hundreds of models. After all, these tools are built for microservices that each consist of several HTTP endpoints. In the model serving scenario, each microservice corresponds to only one model. A typical development team can manage tens of microservices but not hundreds of them without a large dedicated ops team.
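To put the first point in numbers, here is a small sketch of the memory cost of loading every model into every serving process. The model count, model size, and process count are hypothetical values chosen from within the ranges mentioned above.

```python
def memory_per_process_gb(model_sizes_mb):
    """Memory one process needs if it loads every model (weights only)."""
    return sum(model_sizes_mb) / 1024


def total_memory_gb(model_sizes_mb, num_processes):
    """Memory across all replicas when each process holds every model."""
    return memory_per_process_gb(model_sizes_mb) * num_processes


# Hypothetical fleet: 100 models at 512 MB each, replicated across 8 processes.
sizes_mb = [512] * 100
print(memory_per_process_gb(sizes_mb))  # 50.0 GB in every single process
print(total_memory_gb(sizes_mb, 8))     # 400.0 GB across the 8 processes
```

Most of that memory would hold copies of models that a given process rarely serves, which is why each model needs to be placed and scaled on its own.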

Web servers do have one benefit: they are beginner friendly and have a universal API, HTTP. When data scientists write a Flask wrapper for their models, they own the end-to-end request handling flow. The data scientist defines the API schema and the best way to handle the request. For example, for an image classification application, the data scientist has the liberty to take in either raw image bytes, or N-dimensional arrays, or any preferred input schema.

The Rise of Specialized Systems

Because wrapping your models in traditional web servers is not enough, cloud vendors and framework developers have been working on specialized model servers.

Specialized systems move the computation off the web servers.

The specialized model servers package existing models and serve them in their own APIs. This approach allows traditional web servers to interact with models in the same way that they interact with databases, that is, through HTTP connections to a specialized model server.

Specialized systems help model serving look more like traditional web serving.

However, these model servers are not flexible. Developers are constrained by the specialized server’s API and often these APIs are “tensor-in, tensor-out.” That is, the developer must implement input and output transformation logic turning the user’s HTTP request into an N-dimensional array or another specific schema before sending it to the model server.
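The sketch below illustrates the kind of glue code a “tensor-in, tensor-out” API pushes into the web server. It is not any particular model server's API: the JSON fields (`image_b64`, `width`, `height`) and both function names are assumptions made up for this example, and nested Python lists stand in for an N-dimensional array.

```python
import base64
import json


def request_to_tensor(body: bytes):
    """Turn a JSON request with base64 grayscale pixels into a 2-D array."""
    payload = json.loads(body)
    raw = base64.b64decode(payload["image_b64"])
    width, height = payload["width"], payload["height"]
    if len(raw) != width * height:
        raise ValueError("image size does not match width * height")
    # Scale each byte to [0, 1] and lay the pixels out row by row.
    return [[raw[r * width + c] / 255.0 for c in range(width)]
            for r in range(height)]


def tensor_to_response(scores, labels) -> str:
    """Turn the model server's raw scores back into a user-facing response."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return json.dumps({"label": labels[best], "score": scores[best]})
```

Both functions live in the web server, while the call that sits between them goes over the network to the model server, so one logical prediction is split across two services.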

When specialized systems are adopted, request handling and inference are split between two services.

Additionally, by adopting model servers, data scientists now have two problems: managing the model servers, and managing a web server that handles all the business logic. Even though the model servers might be managed by an infrastructure team, the data scientists still need to think about the featurization in the web server and the model logic in the model servers. This is a lot of cognitive load for the data scientists.

Furthermore, each model server comes with its own drawbacks:

  • Framework lock-in: TensorFlow Serving and TorchServe are specialized systems for each framework. Modern data science teams use the best tool at hand; this can mean a mix of XGBoost, scikit-learn, and deep learning frameworks. We need a system that can serve arbitrary Python models.
  • Vendor lock-in: AWS SageMaker and the other cloud providers offer hosted ML serving solutions that wrap your models and deploy them for you. However, these hosted solutions don’t share a unified API. We need vendor-neutral solutions that avoid cloud vendor lock-in.
  • Training & serving divergence: There are other solutions that take a trained model and convert it to another format for serving, like ONNX, PMML, and NVIDIA TensorRT. But data scientists want to serve their models in the same framework that was used for training to avoid unknown bugs.

Is there a Better Way?

Ray Serve is a new model server that solves these issues by giving the data scientist end-to-end control over the request lifecycle while letting each model scale independently. Ray Serve does this by leveraging Ray, an open source framework for building distributed systems. Here’s the interface you need to write for the model:

Define and run your service in 5 lines of code.

If you need to load a model and configure it ahead of time, give Ray Serve a class:

Loading a PyTorch model in class initialization and serving requests.
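The class pattern can be illustrated in plain Python, without Ray or PyTorch. This is only a sketch of the idea (load once in `__init__`, handle each request in `__call__`) and not the Ray Serve API; `SentimentModel` and its word-list "weights" are hypothetical stand-ins for a real model.

```python
class SentimentModel:
    """Hypothetical stand-in for a model that is expensive to load."""

    def __init__(self, positive_words):
        # Expensive setup (for a real model, loading weights) runs once
        # per replica rather than once per request.
        self.positive_words = set(positive_words)
        self.requests_served = 0

    def __call__(self, text: str) -> dict:
        # Per-request work reuses the state prepared in __init__.
        self.requests_served += 1
        tokens = text.lower().split()
        hits = sum(token in self.positive_words for token in tokens)
        return {"positive": hits > 0, "hits": hits}


model = SentimentModel(["great", "good"])
print(model("This is great"))  # {'positive': True, 'hits': 1}
```

Because the instance keeps its state between calls, a serving framework can create one instance per replica and route many requests to it.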

Ray Serve’s Python API is framework agnostic and enables you to use the same framework you trained your model in. Ray Serve runs on top of Ray; Ray is vendor neutral and it has built-in support for deployment in public clouds (AWS, GCP, Azure), Kubernetes, YARN, SLURM, as well as on-premise machines.

It is easier than ever to scale your machine learning model with Ray Serve. Because Ray lets you program the cluster like a single machine, Ray can automatically schedule your model across many machines:

Ray Serve helps you scale out your model services and increase throughput with GPU and batching
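To show why batching raises throughput, here is a minimal sketch in plain Python. It is not Ray Serve's batching implementation: `serve_in_batches` and `double_all` are hypothetical names, and the "model" simply doubles its inputs. The point is that the model is invoked once per batch instead of once per request, which is what lets a GPU amortize its per-call overhead.

```python
from typing import Callable, List, Sequence


def serve_in_batches(requests: Sequence, model_fn: Callable[[Sequence], List],
                     max_batch_size: int) -> List:
    """Group pending requests and call the model once per group."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    results: List = []
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        # One vectorized call handles the whole batch.
        results.extend(model_fn(batch))
    return results


batch_sizes = []


def double_all(batch):
    """Toy vectorized model that records how large each batch was."""
    batch_sizes.append(len(batch))
    return [x * 2 for x in batch]


outputs = serve_in_batches(list(range(10)), double_all, max_batch_size=4)
print(batch_sizes)  # [4, 4, 2]: three model calls instead of ten
```

Results come back in request order, so callers do not need to know that their request was grouped with others.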

Ray Serve also has features to address production challenges like rolling updates, zero-downtime deployment, and model composition. In case your model is network bound, Ray Serve also supports async Python. You can check out this blog post that shows end-to-end examples of deploying a Hugging Face NLP model.

How are users using Ray Serve today?

  • A computer vision startup is using Ray Serve to serve deep learning based computer vision models. The teams of data scientists are iterating quickly without an ops team. Ray Serve helps them to quickly deploy and scale their predictions.
  • The data science team at an e-commerce site is using Ray Serve to gain full control of the models from development to deployment. Ray Serve helps them gain visibility into the serving process and reduce costs by batching on GPUs. Additionally, machine learning models are typically not deployed individually. Ray Serve enables them to easily compose models together.
  • Your use case! Reach out to us on GitHub, Slack, and email. We are eager to help you address your ML serving problem with Ray Serve.

To Learn More