How to Scale Up Your FastAPI Application Using Ray Serve

Archit Kulkarni
Distributed Computing with Ray
5 min read · Dec 8, 2020

UPDATE 7/8/21: Ray Serve now has a seamless integration with FastAPI built in! Check it out on the documentation page: https://docs.ray.io/en/master/serve/http-servehandle.html#fastapi-http-deployments

UPDATE 11/9/21: The code samples below use an older Ray Serve API that has been removed in Ray 1.7+. Although the main ideas of the article are still valid, we recommend using the FastAPI integration linked above.

The Ray Serve and FastAPI logos joined by a plus sign

FastAPI is a high-performance, easy-to-use Python web framework, which makes it a popular way to serve machine learning models. In this blog post, we’ll scale up a FastAPI model serving application from one CPU to a 100+ CPU cluster, yielding a 60x improvement in the number of requests served per second. All of this will be done with just a few additional lines of code using Ray Serve.

Ray Serve is an infrastructure-agnostic, pure-Python toolkit for serving machine learning models at scale. Ray Serve runs on top of the distributed execution framework Ray, so we won’t need any distributed systems knowledge — things like scheduling, failure management, and interprocess communication will be handled automatically.

The structure of this post is as follows. First, we’ll run a simple model serving application on our laptop. Next, we’ll scale it up to multiple cores on our single machine. Finally, we’ll scale it up to a cluster of machines. All of this will be possible with only minor code changes. We’ll run a simple benchmark (for the source code, see the Appendix) to quantify our throughput gains.

To run everything in this post, you’ll need to have either PyTorch or TensorFlow installed, as well as three other packages which you can install with the following command in your terminal:

pip install transformers fastapi "ray[serve]"

Introduction

Here’s a simple FastAPI web server. It uses Hugging Face Transformers and OpenAI’s GPT-2 model to auto-generate text from a short initial prompt, and serves the result to the user.
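
The embedded code sample isn’t reproduced in this copy of the post, but a minimal sketch of such a server might look like the following (the file name main.py is assumed to match the uvicorn command below):

# main.py -- minimal sketch, not the post's original snippet
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load the GPT-2 text-generation pipeline once, when the module is imported
nlp_model = pipeline("text-generation", model="gpt2")

@app.get("/generate")
def generate(query: str):
    # Returns a list like [{"generated_text": "..."}]
    return nlp_model(query)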

Let’s test it out. We can start the server locally in the terminal like this:

uvicorn main:app --port 8080

Now in another terminal we can query our model:

curl "http://127.0.0.1:8080/generate?query=Hello%20friend%2C%20how"

The output should look something like this:

[{"generated_text":"Hello friend, how much do you know about the game? I've played the game a few hours, mostly online, and it's one of my favorites... and this morning I finally get to do some playtesting. It seems to have all the"}]

That’s it! This is already useful, but it turns out to be quite slow because the underlying model is large. On my laptop, each request takes about two seconds, so even when the server is hit with many requests, the throughput will only be about 0.5 queries per second.

Scaling up: Enter Ray Serve

Ray Serve will let us easily scale up our existing program to multiple CPUs and multiple machines. Ray Serve runs on top of a Ray cluster, which we’ll explain how to set up below.

First, let’s take a look at a version of the above program which still uses FastAPI, but offloads the computation to a Ray Serve backend with multiple replicas serving our model in parallel:
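
That code sample also isn’t reproduced here. A rough sketch of the idea, written against the pre-1.7 create_backend/create_endpoint/get_handle API mentioned in the update note above (exact signatures, and how handle arguments reach the backend, varied between Ray versions, so treat the details as assumptions rather than the post’s original code):

# main.py -- rough sketch using the old (pre-1.7) Ray Serve API
import ray
from fastapi import FastAPI
from ray import serve
from transformers import pipeline

app = FastAPI()

# Connect to the long-running Ray cluster (started with `ray start --head`)
ray.init(address="auto")
serve.start()

class GPT2:
    def __init__(self):
        # Each replica loads its own copy of the model
        self.nlp_model = pipeline("text-generation", model="gpt2")

    def __call__(self, request):
        # In the old handle API, the argument passed to handle.remote()
        # arrives on the request object (shown here as request.data)
        return self.nlp_model(request.data)

# One replica per CPU core we want to use for inference
serve.create_backend("gpt-2", GPT2, config={"num_replicas": 2})
serve.create_endpoint("generate", backend="gpt-2")
handle = serve.get_handle("generate")

@app.get("/generate")
async def generate(query: str):
    # Offload the heavy computation to a Serve replica and await the result
    return await handle.remote(query)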

Here we’ve simply wrapped our NLP model in a class, created a Ray Serve backend from that class, and set up a Ray Serve endpoint serving that backend. Ray Serve comes with an HTTP server out of the box, but rather than use it, we’ve kept our existing FastAPI /generate endpoint and given it a handle to the Ray Serve endpoint.

Here’s a diagram of our finished application:

Testing on a laptop

Before starting the app, we’ll need to start a long-running Ray cluster:

ray start --head

The laptop will be the head node, and there will be no other nodes. To use 2 cores, I’ll set num_replicas=2 in the code sample above (the relevant line is repeated below). Running the server exactly as before and saturating the backend with many concurrent requests (see the Appendix), we see that throughput increases from 0.53 queries per second to about 0.80 queries per second. Using 4 cores, this only improves to 0.85 queries per second, so it seems that with such a heavyweight model we have hit diminishing returns on our local machine.
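
With the sketch above, that corresponds to the replica count in the backend config (names assumed earlier):

# Set num_replicas to 2, or 4 for the 4-core experiment
serve.create_backend("gpt-2", GPT2, config={"num_replicas": 2})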

Running on a cluster

The really big parallelism gains will come when we move from our laptop to a dedicated multi-node cluster. Using the Anyscale beta, I started a Ray cluster in the cloud with a few clicks, but I could have also used the Ray cluster launcher. For this blog post, I set up a cluster with eight Amazon EC2 m5.4xlarge instances. That’s a total of 8 x 16 = 128 vCPUs.
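
However the cluster is launched, the shape is the same: one head node plus worker nodes that join it. With a manually managed cluster, that could look roughly like this (the head node address is a placeholder):

# On the head node:
ray start --head

# On each worker node, pointing at the head node:
ray start --address="<head-node-ip>:6379"

The application itself still just calls ray.init(address="auto") on whichever node it runs on.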

Here are the results measuring throughput in queries per second, for various settings of num_replicas:

That’s quite a bit of speedup! We would expect to see an even greater effect with more machines and more replicas.

Conclusion

Ray Serve makes it easy to scale up your existing machine learning serving applications by offloading your heavy computation to Ray Serve backends. Beyond the basics illustrated in this blog post, Ray Serve also supports batched requests and GPUs. The endpoint handle concept used in our example is fully general, and can be used to deploy arbitrarily composed ensemble models in pure Python.

For more about Ray Serve, check out the following links:

If you would like to know how Ray Serve is being used in industry, you can check out the Ray Summit page for more details.

Appendix: Quick-and-dirty benchmarking

To saturate all of our replicas with requests, we can use Python’s asyncio and aiohttp modules. Here’s one way of doing it, based on this example HTTP client:
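
Since the original snippet isn’t included in this copy, below is one possible version of such a client; the constants and the concurrency pattern are illustrative rather than the post’s exact benchmark.

# benchmark.py -- rough sketch of a saturating load test with asyncio + aiohttp
import asyncio
import time

import aiohttp

URL = "http://127.0.0.1:8080/generate?query=Hello%20friend%2C%20how"
NUM_REQUESTS = 100   # total requests to send
CONCURRENCY = 32     # maximum requests in flight at once

async def fetch(session, sem):
    # Limit concurrency so the backend stays saturated without opening
    # an unbounded number of connections
    async with sem:
        async with session.get(URL) as response:
            return await response.text()

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        start = time.time()
        await asyncio.gather(*(fetch(session, sem) for _ in range(NUM_REQUESTS)))
        elapsed = time.time() - start
        print(f"{NUM_REQUESTS / elapsed:.2f} queries per second")

if __name__ == "__main__":
    asyncio.run(main())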
