How to Scale Up Your FastAPI Application Using Ray Serve

Archit Kulkarni
Dec 8, 2020 · 5 min read

UPDATE 7/8/21: Ray Serve now has a seamless integration with FastAPI built in! Check it out on the documentation page: https://docs.ray.io/en/master/serve/http-servehandle.html#fastapi-http-deployments

UPDATE 11/9/21: The code samples below use an older Ray Serve API that has been removed in Ray 1.7+. Although the main ideas of the article are still valid, we recommend using the FastAPI integration linked above.

The Ray Serve and FastAPI logos joined by a plus sign

FastAPI is a high-performance, easy-to-use Python web framework, which makes it a popular way to serve machine learning models. In this blog post, we’ll scale up a FastAPI model serving application from one CPU to a 100+ CPU cluster, yielding a 60x improvement in the number of requests served per second. All of this will be done with just a few additional lines of code using Ray Serve.

Ray Serve is an infrastructure-agnostic, pure-Python toolkit for serving machine learning models at scale. Ray Serve runs on top of the distributed execution framework Ray, so we won’t need any distributed systems knowledge — things like scheduling, failure management, and interprocess communication will be handled automatically.

The structure of this post is as follows. First, we’ll run a simple model serving application on our laptop. Next, we’ll scale it up to multiple cores on our single machine. Finally, we’ll scale it up to a cluster of machines. All of this will be possible with only minor code changes. We’ll run a simple benchmark (for the source code, see the Appendix) to quantify our throughput gains.

To run everything in this post, you’ll need to have either PyTorch or TensorFlow installed, as well as three other packages which you can install with the following command in your terminal:

pip install transformers fastapi "ray[serve]"

Introduction
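
Our example application serves a Hugging Face Transformers text-generation model behind a single FastAPI endpoint. The code sample originally embedded here isn't reproduced, but a minimal sketch consistent with the commands and output below might look like this, saved as main.py (the GPT-2 model choice and max_length value are assumptions):

# main.py
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load the text-generation model once, when the server starts.
nlp_model = pipeline("text-generation", model="gpt2")

@app.get("/generate")
def generate(query: str):
    # Run the model on the query string and return the generated text.
    return nlp_model(query, max_length=50)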

Let’s test it out. We can start the server locally in the terminal like this:

uvicorn main:app --port 8080

Now in another terminal we can query our model:

curl "http://127.0.0.1:8080/generate?query=Hello%20friend%2C%20how"

The output should look something like this:

[{"generated_text":"Hello friend, how much do you know about the game? I've played the game a few hours, mostly online, and it's one of my favorites... and this morning I finally get to do some playtesting. It seems to have all the"}]%

That’s it! This is already useful, but it turns out to be quite slow because the underlying model is very large. On my laptop, each request takes about two seconds, so even when the server is hit with many requests, throughput is only about 0.5 queries per second.

Scaling up: Enter Ray Serve

First, let’s take a look at a version of the above program which still uses FastAPI, but offloads the computation to a Ray Serve backend with multiple replicas serving our model in parallel:
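
The original embedded code isn't reproduced here; the sketch below uses the pre-1.7 create_backend/create_endpoint API mentioned in the update at the top of this post, so the exact details, in particular how the backend unpacks the request object, may differ from the original sample.

# main.py
import ray
from fastapi import FastAPI
from ray import serve
from transformers import pipeline

app = FastAPI()
serve_handle = None

class GPT2:
    def __init__(self):
        # Each replica loads its own copy of the model.
        self.nlp_model = pipeline("text-generation", model="gpt2")

    def __call__(self, request):
        return self.nlp_model(request.data, max_length=50)

@app.on_event("startup")
async def startup_event():
    ray.init(address="auto")  # connect to the running Ray cluster (started with ray start, see below)
    client = serve.start()    # start Ray Serve on that cluster

    # num_replicas controls how many copies of the model serve requests in parallel.
    client.create_backend("gpt-2", GPT2, config={"num_replicas": 2})
    client.create_endpoint("generate", backend="gpt-2")

    global serve_handle
    serve_handle = client.get_handle("generate")

@app.get("/generate")
async def generate(query: str):
    # Forward the request to the Ray Serve endpoint and await the result.
    return await serve_handle.remote(query)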

Here we’ve simply wrapped our NLP model in a class, created a Ray Serve backend from the class, and set up a Ray Serve endpoint serving that backend. Ray Serve comes with an HTTP server out of the box, but rather than use that, we’ve taken our existing FastAPI /generate endpoint and given it a handle to our Ray Serve endpoint.

Here’s a diagram of our finished application:

Testing on a laptop

First, let’s start a Ray cluster locally in the terminal:

ray start --head

The laptop will be the head node, and there will be no other nodes. To use 2 cores, I’ll set num_replicas=2 in the code sample above. Running the server exactly as before and saturating the backend with many concurrent requests (see the Appendix), we see that throughput increases from 0.53 queries per second to about 0.80 queries per second. Using 4 cores, this only improves to 0.85 queries per second; it seems that with such a heavyweight model, we hit diminishing returns on our local machine.
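
With the pre-1.7 API sketched above, changing the number of replicas is a one-line change to the backend config, for example:

client.create_backend("gpt-2", GPT2, config={"num_replicas": 4})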

Running on a cluster
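
On a cluster the setup is the same, except that additional machines join the Ray cluster started on the head node, and we keep running the FastAPI server with uvicorn on the head node exactly as before. The original cluster-launch steps aren’t reproduced here; one way to do it by hand (the address below is a placeholder) is:

# On the head node
ray start --head --port=6379

# On each worker node
ray start --address=<head-node-ip>:6379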

Here are the results, measuring throughput in queries per second for various settings of num_replicas:

That’s quite a bit of speedup! We would expect to see an even greater effect with more machines and more replicas.

Conclusion

For more about Ray Serve, check out the following links:

If you would like to know how Ray Serve is being used in industry, you can check out the Ray Summit page for more details.

Appendix: Quick-and-dirty benchmarking
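
The original benchmarking script isn’t reproduced here; a quick-and-dirty sketch that saturates the /generate endpoint with concurrent requests and reports queries per second might look like this (it additionally requires the requests package):

# benchmark.py
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8080/generate?query=Hello%20friend%2C%20how"
NUM_REQUESTS = 100
CONCURRENCY = 32  # keep enough requests in flight to saturate all replicas

def send_request(_):
    requests.get(URL).raise_for_status()

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    # pool.map is lazy; wrap it in list() so we wait for every request to finish.
    list(pool.map(send_request, range(NUM_REQUESTS)))
elapsed = time.time() - start

print(f"Throughput: {NUM_REQUESTS / elapsed:.2f} queries per second")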
