Why we switched from Flask to FastAPI for production machine learning

The most popular tool isn’t always the best

Caleb Kaiser
5 min read · Jun 11, 2020

To productionize a machine learning model, the standard approach is to wrap it in a REST API and deploy it as a microservice. Flask is currently the de facto choice for writing these APIs for a couple of reasons:

  • Flask is minimal. Because inference APIs have historically tended to be simple wrappers around a predict() method, the complexities introduced by more opinionated frameworks (like Django) have been seen as unnecessary.
  • Flask is written in Python, which is the standard language of machine learning. All major frameworks have Python bindings, and virtually all data scientists/machine learning engineers are familiar with it.

When we were selecting a framework to use under the hood of Cortex, our open source model serving platform, we picked Flask for these same reasons. However, with the release of version 0.14, we switched from Flask to FastAPI.

Several releases later, we’re very pleased with our decision.

Below, I’ve gone in depth on the core reasons we switched from Flask to FastAPI. If you’re curious about using something other than Flask for building your inference APIs, this will hopefully provide some useful context.

Note: If you are unfamiliar, FastAPI is a Python API microframework built on top of Starlette and Uvicorn.
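For readers who haven't used it, here is a minimal sketch of a FastAPI app; the file name, route, and port below are just illustrative:

```python
# main.py -- a minimal FastAPI app (illustrative).
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def health_check():
    # A trivial route; a real inference API would expose a predict endpoint.
    return {"status": "ok"}

# Serve it with Uvicorn, the ASGI server FastAPI is designed to run on:
#   uvicorn main:app --host 0.0.0.0 --port 8080
```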

1. ML inference benefits from native async support

We initially began looking for alternatives to Flask because of issues we were running into with autoscaling.

Rearchitecting autoscaling within Cortex is its own story, but the high-level summary is that Cortex used to autoscale by measuring CPU utilization, whereas it now autoscales according to the number of incoming requests an API has.

In order for this to work, Cortex needs to be able to asynchronously count queued and in-flight requests, and Flask—being designed for WSGI servers like Gunicorn—doesn’t have native async support.

The easiest solution, for us, was to switch to a framework with native async support. Being built on top of Uvicorn, an ASGI server, FastAPI made it easy to run an async event loop that counts incoming requests.
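As a rough sketch of what this looks like, assuming a simple in-process counter (the middleware, counter, and metrics endpoint below are illustrative, not Cortex's actual implementation):

```python
from fastapi import FastAPI, Request

app = FastAPI()

# Illustrative in-flight request counter. Because everything below runs on a
# single asyncio event loop, a plain integer is safe to increment here.
in_flight_requests = 0

@app.middleware("http")
async def count_requests(request: Request, call_next):
    global in_flight_requests
    in_flight_requests += 1
    try:
        return await call_next(request)
    finally:
        in_flight_requests -= 1

@app.get("/metrics")
async def metrics():
    # An autoscaler could poll an endpoint like this to decide when to scale.
    return {"in_flight_requests": in_flight_requests}
```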

But even beyond solving our autoscaler issues, async support has enabled us to begin working on more complex inference features.

For example, Cortex allows users to write their own request handling code using a Predictor interface. The interface is a Python class that provides methods for initializing a model file and generating predictions:
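Here is a sketch of what that interface looks like; the method names and the pickle-based loading below are illustrative rather than a copy of Cortex's actual Predictor class:

```python
import pickle

class PythonPredictor:
    def __init__(self, config):
        # Runs once at startup: load the model file into memory.
        # `config` is a dict of user-supplied settings (illustrative).
        with open(config["model_path"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, payload):
        # Runs once per request: turn the incoming payload into a prediction.
        return self.model.predict(payload)
```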

But some users need to include operations beyond generating predictions in their predict() method—saving a file from S3, logging predictions to an external service, etc.

Ideally, these tasks wouldn’t run in predict(), as they add to inference latency. We’re currently working on implementing pre- and post-predict hooks, which will allow users to asynchronously run operations that don’t need to block the actual inference for the request.
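As a rough illustration of the idea, FastAPI's background tasks already make this pattern straightforward; the hook and predictor below are hypothetical stand-ins, not Cortex's planned API:

```python
import asyncio
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

async def log_prediction(payload, prediction):
    # Hypothetical post-predict hook, e.g. shipping the prediction to an
    # external monitoring service. Placeholder I/O shown here.
    await asyncio.sleep(0)

def run_inference(payload):
    # Placeholder for the user's actual predict() implementation.
    return {"label": "positive"}

@app.post("/predict")
async def predict(payload: dict, background_tasks: BackgroundTasks):
    prediction = run_inference(payload)
    # The hook runs after the response has been sent, so it does not add to
    # the latency of the prediction itself.
    background_tasks.add_task(log_prediction, payload, prediction)
    return prediction
```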

2. Improved latency is a huge deal for inference

Latency and throughput are always important, but in production machine learning, their importance is amplified.

For example, if Uber’s ETA prediction is a few seconds late on your location or on traffic data, its utility decreases significantly. Similarly, if Gmail’s Smart Compose suggests text slower than you type, the feature has little value.

Because of this, every improvement we can make in overall latency and throughput is valuable, even seemingly minor ones.

FastAPI, as its name suggests, is one of the fastest Python frameworks, outperforming Flask by over 300%:

[Benchmark chart. Source: Web Framework Benchmarks]

For most deployments, the speed of the underlying framework is not the largest factor in determining inference latency. However, when you consider the cost of improving latency, it is clear that any improvement is valuable.

For example, Smart Compose needs to serve predictions at under 100ms. Even after designing a model specifically for faster predictions, the team couldn’t hit this threshold. They had to deploy on cloud TPUs—the smallest of which is $4.50/hour on-demand—in order to get latency under 100ms.

In that context, improving the speed of the underlying framework can have a large benefit. Even a small decrease in latency can prevent a team from needing more expensive hardware.

3. FastAPI is easy to switch to—by design

There are other frameworks faster than Flask that have native async support. Our decision to choose FastAPI over the rest of them, while still largely motivated by its technical advantages, was heavily influenced by its low switching cost.

For context, we’re a small team. Cortex has an open source community with amazing contributors, but only four of us work on it full time. Because our engineering hours are precious, switching costs are a major consideration for us anytime we consider a change.

One of FastAPI’s selling points is that its syntax is, by design, very similar to Flask’s. For example, this is a snippet of routing code from Cortex v0.13, when it was built on Flask:
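(The snippet below is a representative reconstruction rather than the actual Cortex v0.13 source, but it shows the shape of Flask routing code.)

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class DummyPredictor:
    # Stand-in for a loaded model; purely illustrative.
    def predict(self, payload):
        return {"label": "positive"}

predictor = DummyPredictor()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    return jsonify(predictor.predict(payload))
```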

And here is the equivalent code in v0.14, when we first transitioned to FastAPI:
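(Again, a representative reconstruction rather than the actual v0.14 source; the near-identical decorator and handler structure is the point.)

```python
from fastapi import FastAPI

app = FastAPI()

class DummyPredictor:
    # Stand-in for a loaded model; purely illustrative.
    def predict(self, payload):
        return {"label": "positive"}

predictor = DummyPredictor()

@app.post("/predict")
def predict(payload: dict):
    return predictor.predict(payload)
```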

The initial transition from Flask to FastAPI required surprisingly little rewriting (ignoring features like autoscaling, for which we were introducing new designs).

Obviously, if another framework offered a dramatic performance advantage over FastAPI, we wouldn’t have selected FastAPI solely because of its ease of adoption. But with no framework being significantly faster than FastAPI, its ease of adoption was one more reason to choose it over the others.

Balancing minimalism and maturity in production machine learning

Another interesting thing that we’ve seen since switching to FastAPI is that some of the features we initially wrote off as “nice to haves”—data validation, improved error handling, etc.—have actually proven to be valuable to our users.

This, in my opinion, reflects a broader trend within ML.

In the past, there weren’t many teams deploying models as real-time production APIs. For most data science teams, Flask was “good enough” in that it was popular, minimal, and written in Python.

But production ML, as a field, has matured. It’s increasingly common for companies to have at least one model in production. As more teams deploy models, the conversation around tooling has shifted from “What gets the job done?” to “What does it take to deploy a model at production scale?”

This maturation is the same reason we built Cortex in the first place. Years ago, data science teams could get by kludging together a “good enough” deployment process. As the field has matured, however, real infrastructure features—rolling updates, autoscaling, prediction monitoring, etc.—have gone from being “nice to haves” to being essential.

The models teams are deploying are getting bigger. The applications they’re building are more complex. The traffic these models are handling is increasing. With all of these challenges, the definition of a “good enough” solution is changing, and mature tooling is becoming essential.
