Beyond “Hello World”: Modern, Asynchronous Python in Kubernetes

Deploying Scalable, Production-Ready Web-Services in Python 3 on Kubernetes

Sean Stewart
Jul 22, 2019 · 7 min read

Python has undergone something of an evolution in the past few years. From Python 3.4 to 3.7 we have seen the introduction of asyncio, the introduction and formalization of async/await keywords, and re-investment in asyncio performance. Writing asynchronous code in Python has never been easier, more performant, or more efficient.

In addition to the improvements to the stdlib, Python’s Open-Source community has entered something of a Renaissance as well. It has embraced the potential of async/await, and the flexibility of the asyncio library has proven a huge boon. asyncio’s extensible API readily encourages alternative Event Loop implementations, and we now have libraries like uvloop, an asyncio-compatible implementation of the Event Loop built on libuv. Additionally, when it comes to web frameworks, there couldn’t be more options to choose from, and there are decisive benchmarks out there pitting all of them head-to-head.

However, when it comes to building a new application, there is decidedly little chatter about what these benchmarks mean for you in the context of how your application will be deployed. Will it be deployed via a cloud-based Virtual Machine? Directly to a server? What about Kubernetes or Docker Swarm?

At Xandr, we’ve gone all-in on Kubernetes. If you’re reading this post, odds are you’re in a similar boat. With that in mind, this post investigates 3 different deployment configurations of an otherwise identical application in an attempt to determine the ideal deployment configuration for your asynchronous web service.


This post assumes:

  1. You’ve already made the (wise) decision to use an async framework for your web service

For the purposes of this post, I chose the aiohttp framework for its maturity and stability, but the general rules provided here should be applicable to any open-source framework on the market today.

It’s all about Scaling

When we talk about scaling, we generally refer to one of two major approaches:

  1. Horizontal Scaling: scaling across machines and/or environments
  2. Vertical Scaling: scaling within a single machine, by using more of its available CPU cores

What We’re Testing

More traditional deployments require a mix of vertical and horizontal scaling, with an emphasis on vertical — by way of maximizing the use of available CPU cores on your machine. For Python web-services, that usually means running your application behind Gunicorn or another similar solution in production. I agree that for these environments, this is definitely the appropriate strategy.

When your application is deployed on Kubernetes, it runs in the foreground in small Docker containers scheduled in Pods, each with a fraction of a CPU and minimal memory. Kubernetes takes advantage of horizontal scaling. Rather than ramp up the number of threads or worker processes on a single Pod, you scale the number of Pods to meet demand.

If done properly, developing and deploying with Docker can provide us with a very powerful guarantee:

  • The application run-time you develop and debug with is the application run-time you deploy.

With this in mind, I set out to answer the following question:

  • In the context of Kubernetes, is the addition of a run-time dependency for deployment worth the additional overhead and/or risk?

Application Implementation & Design

I implemented a simple, RESTful API supporting GET/PUT/POST/DELETE using the following libraries:

  1. Server: aiohttp

Additionally, I installed the following libraries to improve overall performance:

  1. aiodns (via aiohttp[fast])

aiodns & cchardet are used automatically by aiohttp if they're available, so they require no changes to your code. uvloop can be enabled by calling uvloop.install() at the top of your entry-point module (or under your if __name__ == "__main__" guard if you prefer). Be sure you're not creating a global for your loop before doing this (or ever, really 😄)!
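
As a rough illustration of the uvloop advice above, here is a minimal sketch. It assumes uvloop is installed in the real service; the try/except fallback and the main() coroutine are my own additions so the snippet still runs where uvloop is missing, and they are not taken from the benchmarked application.

```python
import asyncio

try:
    import uvloop  # optional dependency; assumed to be installed in the real service
except ImportError:  # fall back to the stdlib event loop so the sketch still runs
    uvloop = None


async def main() -> str:
    # Look the loop up inside the coroutine instead of keeping a module-level
    # global, so it is always the loop that is actually running.
    loop = asyncio.get_running_loop()
    return type(loop).__module__


if uvloop is not None:
    # Swap the event loop policy before any loop has been created.
    uvloop.install()

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the loop is only ever obtained from inside running code, there is no global loop object that could have been created before the policy was swapped.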

Application Runtime

Now that we’ve got our application, it’s time to figure out how to run it in production. For the purpose of this post, I set up two application entry-points:

  1. Directly, by calling python (using aiohttp.web.run_app), or…
  2. Behind Gunicorn, configured to use a single aiohttp.GunicornUVLoopWebWorker
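
For the Gunicorn entry-point, the settings could live in a Gunicorn config file, which is plain Python. This is only a sketch of what such a file might look like: the file name, bind address, and port are assumptions of mine, and only the worker class and the single-worker count come from the setup described above.

```python
# gunicorn.conf.py (hypothetical file name); Gunicorn reads these
# module-level names as settings.

# Address to listen on inside the container. The port is an assumption.
bind = "0.0.0.0:8080"

# One worker process per Pod, matching the setup described above.
workers = 1

# aiohttp's Gunicorn worker class that runs on uvloop.
worker_class = "aiohttp.GunicornUVLoopWebWorker"
```

It would be started with something like gunicorn -c gunicorn.conf.py my_module:app, where my_module:app is a placeholder for wherever your aiohttp application object lives.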

The application itself is built on a Docker image using a multi-stage build with a Python/Alpine-Linux base image to ensure the image is as small as possible.

It should be noted that aiohttp mentions in its documentation that running an aiohttp server behind Gunicorn will result in slower performance.

Application Deployment

Both applications were deployed using ankh behind an Nginx Ingress, with identical Service definitions, and the following resource profiles:

replicas: 10
limits:
  cpu: 1
  memory: 512Mi
requests:
  cpu: .1
  memory: 256Mi

By The Numbers

With my applications deployed and bugs squashed, it’s now time to get a feel for how these two services will run.

Application Performance

All benchmarks below were run using hey, set to 200 concurrent connections hammering our servers for 30s. There was no rate-limiting implemented, as our goal was to determine deployment performance under high-stress and full resource utilization.
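
To make the benchmark setup concrete, here is a small sketch that builds the hey command line for one run. The URL is a placeholder and the helper name is hypothetical; the exact invocations used for this post, including any request bodies for POST and PUT, are not shown here, so treat this as an approximation.

```python
from typing import List


def hey_command(url: str, method: str = "GET",
                concurrency: int = 200, duration: str = "30s") -> List[str]:
    """Build the argv list for one hey run (hypothetical helper)."""
    return [
        "hey",
        "-z", duration,          # run for a fixed duration
        "-c", str(concurrency),  # number of concurrent workers
        "-m", method,            # HTTP method
        url,
    ]


if __name__ == "__main__":
    # One command per operation that is measured in this post.
    for verb in ("GET", "POST", "PUT"):
        print(" ".join(hey_command("http://example.invalid/items", verb)))
```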

We set the following SLAs for our servers:
1. GET: 99.9% under 100 ms
2. POST: 99.9% under 150 ms
3. PUT: 99.9% under 200 ms
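
One way to check a run against these SLAs is sketched below. It assumes you have the per-request latencies in milliseconds as a plain list; the nearest-rank percentile is my choice for illustration and is not necessarily how the charts in this post were produced.

```python
import math
from typing import Dict, List

# SLA thresholds from the list above: 99.9% of requests under N milliseconds.
SLA_MS: Dict[str, float] = {"GET": 100.0, "POST": 150.0, "PUT": 200.0}


def percentile(latencies_ms: List[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least a fraction q of the data at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def meets_sla(method: str, latencies_ms: List[float], q: float = 0.999) -> bool:
    """True if the q-th percentile latency is strictly under the method's threshold."""
    return percentile(latencies_ms, q) < SLA_MS[method]
```

With 1,000 samples, a single slow request still passes the 99.9% check, while two slow requests fail it.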

[Figure: Requests Per Second, Head to Head]
[Figure: Response Time Distribution within 99.9%, GET, 1 ms Buckets]
[Figure: Response Time Distribution within 99.9%, POST, 1 ms Buckets]
[Figure: Response Time Distribution within 99.9%, PUT, 1 ms Buckets]
[Figure: Head-to-Head Distribution, All Quantiles]

Resource Utilization

For the bare aiohttp deployment, the replica set ran at ~1.15Gi Memory and <.01 CPU overall (~115Mi Memory and ~0 CPU per pod). While under load, the CPU limit of 7 was utilized between 90–100% (around 90% for the GET test, 100% for the PUT), but memory usage never grew beyond 1.5Gi, well under our 5Gi limit.

The Gunicorn deployment consistently used about 30% more memory and CPU utilization was marginally higher, about 95%-105%*.

*Kubernetes enforces CPU limits by throttling your container rather than killing it, which is what happens when a memory limit is exceeded. This means that you may see occasional spikes slightly above your configured limit. I found this article helpful in understanding this mechanism.

Initial Assessment

All-in-all, the performance of the two deployments is nearly identical, and the slight service degradation introduced with Gunicorn isn’t necessarily a deal-breaker, depending upon the SLAs your particular application must meet. However, if Gunicorn is, in fact, hampering the performance and reliability of your application in this deployment architecture, should it be used at all?

Additional Benchmarks

With all this data under my belt, I decided to see if I could test a more “standard” Gunicorn-style deployment in order to take advantage of Gunicorn’s ability to scale vertically, following the age-old rule-of-thumb mentioned in the Gunicorn documentation.

I landed on the following resource profile for the Gunicorn deployment:

replicas: 2
limits:
  cpu: 5
  memory: 3Gi
requests:
  cpu: 5
  memory: 2Gi

With 11 workers per Pod, giving us a total limit of 10 CPU, 6Gi Memory, and 22 workers for the replica set.
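
The arithmetic behind these totals can be sketched as follows. I'm assuming the 11 workers come from the Gunicorn documentation's (2 x cores) + 1 rule of thumb applied to the 5-CPU limit; the function and constant names are hypothetical.

```python
def gunicorn_workers(cores: int) -> int:
    """Rule of thumb from the Gunicorn docs: (2 x cores) + 1."""
    return 2 * cores + 1


# Per-Pod limits from the resource profile above.
REPLICAS = 2
CPU_LIMIT_PER_POD = 5
MEMORY_LIMIT_GI_PER_POD = 3

workers_per_pod = gunicorn_workers(CPU_LIMIT_PER_POD)

# Totals across the whole replica set.
totals = {
    "cpu": REPLICAS * CPU_LIMIT_PER_POD,
    "memory_gi": REPLICAS * MEMORY_LIMIT_GI_PER_POD,
    "workers": REPLICAS * workers_per_pod,
}

if __name__ == "__main__":
    print(workers_per_pod, totals)
```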

Application Performance

Here are the charts we saw above, with this deployment in the mix…

[Figure: Response Time Distributions within 99.9%, GET, 1 ms Buckets]
[Figure: Response Time Distributions within 99.9%, POST, 1 ms Buckets]
[Figure: Response Time Distributions within 99.9%, PUT, 1 ms Buckets]
[Figure: Head-to-Head Distribution, All Quantiles]

With a total of 22 workers over 2 Pods in the Replica Set, this deployment maxed out its 10 CPU limit and consistently ran at ~3.5Gi memory. That’s ~43% more CPU and 2x more memory.
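
For reference, the comparison can be reproduced from figures quoted earlier in this post. This sketch assumes the baseline is the bare aiohttp deployment under load, with a CPU limit of 7 and memory that never grew beyond 1.5Gi; the post does not state which baseline the 2x figure uses, so the memory ratio here is only indicative.

```python
def percent_more(new: float, baseline: float) -> float:
    """Relative increase of new over baseline, as a percentage."""
    return (new - baseline) / baseline * 100.0


# CPU: 10 CPU used here vs. the CPU limit of 7 quoted for the bare aiohttp test.
cpu_increase = percent_more(10, 7)

# Memory: ~3.5Gi here vs. the 1.5Gi upper bound quoted for bare aiohttp (assumed baseline).
memory_ratio = 3.5 / 1.5

if __name__ == "__main__":
    print(f"{cpu_increase:.0f}% more CPU, {memory_ratio:.1f}x the memory")
```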

Not only that, this deployment couldn’t even touch the previous two in terms of performance and reliability, and was far outside our SLAs for all operations. One could argue that scaling up on each Pod or scaling out the Replica Set would improve this, and they’d be correct. However, at this point we’re already using significantly more resources to achieve a sub-par result, and scaling up or out to match the performance of the alternative deployments goes against the core mindset of Kubernetes deployments: small, lightweight containers which can scale out on-demand.

Final Assessment

While no two applications are the same, I believe that the data above shows the fallacy of choosing a deployment strategy based solely upon historical solutions. While Gunicorn didn’t necessarily hamper the performance of our application if deployed correctly, its usage came at the cost of:

  1. An additional dependency that changes the run-time of your application in production vs your run-time in development.

My recommendation (if you haven’t guessed it already) is to forgo this production dependency altogether. Deploying a web service on Kubernetes behind Gunicorn provides no additional benefit with regard to performance or stability, and comes at the cost of greater resource needs.



Thanks to Ahmed Abdalla and Shreyas Prasad
