Beyond “Hello World”: Modern, Asynchronous Python in Kubernetes

Deploying Scalable, Production-Ready Web-Services in Python 3 on Kubernetes

Sean Stewart · Published in Geek Culture · Jul 22, 2019 · 7 min read


Python has undergone something of an evolution in the past few years. From Python 3.4 to 3.7 we have seen the introduction of asyncio, the introduction and formalization of async/await keywords, and re-investment in asyncio performance. Writing asynchronous code in Python has never been easier, more performant, or more efficient.

In addition to the improvements to the stdlib, Python’s Open-Source community has entered something of a Renaissance as well. The community has embraced the potential of async/await, and the flexibility of the asyncio library has proven a huge boon. asyncio’s extensible API readily encourages alternative Event Loop implementations, and we now have libraries like uvloop, an asyncio-compatible implementation of the Event Loop using libuv under the hood. Additionally, when it comes to web frameworks, there couldn’t be more options to choose from, and there are decisive benchmarks out there pitting all of them head-to-head.

However, when it comes to building a new application, there is decidedly little chatter about what these benchmarks mean for you in the context of how your application will be deployed. Will it be deployed via a cloud-based Virtual Machine? Directly to a server? What about Kubernetes or Docker Swarm?


This post assumes:

  1. You’ve already made the (wise) decision to use an async framework for your web service
  2. You are looking at Kubernetes for deploying your service.

For the purposes of this post, I chose the aiohttp framework for its maturity and stability, but the general rules provided here should be applicable to any open-source framework on the market today.

It’s all about Scaling

When we talk about scaling, we generally refer to one of two major approaches:

  1. Horizontal Scaling — scaling across machines and/or environments
  2. Vertical Scaling — scaling up on the resources of a given machine.

What We’re Testing

More traditional deployments require a mix of vertical and horizontal scaling, with an emphasis on vertical — by way of maximizing the use of available CPU cores on your machine. For Python web-services, that usually means running your application behind Gunicorn or a similar solution in production. For these environments, this is definitely the appropriate strategy.

When your application is deployed on Kubernetes, it runs in the foreground on small Docker containers scheduled in Pods with a fraction of a CPU and minimal memory. Kubernetes takes advantage of horizontal scaling. Rather than ramp up the number of threads or worker processes on a single Pod, you scale the number of Pods to meet demand.
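In Kubernetes, that scale-out is typically expressed declaratively with a HorizontalPodAutoscaler. The sketch below is purely illustrative — the names, replica counts, and CPU threshold are assumptions, not the configuration used in this post:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: aiohttp-api            # assumption: your Deployment's name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiohttp-api
  minReplicas: 10
  maxReplicas: 30
  targetCPUUtilizationPercentage: 80   # add Pods when average CPU exceeds 80%
```

The point is that capacity is added by scheduling more small Pods, not by growing the worker count inside any one of them.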

If done properly, developing and deploying with Docker can provide us with a very powerful guarantee:

  • The application run-time you develop and debug with is the application run-time you deploy.

With this in mind, I set out to answer the following question:

  • In the context of Kubernetes, is the addition of a run-time dependency for deployment worth the additional overhead and/or risk?

Application Implementation & Design

I implemented a simple, RESTful API supporting GET/PUT/POST/DELETE using the following libraries:

  1. Server: aiohttp
  2. Database: PostgreSQL
  3. DB Client: asyncpg

Additionally, I installed the following libraries to improve overall performance:

  1. aiodns (via aiohttp[fast])
  2. cchardet (via aiohttp[fast])
  3. uvloop

aiodns & cchardet are used automatically by aiohttp if they're available, so installing them requires no code changes. uvloop can be enabled by running uvloop.install() at the top of your entry-point module (or under your ifmain block if you prefer). Be sure you're not creating a global for your loop before doing this (or ever, really 😄)!

Application Runtime

Now that we’ve got our application, it’s time to figure out how to run it in production. For the purpose of this post, I set up two application entry-points:

  1. Directly, by calling python (using aiohttp.web.run_app), or…
  2. via Gunicorn, by calling gunicorn --config=guniconfig app_wsgi:app
  • Gunicorn was configured to use a single aiohttp.GunicornUVLoopWebWorker
  • Gunicorn was also configured with a max worker lifetime of 1000 requests, to combat the well-documented memory leak issues that can occur with long-lived workers.
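A sketch of that Gunicorn config file — the bind address and jitter value are my assumptions; the worker class and request cap come from the setup described above:

```python
# guniconfig.py -- sketch of the Gunicorn settings described above.
bind = "0.0.0.0:8080"  # assumption: container listens on 8080
workers = 1  # a single async worker; Kubernetes scales Pods, not processes
worker_class = "aiohttp.GunicornUVLoopWebWorker"
max_requests = 1000  # recycle each worker after 1000 requests to bound memory growth
max_requests_jitter = 50  # assumption: stagger recycling so workers don't restart in lockstep
```

It is loaded via the `--config` flag, as in the `gunicorn` command above.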

The application itself is built on a Docker image using a multi-stage build with a Python/Alpine-Linux base image to ensure the image is as small as possible.
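A rough sketch of such a multi-stage build — image tags, file names, and system packages here are assumptions, not the exact Dockerfile used:

```dockerfile
# Stage 1: build wheels where a compiler toolchain is available.
FROM python:3.7-alpine AS builder
RUN apk add --no-cache build-base libffi-dev
COPY requirements.txt .
RUN pip wheel --wheel-dir=/wheels -r requirements.txt

# Stage 2: install the pre-built wheels into a clean, minimal image.
FROM python:3.7-alpine
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links=/wheels /wheels/*
WORKDIR /app
COPY . .
CMD ["python", "app.py"]
```

The toolchain and build artifacts stay in the first stage, so the final image ships only the runtime and installed packages.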

It should be noted that aiohttp mentions in its documentation that running an aiohttp server behind Gunicorn will result in slower performance.

Application Deployment

Both applications were deployed using ankh behind an Nginx Ingress, with identical Service definitions, and the following resource profiles:

replicas: 10
limits:
  cpu: 1
  memory: 512Mi
requests:
  cpu: .1
  memory: 256Mi

By The Numbers

With my applications deployed and bugs squashed, it’s now time to get a feel for how these two services will run.

Application Performance

All benchmarks below were run using hey, set to 200 concurrent connections hammering our servers for 30s. There was no rate-limiting implemented, as our goal was to determine deployment performance under high-stress and full resource utilization.
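For reference, a hey invocation matching those settings might look like the following (the URL is an assumption):

```shell
# 200 concurrent connections, sustained for a 30-second duration.
hey -z 30s -c 200 -m GET http://my-service.example.com/widgets/1
```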

We set the following SLAs for our servers:

  1. GET: 99.9% under 100 ms
  2. POST: 99.9% under 150 ms
  3. PUT: 99.9% under 200 ms

[Chart: Requests Per Second — Head to Head]
[Chart: Response Time Distribution within 99.9% — GET — 1 ms Buckets]
[Chart: Response Time Distribution within 99.9% — POST — 1 ms Buckets]
[Chart: Response Time Distribution within 99.9% — PUT — 1 ms Buckets]
[Chart: Head-to-Head Distribution, All Quantiles]

Resource Utilization

For the bare aiohttp deployment, the replica set ran at ~1.15Gi Memory and <.01 CPU overall (~115Mi Memory and ~0 CPU per pod). While under load, the CPU limit of 7 was utilized between 90–100% (around 90% for the GET test, 100% for the PUT), but memory usage never grew beyond 1.5Gi, well under our 5Gi limit.

The Gunicorn deployment consistently used about 30% more memory and CPU utilization was marginally higher, about 95%-105%*.

*Kubernetes enforces CPU limits with throttling, not by killing your container, as with memory limits. This means that you may see occasional spikes slightly above your configured limit. I found this article helpful in understanding this mechanism.

Initial Assessment

All-in-all, the performance of the two deployments is nearly identical, and the slight service degradation introduced with Gunicorn isn’t necessarily a deal-breaker, depending upon the SLAs your particular application must meet. However, if Gunicorn is, in fact, hampering the performance and reliability of your application in this deployment architecture, should it be used at all?

Additional Benchmarks

With all this data under my belt, I decided to see if I could test a more “standard” Gunicorn-style deployment in order to take advantage of Gunicorn’s ability to scale vertically, following the age-old rule-of-thumb mentioned in the Gunicorn documentation.

I landed on the following resource profile for the Gunicorn deployment:

replicas: 2
limits:
  cpu: 5
  memory: 3Gi
requests:
  cpu: 5
  memory: 2Gi

This gives 11 workers per Pod, for a total limit of 10 CPU and 6Gi of memory, and 22 workers across the replica set.
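The per-Pod worker count follows the (2 × cores) + 1 rule of thumb from the Gunicorn documentation:

```python
def gunicorn_workers(cores: int) -> int:
    # Gunicorn's documented rule of thumb: (2 x $NUM_CORES) + 1.
    return 2 * cores + 1


# With the 5-CPU limit per Pod above:
print(gunicorn_workers(5))  # → 11 workers per Pod
```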

Application Performance

Here are the charts we saw above, with this deployment in the mix…
[Chart: Response Time Distributions within 99.9% — GET — 1 ms Buckets]
[Chart: Response Time Distributions within 99.9% — POST — 1 ms Buckets]
[Chart: Response Time Distributions within 99.9% — PUT — 1 ms Buckets]
[Chart: Head-to-Head Distribution, All Quantiles]

With a total of 22 workers over 2 Pods in the Replica Set, this deployment maxed out its 10 CPU limit and consistently ran at ~3.5Gi of memory. That’s ~43% more CPU and more than twice the memory.

Not only that, but this deployment couldn’t touch the previous two in terms of performance and reliability, and it fell far outside our SLAs for all operations. One could argue that scaling up each Pod or scaling out the Replica Set would improve this, and they’d be correct. However, at this point we’re already using significantly more resources to achieve a sub-par result, and scaling up or out to match the performance of the alternative deployments goes against the core mindset of Kubernetes deployments: small, lightweight containers that can scale out on-demand.

Final Assessment

While no two applications are the same, I believe the data above shows the fallacy of assuming a deployment strategy based purely on historical solutions. While Gunicorn didn’t necessarily hamper the performance of our application when deployed correctly, its usage came at the cost of:

  1. An additional dependency that changes the run-time of your application in production vs your run-time in development.
  2. Yet another layer to learn and debug — and to ensure your co-workers are familiar with as well.
  3. At least ~43% more CPU and 2⅓x more Memory if not configured properly, and about ~20% more Memory if done correctly.

My recommendation (if you haven’t guessed it already) is to forego this production dependency altogether. Deploying a web service on Kubernetes behind Gunicorn provides no additional performance or stability benefit, and it comes at the cost of greater resource consumption.