Beyond “Hello World”: Modern, Asynchronous Python in Kubernetes
Deploying Scalable, Production-Ready Web-Services in Python 3 on Kubernetes
Python has undergone something of an evolution in the past few years. From Python 3.4 to 3.7 we have seen the introduction of asyncio, the introduction and formalization of the async/await keywords, and re-investment in asyncio performance. Writing asynchronous code in Python has never been easier, more performant, or more efficient.
In addition to the improvements to the stdlib, Python's open-source community has entered something of a renaissance as well. The community has embraced the potential of async/await, and the flexibility of the asyncio library has proven a huge boon. asyncio's extensible API readily encourages alternative event-loop implementations, and we now have libraries like uvloop, an asyncio-compatible implementation of the event loop that uses libuv under the hood. Additionally, when it comes to web frameworks, there couldn't be more options to choose from, and there are decisive benchmarks out there pitting all of them head-to-head.
However, when it comes to building a new application, there is decidedly little chatter about what these benchmarks mean for you in the context of how your application will be deployed. Will it be deployed via a cloud-based Virtual Machine? Directly to a server? What about Kubernetes or Docker Swarm?
Assumptions
This post assumes:
- You’ve already made the (wise) decision to use an async framework for your web service
- You are looking at Kubernetes for deploying your service.
For the purposes of this post, I chose the aiohttp framework for its maturity and stability, but the general rules provided here should be applicable to any open-source framework on the market today.
It’s all about Scaling
When we talk about scaling, we generally refer to one of two major approaches:
- Horizontal Scaling — scaling across machines and/or environments
- Vertical Scaling — scaling up on the resources of a given machine.
What We’re Testing
More traditional deployments require a mix of vertical and horizontal scaling, with an emphasis on vertical — by way of maximizing the use of available CPU cores on your machine. For Python web-services, that usually means running your application behind Gunicorn or another similar solution in production. I agree that for these environments, this is definitely the appropriate strategy.
When your application is deployed on Kubernetes, it runs in the foreground on small Docker containers scheduled in Pods with a fraction of a CPU and minimal memory. Kubernetes takes advantage of horizontal scaling. Rather than ramp up the number of threads or worker processes on a single Pod, you scale the number of Pods to meet demand.
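For example, that demand-driven Pod scaling is typically expressed with a HorizontalPodAutoscaler. Here is a minimal sketch, with placeholder names and an assumed CPU threshold:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: aiohttp-api  # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiohttp-api
  minReplicas: 10
  maxReplicas: 50
  targetCPUUtilizationPercentage: 80  # assumed threshold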
If done properly, developing and deploying with Docker can provide us with a very powerful guarantee:
- The application run-time you develop and debug with is the application run-time you deploy.
With this in mind, I set out to answer the following question:
- In the context of Kubernetes, is the addition of a run-time dependency for deployment worth the additional overhead and/or risk?
Application Implementation & Design
I implemented a simple, RESTful API supporting GET/PUT/POST/DELETE using the following libraries:
- Server: aiohttp
- Database: PostgreSQL
- DB Client: asyncpg
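To make this concrete, here is a minimal sketch of the kind of handler this stack yields; the table, columns, and DSN are hypothetical stand-ins, not taken from the original project:

import asyncpg
from aiohttp import web

async def init_db(app: web.Application) -> None:
    # Create one connection pool per process at startup (DSN is a placeholder).
    app["pool"] = await asyncpg.create_pool(dsn="postgres://user:pass@db/items")

async def get_item(request: web.Request) -> web.Response:
    # Borrow a connection from the pool only for the duration of the query.
    async with request.app["pool"].acquire() as conn:
        row = await conn.fetchrow(
            "SELECT id, name FROM items WHERE id = $1",
            int(request.match_info["id"]),
        )
    if row is None:
        raise web.HTTPNotFound()
    return web.json_response(dict(row))

def create_app() -> web.Application:
    app = web.Application()
    app.on_startup.append(init_db)
    app.router.add_get("/items/{id}", get_item)
    return app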
Additionally, I installed the following libraries to improve overall performance: aiodns, cchardet, and uvloop.
aiodns and cchardet are used automatically by aiohttp if they're available, so enabling them requires no code changes. uvloop can be invoked by running uvloop.install() at the top of your app.py (or under your if __name__ == "__main__" block, if you prefer). Be sure you're not creating a global for your loop before doing this (or ever, really 😄)!
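A minimal sketch of that entry-point wiring (the port is an assumption):

import uvloop
from aiohttp import web

uvloop.install()  # swap in the libuv-based loop before any event loop is created

app = web.Application()
# ... register routes and startup hooks here, as in the earlier sketch ...

if __name__ == "__main__":
    web.run_app(app, port=8080)  # assumed port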
Application Runtime
Now that we’ve got our application, it’s time to figure out how to run it in production. For the purpose of this post, I set up two application entry-points:
- Directly, by calling python app.py (using aiohttp.web.run_app), or…
- Via Gunicorn, by calling gunicorn --config=guniconfig app_wsgi:app
- Gunicorn was configured to use a single aiohttp.GunicornUVLoopWebWorker
- Gunicorn was also configured with a max worker lifetime of 1000 requests, to combat the well-documented memory-leak issues that can occur with long-lived workers.
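The exact config file isn't shown in the post, but a guniconfig matching that description would look roughly like this; the bind address is an assumption:

# guniconfig.py (sketch)
bind = "0.0.0.0:8080"  # assumed bind address
workers = 1  # one worker per Pod; Kubernetes scales Pods, not processes
worker_class = "aiohttp.GunicornUVLoopWebWorker"
max_requests = 1000  # recycle each worker after 1000 requests to curb slow memory leaks

Here, app_wsgi is assumed to be a small module exposing the aiohttp web.Application instance as app.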
The application itself is built on a Docker image using a multi-stage build with a Python/Alpine-Linux base image to ensure the image is as small as possible.
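A sketch of what such a Dockerfile might look like; the Python version, dependencies, and paths are illustrative:

# --- build stage: compile C extensions (uvloop, asyncpg, cchardet) ---
FROM python:3.7-alpine AS builder
RUN apk add --no-cache build-base
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# --- runtime stage: copy only the installed packages and the app ---
FROM python:3.7-alpine
COPY --from=builder /install /usr/local
COPY app.py .
CMD ["python", "app.py"]  # runs in the foreground, as Kubernetes expects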
It should be noted that aiohttp mentions in its documentation that running an aiohttp server behind Gunicorn will result in slower performance.
Application Deployment
Both applications were deployed using ankh behind an Nginx Ingress, with identical Service definitions, and the following resource profiles:
replicas: 10
limits:
cpu: 1
memory: 512Mi
requests:
cpu: .1
memory: 256Mi
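Within the full Deployment manifest, that profile sits under each container spec; a minimal sketch with placeholder names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aiohttp-api  # placeholder
spec:
  replicas: 10
  selector:
    matchLabels:
      app: aiohttp-api
  template:
    metadata:
      labels:
        app: aiohttp-api
    spec:
      containers:
        - name: aiohttp-api
          image: registry.example.com/aiohttp-api:latest  # placeholder image
          resources:
            requests:
              cpu: .1
              memory: 256Mi
            limits:
              cpu: 1
              memory: 512Mi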
By The Numbers
With my applications deployed and bugs squashed, it’s now time to get a feel for how these two services will run.
Application Performance
All benchmarks below were run using hey, set to 200 concurrent connections hammering our servers for 30s. There was no rate-limiting implemented, as our goal was to determine deployment performance under high-stress and full resource utilization.
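Invocations along these lines reproduce that load profile; the URL and payload are placeholders:

hey -c 200 -z 30s http://api.example.com/items/1
hey -c 200 -z 30s -m PUT -T "application/json" -d '{"name": "widget"}' http://api.example.com/items/1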
We set the following SLAs for our servers:
1. GET: 99.9% under 100 ms
2. POST: 99.9% under 150 ms
3. PUT: 99.9% under 200 ms
Resource Utilization
For the bare aiohttp deployment, the replica set ran at ~1.15Gi memory and <.01 CPU overall (~115Mi memory and ~0 CPU per pod). While under load, the CPU limit of 7 was utilized at 90–100% (around 90% for the GET test, 100% for the PUT), but memory usage never grew beyond 1.5Gi, well under our 5Gi limit.
The Gunicorn deployment consistently used about 30% more memory and CPU utilization was marginally higher, about 95%-105%*.
*Kubernetes enforces CPU limits with throttling, not by killing your container, as with memory limits. This means that you may see occasional spikes slightly above your configured limit. I found this article helpful in understanding this mechanism.
Initial Assessment
All-in-all, the performance of the two deployments is nearly identical, and the slight service degradation introduced with Gunicorn isn’t necessarily a deal-breaker, depending upon the SLAs your particular application must meet. However, if Gunicorn is, in fact, hampering the performance and reliability of your application in this deployment architecture, should it be used at all?
Additional Benchmarks
With all this data under my belt, I decided to see if I could test a more "standard" Gunicorn-style deployment in order to take advantage of Gunicorn's ability to scale vertically, following the age-old rule of thumb mentioned in the Gunicorn documentation: workers = (2 × CPU cores) + 1.
I landed on the following resource profile for the Gunicorn deployment:
replicas: 2
limits:
cpu: 5
memory: 3Gi
requests:
cpu: 5
memory: 2Gi
With 11 workers per Pod, this gives the replica set a total limit of 10 CPU, 6Gi memory, and 22 workers.
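Relative to the earlier guniconfig sketch, the only change is the worker count:

# guniconfig.py for the "standard" deployment (sketch)
# Note: multiprocessing.cpu_count() would report the node's CPUs inside a
# container, not the Pod's limit, so the count is pinned explicitly.
workers = 2 * 5 + 1  # (2 x 5 CPUs) + 1 = 11 workers per Pod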
Application Performance
Here are the charts we saw above, with this deployment in the mix…
With a total of 22 workers over 2 Pods in the Replica Set, this deployment maxed out its 10 CPU limit and consistently ran at ~3.5Gi memory. That's ~43% more CPU and 2⅓x more memory than the bare aiohttp deployment.
Not only that, this deployment couldn't even touch the previous two in terms of performance and reliability, and was far outside our SLAs for all operations. One could argue that scaling up each Pod or scaling out the Replica Set would improve this, and they'd be correct. However, at this point we're already using significantly more resources to achieve a sub-par result, and scaling up or out to meet the performance of the alternative deployments goes against the core mindset of Kubernetes deployments: small, lightweight containers which can scale out on-demand.
Final Assessment
While no application is the same, I believe that the data above shows the fallacy of assuming a deployment strategy based upon historical solutions. While Gunicorn didn’t necessarily hamper the performance of our application if deployed correctly, its usage came at the cost of:
- An additional dependency that changes the run-time of your application in production vs your run-time in development.
- Yet another layer to learn and debug — and to ensure your co-workers are familiar with as well.
- At least ~43% more CPU and 2⅓x more memory if configured improperly, and about 20% more memory if configured correctly.
My recommendation (if you haven't guessed it already) is to forgo this production dependency altogether. Deploying a web service on Kubernetes behind Gunicorn provides no additional benefit with regard to performance or stability, and comes at the cost of greater resource needs.