Concurrency in Scaling Real Python Web Apps

Nick Nathan
unified-engineering
6 min read · Mar 17, 2020

In this post I’ll describe how the team solved a real world scaling problem and discuss how it relates to some fundamental principles of memory management and parallel computation both in Python and at a systems level.

The Engine’s Broken, Help!

One of the core features of the Unified platform is detailed reporting on the state of a customer’s marketing campaigns. As Unified acquired more customers and added new features to the platform, the volume and complexity of these reports grew as well. Unfortunately, as a result of this growth our reporting infrastructure started to suffer from performance degradation and failures. Utilization of our reporting features is spiky: during certain periods, such as the end of a month or quarter, a large number of reports are likely to be requested all at once. During these periods our reporting system wasn’t handling the load effectively and would fail. Users would request a report through the platform, but the report would never get emailed to them. There was also a second problem: in addition to users not receiving the reports they requested, under high reporting loads other, seemingly unrelated parts of the application started to slow down as well.

Under the Hood

At the time, the team handled all reporting through a Python web server running under uWSGI, with one or two EC2 instances each running 16 web processes. When the team started to investigate the cause of the reporting failures, they found that the web server processes were running out of memory. When a report request came in, the process handling that request would spin off a new thread to execute the reporting logic concurrently. This worked for a time but began to degrade as the number and size of reporting requests increased. Furthermore, this Python application was not only handling all reporting requests but also serving as the API server for several other applications. So when high volumes of report requests came in, the web processes would get tied up handling the reporting, hurting the performance of other applications relying on the same server.
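In rough terms, the old pattern looked like the sketch below. This is an illustration rather than our actual code (the Flask app and the handler and function names are hypothetical), but the shape is the same: the same web process that serves API traffic hands each report request off to a new thread.

```python
import threading

from flask import Flask, jsonify

app = Flask(__name__)


def generate_report(report_id):
    # Placeholder for the heavy reporting work; in the real system this
    # loaded large datasets into memory and emailed the result to the user.
    ...


@app.route("/reports/<report_id>", methods=["POST"])
def request_report(report_id):
    # Spin off a new thread so the web process can respond immediately
    # and keep serving other traffic.
    worker = threading.Thread(target=generate_report, args=(report_id,), daemon=True)
    worker.start()
    return jsonify({"status": "report queued"}), 202


@app.route("/api/data")
def get_data():
    # An unrelated API endpoint served by the same web process.
    return jsonify({"data": []})
```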

Inside the Engine

Before diving into our solution, I think it’s worth a brief discussion of two Python concepts that sit at the core of these issues. The first is concurrency. To understand concurrency in Python we must briefly mention the infamous Global Interpreter Lock, or GIL for short. CPython, the reference Python interpreter (the C program that takes Python code, compiles it to bytecode, and then evaluates it), contains logic that only allows one thread at a time to execute Python bytecode. A single global mutex prevents any other threads from evaluating Python code in parallel. What are the implications of this? Only one Python thread can be running at any given moment; all other threads must be either sleeping or waiting on I/O.
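A quick way to see the GIL’s effect is to throw two threads at a CPU-bound loop and notice that the wall-clock time barely improves over running the same work serially. The script below is a toy illustration, not part of our reporting code:

```python
import threading
import time


def cpu_bound(n=10_000_000):
    # A pure-Python loop: the GIL ensures only one thread runs this at a time.
    total = 0
    for i in range(n):
        total += i
    return total


# Run the work twice serially.
start = time.perf_counter()
cpu_bound()
cpu_bound()
print(f"serial:  {time.perf_counter() - start:.2f}s")

# Run the same work in two threads. Because of the GIL, the threaded
# version takes roughly as long as the serial one for CPU-bound work.
start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"threads: {time.perf_counter() - start:.2f}s")
```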

How does this relate to our problem? Take the scenario where one client requests a report from the web server and another client requests data from a different API endpoint on the same web server. Even though the first request is running in a different thread, the web process’s interpreter still has to switch between evaluating the Python code for the report request and the Python code for the API request. At this point you might be able to guess what happens next: as the volume of requests goes up, the number of interpreters stays the same, and the time it takes to handle all of the requests goes up with it.

An astute reader might see this problem and suggest that if the interpreter is a bottleneck, one could increase the number of web processes from 16 to some greater number. While this approach does introduce a higher level of parallelism by circumventing the GIL, it has its own drawbacks. One of those drawbacks is the impact on memory, which brings us to our second Python concept: memory management. One of the big differences between threads and processes is the way memory is managed. Whereas separate processes are each allocated their own memory, threads exist within a process and thus share the same memory. Python’s multiprocessing library has two ways to create new processes: the first creates a new process by copying, or forking, an existing one; the second spawns a completely new process. Regardless of how new processes are created, however, they are less memory efficient than threads and somewhat slower to create. Why is this relevant to our reporting issue? Most of the reports generated by our web server load multiple large datasets into memory for processing, usually in the form of pandas DataFrames, and, as we observed earlier, the main source of reporting failures was out-of-memory errors.
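As an aside, the fork and spawn start methods mentioned above look roughly like this with the standard multiprocessing module. This is a minimal sketch; note that the fork method is only available on Unix-like systems:

```python
import multiprocessing as mp


def work(x):
    return x * x


if __name__ == "__main__":
    # "fork" copies the parent process (the default on Linux);
    # "spawn" starts a completely new interpreter (the default on macOS and Windows).
    for method in ("fork", "spawn"):
        ctx = mp.get_context(method)
        with ctx.Pool(processes=2) as pool:
            print(method, pool.map(work, range(4)))
```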

In the Shop

Given that simply increasing the number of web server processes running on our EC2 instances wouldn’t solve the memory and performance issues, we had to come up with a more scalable solution. So what did the team implement? A queue! This is a very common programming pattern and, despite being simple, is an extremely effective approach to solving the problems introduced by parallel computation.

The resulting architecture decoupled the processes responsible for handling API requests from those responsible for processing reporting requests. When a report request came in, it was routed to a Kafka queue. A group of EC2 instances running a consumer process could then subscribe to the queue and generate reports independently of one another. Most importantly, each consumer process ran on its own EC2 instance to ensure it had enough available memory, and the instances themselves were housed in an autoscaling group. Now, when a large number of report requests come in, the autoscaling group can spin up new instances to meet demand. Reports no longer fail due to memory constraints and have no performance impact on the processes handling critical API requests.
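Sketched in code, the two halves of that architecture look roughly like the snippet below. It uses the kafka-python client, and the broker address, topic name, and message format are illustrative rather than our actual configuration:

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

BROKERS = ["localhost:9092"]   # placeholder broker address
TOPIC = "report-requests"      # illustrative topic name


# Web-process side: enqueue the report request and return immediately.
def enqueue_report(report_id, user_email):
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"report_id": report_id, "email": user_email})
    producer.flush()


# Reporting-instance side: consume requests and generate reports,
# completely decoupled from the API-serving web processes.
def run_report_worker():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="report-workers",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        request = message.value
        generate_report(request["report_id"])  # the heavy pandas work happens here


def generate_report(report_id):
    ...  # load datasets, build the report, email it to the user
```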

Brand New Ride

I thought this was an interesting design problem to discuss because it touches on so many core programming principles. Despite being relatively simple, it illustrates several important concepts that are usually discussed in more theoretical or abstract terms. What the team implemented at a systems level is analogous to something you might see on a much smaller scale in code. I think it’s cool to see how the same solution can be applied to solve the same problem on both the micro and macro levels. Here’s an example of a simple Python script using a queue and applying the same concurrency principles to solve the “reporting problem.”
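This is a minimal sketch rather than our production code, and the function names are illustrative, but it mirrors the architecture above: a queue decouples the code that requests reports from a pool of worker processes that generate them, each with its own memory and its own interpreter.

```python
import multiprocessing as mp


def generate_report(report_id):
    # Placeholder for the heavy, memory-hungry reporting work.
    return f"report {report_id} done"


def worker(queue):
    # Each worker is its own process with its own memory and interpreter,
    # so report generation doesn't block the request-handling side.
    while True:
        report_id = queue.get()
        if report_id is None:  # sentinel: no more work
            break
        print(generate_report(report_id))


if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(2)]
    for w in workers:
        w.start()

    # The "web process" side: enqueue report requests and move on.
    for report_id in range(5):
        queue.put(report_id)
    for _ in workers:
        queue.put(None)  # one sentinel per worker

    for w in workers:
        w.join()
```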

I hope you all enjoyed this exploration of concurrency problems viewed through the lens of a real reporting performance issue experienced by the engineering team here at Unified. If you’re a developer looking to work on scaling problems at the intersection of marketing and technology, be sure to check out Unified at https://unified.com/about/careers-and-culture!

Nick Nathan
unified-engineering

Building apps and technical infrastructure for startups and growing businesses.