Identifying CPU-Bound Code in Asynchronous Python Web Applications

Dor Indivo
6 min read · Jul 19, 2023


In today’s data-driven world, feature stores have become a vital component of machine learning pipelines and models in production, enabling efficient storage, real-time computing, and retrieval of features for model training and inference.

However, as our feature store grew in size and complexity, we at Lemonade encountered a significant performance bottleneck that impacted our overall system efficiency (in the previous post, I shared how we migrated the feature store to an async architecture). After a thorough investigation, we discovered the root of the issue: CPU-bound code. In this article, we'll delve into how we identified and addressed that CPU-bound code in our feature store, enabling us to improve the performance of many core models in production.

But before we dive into the problem, let’s take a moment to understand one of the key metrics in any async application — the event loop lag.

If you’re not familiar with Python’s event loop, this is an excellent resource to help you get started. 🙏

The event loop lag

The event loop lag is the time between the scheduling of a task and its execution

What exactly is the event loop lag? It is the time between when a task is scheduled and when the event loop actually executes it, which tells us how long our asynchronous code keeps the loop occupied.

For example, suppose we schedule a task that fetches data from another web service and saves it to the database. The time that elapses between scheduling that task and the event loop actually running it is the ‘lag’.

To illustrate, imagine that one of your app’s requests holds the event loop for a long period of time, preventing it from handling any other requests during that time. Why does this happen? Because the entire asynchronous architecture relies on the assumption that our app primarily performs IO-bound operations, allowing the event loop to handle requests in the intervals between tasks.
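To make this concrete, here is a minimal sketch of that failure mode (the endpoints are hypothetical, not from our codebase): a single CPU-bound handler monopolizes the event loop, so even a trivial health-check request has to wait until the computation finishes.

from fastapi import FastAPI

app = FastAPI()


@app.get("/blocking")
async def blocking():
    # CPU-bound work runs on the event loop thread and blocks
    # every other coroutine until it finishes.
    total = sum(i * i for i in range(50_000_000))
    return {"total": total}


@app.get("/ping")
async def ping():
    # While /blocking is running, this handler cannot be scheduled,
    # even though it does almost no work.
    return {"ok": True}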

Monitoring the lag

Unfortunately, there isn’t a built-in API or statistic available for monitoring the exact event loop lag. However, you can calculate it manually. Let’s implement it from scratch, keeping in mind that we need to be cautious about what we include in this coroutine, as it will run for the entire application lifetime.

In this example, we'll make use of a sleep function that pauses for a specified duration (1 second). After the sleep returns, we'll calculate the difference between the time that actually elapsed and the desired sleep time (1 second); that difference is the lag.

import asyncio
import logging
from asyncio import AbstractEventLoop

log = logging.getLogger(__name__)


async def monitor_event_loop_lag(loop: AbstractEventLoop):
    start = loop.time()
    sleep_interval = 1

    while loop.is_running():
        await asyncio.sleep(sleep_interval)
        diff = loop.time() - start
        lag = diff - sleep_interval
        # send lag as a statsd metric
        if lag > 1:
            tasks = asyncio.all_tasks(loop)
            for task in tasks:
                # skip the monitoring task itself
                if task.get_coro().cr_code.co_name != "monitor_event_loop_lag":
                    log.warning(f"event loop lag: {lag}, task: {task}")
        start = loop.time()

Essentially, we’re trying to calculate how long the event loop was “stuck” in executing a particular task. If this duration exceeds 1 second, we will print all the tasks involved in order to identify the potentially critical CPU-intensive code.

To that end, we’ll need to add the task to the event loop of the FastAPI app. This enables us to track the execution time and detect any tasks that exceed the desired threshold.

import asyncio

import uvicorn

# create_app() is our application factory that returns the FastAPI app
app = create_app()


@app.on_event("startup")
async def startup_event():
    loop = asyncio.get_running_loop()
    loop.create_task(monitor_event_loop_lag(loop))


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)

Identifying the Performance Bottleneck

It all started with a latency degradation in one of our core models in production. The p90 and p95 latencies nearly tripled, and we did not know the underlying cause. Upon analyzing the model traces, we initially suspected that the bottleneck was in a different downstream service (the green area in the image below), since it appeared that the majority of the request time was not spent in the feature store (the blue area in the image).

However, upon revisiting the event loop warnings, we were startled to discover a correlation between the event loop lag warnings and the requests that experienced prolonged processing times. As you can see below, the lag reached 30 seconds, which means that the event loop was unable to handle requests for nearly that long 😢

Here we can see the requests to the model:

The root cause

Upon investigating the model’s features, we discovered that some features rely on downloading large files. However, as we previously discussed, this shouldn’t significantly impact the event loop performance since downloading files is generally an IO-bound operation.

As part of our migration process, we developed a unified HTTP client to handle both asynchronous and synchronous operations, aiming to simplify the workflow for our data scientists. However, we encountered an issue with the client.

Using the logs we added, we found that the unified HTTP client always parsed the JSON body of the HTTP response and returned a unified response object to the caller:

data = await resp.json()
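For illustration, a simplified, hypothetical sketch of that pattern might look like this (UnifiedResponse and UnifiedHTTPClient are illustrative names, not our actual implementation):

from dataclasses import dataclass
from typing import Any

import aiohttp


@dataclass
class UnifiedResponse:
    status: int
    data: Any


class UnifiedHTTPClient:
    def __init__(self, session: aiohttp.ClientSession):
        self._session = session

    async def request(self, method: str, url: str, **kwargs) -> UnifiedResponse:
        async with self._session.request(method, url, **kwargs) as resp:
            # The problem: every response body is parsed as JSON,
            # even when the caller only needs the raw bytes.
            data = await resp.json()
            return UnifiedResponse(status=resp.status, data=data)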

On the face of it, the json method is an asynchronous operation. However, let's delve into the code of the aiohttp library and examine how this method can impact the performance of the event loop.

async def json(
    self,
    *,
    encoding: Optional[str] = None,
    loads: JSONDecoder = DEFAULT_JSON_DECODER,
    content_type: Optional[str] = "application/json",
) -> Any:
    """Read and decodes JSON response."""
    if self._body is None:
        await self.read()

    if content_type:
        ctype = self.headers.get(hdrs.CONTENT_TYPE, "").lower()
        if not _is_expected_content_type(ctype, content_type):
            raise ContentTypeError(
                self.request_info,
                self.history,
                message=(
                    "Attempt to decode JSON with "
                    "unexpected mimetype: %s" % ctype
                ),
                headers=self.headers,
            )

    stripped = self._body.strip()  # type: ignore
    if not stripped:
        return None

    if encoding is None:
        encoding = self.get_encoding()

    return loads(stripped.decode(encoding))

The issue lies in the last line. For a very large payload, decoding the bytes and parsing the JSON are both CPU-bound operations, and they run directly on the event loop. This is why it is important to invoke this method only when necessary, specifically when the feature store requires the parsed JSON, and not for every response the feature store receives.
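One way to avoid the unnecessary work is to hand back the raw bytes and parse only when the caller actually needs a Python object; for sufficiently large payloads, the parse itself can also be offloaded to a worker thread so it doesn't stall the loop. Here is a minimal sketch of that idea, with illustrative function names (asyncio.to_thread requires Python 3.9+):

import asyncio
import json

import aiohttp


async def fetch_raw(session: aiohttp.ClientSession, url: str) -> bytes:
    # Reading the body is IO-bound and cooperates with the event loop.
    async with session.get(url) as resp:
        return await resp.read()


async def fetch_json(session: aiohttp.ClientSession, url: str):
    raw = await fetch_raw(session, url)
    # json.loads on a large payload is CPU-bound, so run it in a
    # worker thread to keep the event loop responsive.
    return await asyncio.to_thread(json.loads, raw)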

Results

By separating the client into synchronous and asynchronous variants and minimizing the number of times the feature store calls the json method, we were able to achieve a significant improvement in the model’s latency.

The image below provides a visual representation of the effect this change has on both the p90 and p95 latency metrics for one of our core models.

Summary

In conclusion, this post highlighted the significance of monitoring event loop lag and demonstrated how such monitoring can be implemented in async Python web apps.

By following the step-by-step implementation guide provided in this post, you can proactively detect and address performance bottlenecks in your async Python web apps. This proactive approach allows you to focus on optimizing those areas, improving the overall performance of your app.
