Django Monolith on Overdrive: How Performance Tuning Slashed First Request Latency by 40x

Wico Chandra · Published in julotech · 6 min read · Jul 18, 2023

At JULO, we face the challenge of a monolith application, which comes with the drawback of numerous dependencies that must be loaded to serve even a basic endpoint. The JULO backend service is a monolith encompassing over 200 library dependencies and more than 100 Django sub-apps, all within a single Django application. This abundance of imports, both libraries and sub-apps, can introduce performance overhead during the initial request. This article delves into JULO’s approach to addressing and overcoming first request latency.

Why does the first request latency matter?

We define “the first request” as the initial user request received by the application server immediately after it starts or restarts. This is a critical moment, as any delay during this first interaction can significantly shape the user’s perception of the system’s responsiveness.

In an ideal scenario, the issue of first request latency would be inconsequential if our web server never required restarting. Unfortunately, this is rarely the case in the real world. As a technology company, our primary focus revolves around consistently delivering product enhancements to meet the needs of our users. With a constantly expanding user base, scaling our product by introducing new servers becomes an unavoidable necessity. Consequently, restarting or spawning a new backend (BE) service becomes a routine and integral part of our operations. It is of utmost importance to us that our users do not encounter any performance degradation, particularly during the crucial first request after the service is started.

How did we find this issue?

While investigating high API latency in Datadog, we discovered an anomaly related to Django’s built-in CommonMiddleware: it was taking more than 10 seconds to finish. We observed the same issue when testing our Health Check API, which should ideally respond in under 50ms.

JULO’s Health Check’s flame graph from Datadog Dashboard

To identify the root cause, we initially attempted to use django-silk, a profiler tool. However, since django-silk’s middleware needed to be executed first, it wasn’t suitable for this particular scenario. Instead, we decided to use cProfile and integrated the profiler logic into CommonMiddleware itself. The modified code can be found below:

# site-packages/django/middleware/common.py
class CommonMiddleware(object):
    def process_request(self, request):
        import cProfile as profile
        profiler = profile.Profile()

        # Start profiling the execution.
        profiler.enable()

        # ... the original process_request logic ...

        # Stop profiling the execution.
        profiler.disable()

        import pstats

        # Print the stats, ordered by cumulative time (cumtime).
        with open('./output.profile', 'w') as output:
            ps = pstats.Stats(profiler, stream=output).sort_stats('cumtime')
            ps.print_stats()

        # Print which functions each function called.
        with open('./output_callees.profile', 'w') as output:
            ps = pstats.Stats(profiler, stream=output)
            ps.print_callees()

        # Print which functions called each function.
        with open('./output_callers.profile', 'w') as output:
            ps = pstats.Stats(profiler, stream=output)
            ps.print_callers()

By analyzing the profiler output, guided by the article Profiling python with cProfile, we were able to gather valuable insights (a short, self-contained sketch of how these metrics are produced follows the list):

  1. Duration of each function call (percall, inclusive, and exclusive).
  2. Number of times a function was called (ncalls).
  3. Actual time taken by a function (tottime, excluding times of other functions).
  4. Total time taken by a function (cumtime, including times of other functions).
  5. Functions called by a given function (callees).
  6. Functions that call a given function (callers).
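
To make these metrics concrete, here is a minimal, self-contained sketch of generating the same kind of report outside the middleware; the profiled function is just a stand-in workload, not JULO code:

import cProfile
import pstats


def slow_task():
    # Stand-in workload so the profiler has something to measure.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
slow_task()
profiler.disable()

# Write the report ordered by cumulative time, mirroring output.profile above.
with open('./example.profile', 'w') as output:
    stats = pstats.Stats(profiler, stream=output).sort_stats('cumtime')
    stats.print_stats()    # ncalls, tottime, percall, cumtime per function
    stats.print_callees()  # which functions each function called
    stats.print_callers()  # which functions called each function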

From the output.profile file, we identified that our library JuloLog took 16s to finish.

JuloLog was taking a cumulative time of 16 seconds.

To further investigate, we examined the output_callees.profile file and discovered that the usage of Python’s inspect library was causing the slowdown.

The JuloLog library executed ‘inspect.stack’ 32 times

To verify this, we removed the inspect library usage and re-executed the test, which resulted in a significant decrease in execution time from 16 seconds to 0.5 seconds.

The Django API Execution is 500ms

Based on the above result, we determined that another issue was the import of the juloserver.urls module. To resolve this, we moved the import statement to after the initialization of the app and ran another test.

The Django API Execution is 4ms

While the above steps improved the performance of the first request, we noticed that the API call itself still took 400ms. To investigate further, we profiled Django’s application handler using the same approach as with CommonMiddleware.

# /lib/python3.7/site-packages/django/core/handlers/wsgi.py
class WSGIHandler(base.BaseHandler):
    initLock = Lock()
    request_class = WSGIRequest

    def __call__(self, environ, start_response):
        import cProfile as profile
        profiler = profile.Profile()

        # Start profiling the execution.
        profiler.enable()

        # ... the built-in logic ...

        # Stop profiling the execution.
        profiler.disable()
        # Print the result like the previous one.

Our tests revealed that the load_middleware function took 0.353s. We hypothesized that loading all middlewares during the first request was the cause of the delay. The second request demonstrated significantly faster API responses, supporting our assumption.

The first request’s output result
The second request’s output results after load_middleware is executed.
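
For context, Django versions that still use the MIDDLEWARE_CLASSES setting build the middleware chain lazily inside the WSGI handler, which is exactly why only the first request pays this cost. The following is a simplified, self-contained illustration of that lazy-loading pattern, not Django’s actual source:

from threading import Lock


class LazyHandler:
    # Toy model of a WSGI handler that defers expensive setup to the first call.
    init_lock = Lock()

    def __init__(self):
        self._middleware_chain = None

    def load_middleware(self):
        # Stand-in for importing and instantiating every configured middleware;
        # with 17 middlewares, this is the work our first request was paying for.
        self._middleware_chain = ['middleware-%d' % i for i in range(17)]

    def __call__(self, request):
        if self._middleware_chain is None:
            with self.init_lock:
                # Double-checked locking: only one thread performs the load.
                if self._middleware_chain is None:
                    self.load_middleware()
        return 'response for %s' % request

Calling application.load_middleware() eagerly in wsgi.py, as shown later, moves this setup out of the request path entirely.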

In conclusion, our investigation led us to identify three root causes contributing to the high latency issue:

  1. The usage of Python’s inspect library.
  2. Loading all of our monolith’s URL routes, which only happens on the first request.
  3. Loading all middlewares registered in the MIDDLEWARE_CLASSES setting, which also only happens on the first request. We have 17 middlewares that need to be loaded.

By addressing these root causes, we were able to improve the overall performance and responsiveness of our API consistently.

The Solution Summary

We implemented two solutions to address the issue. Firstly, we refactored our code to remove the usage of the inspect library. The inspect library was initially introduced to enhance the functionality and ease of use of our functions. Fortunately, this refactoring did not impact any business logic, allowing us to deploy the fix promptly. As a result, the maximum latency decreased from 16 seconds to 2 seconds.

The max latency was reduced to 2s
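
JuloLog’s internals are not shown in this article, so the snippet below is only a hypothetical sketch of what such a refactor can look like: a helper that used inspect.stack() to discover the calling function’s name is switched to sys._getframe, which avoids building (and reading source context for) every frame on the stack.

import inspect
import sys


def caller_name_before():
    # Hypothetical original: inspect.stack() creates a FrameInfo for every frame
    # and reads source lines from disk, which is expensive on every log call.
    return inspect.stack()[1].function


def caller_name_after():
    # Hypothetical replacement: sys._getframe touches only the frame we need.
    # It is CPython-specific, so treat this as an assumption about the runtime.
    return sys._getframe(1).f_code.co_name


def some_business_logic():
    return caller_name_after()  # returns 'some_business_logic'

In our case the caller lookup was not essential to the business logic, so removing it outright was the simplest fix.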

The second solution involved loading all URL routes when Gunicorn spawns its master process. Since we had enabled the preload_app option in Gunicorn, the spawned worker processes reuse the modules already loaded by the master process. To implement this solution, we modified the wsgi.py file to import the juloserver.urls module and load the middlewares there:

import juloserver.urls  
application.load_middleware()

The updated content of our final wsgi.py file is as follows:

import os, sys

#....

if settings.DEBUG:
    application = StaticFilesHandler(get_wsgi_application())
else:
    application = get_wsgi_application()


# Preload the necessary resources to reduce the worker load on the first request.
import juloserver.urls  # The Django main URI route definition
application.load_middleware()

After deploying these fixes, the maximum latency was further reduced from 2 seconds to 400 milliseconds. Additionally, we observed a reduction in memory footprint of approximately 500MB due to the preload_app configuration.

The max latency was reduced to 400ms
The memory footprint decreased
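
For reference, here is a minimal sketch of the Gunicorn configuration involved; the worker count and module path are illustrative, not our production values:

# gunicorn.conf.py (illustrative, assuming the WSGI module is juloserver.wsgi)
workers = 4

# Load the application in the master process before forking, so the imports done
# in wsgi.py (juloserver.urls, load_middleware) happen once and workers share
# that memory copy-on-write instead of each repeating the work.
preload_app = True

Started with something like gunicorn -c gunicorn.conf.py juloserver.wsgi:application, the workers then begin serving with routes and middlewares already in place.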

These solutions successfully resolved the latency issue and improved the overall performance of our system. However, a max response time of 400ms in a production environment is still considered suboptimal. Therefore, it is necessary to conduct further investigation and optimization efforts to achieve more desirable response times and enhance the overall performance of our system.

Key Takeaways

The key takeaways from this experience are:

  1. The usage of Python’s inspect library was a significant contributor to this latency issue. Assess whether it is truly necessary in production code; otherwise, avoiding it is recommended for optimal performance.
  2. cProfile has proven to be an excellent tool for identifying and isolating bottlenecks in application logic. By incorporating cProfile into your profiling toolkit, you can gain valuable insights and optimize your application’s performance with ease.
  3. Ensuring the necessary imports are loaded in wsgi.py is essential for reducing first request latency.
