Django Monolith on Overdrive: How Performance Tuning Slashed First Request Latency by 40x
At JULO, we face the challenges of a monolith application, which comes with the drawback of numerous dependencies that must be loaded before even a basic endpoint can execute. The JULO backend service is a monolith encompassing over 200 library dependencies and more than 100 Django sub-apps, all within a single Django application. This abundance of imports, both libraries and sub-apps, can introduce performance overhead during the initial request. This article delves into JULO’s approach to addressing and overcoming the issue of first request latency.
Why does the first request latency matter?
We define “the first request” as the initial user request received by the application server immediately after it starts or restarts. This is a critical moment, as any delay during this first interaction can significantly impact the user experience of the system’s responsiveness.
In an ideal scenario, the issue of first request latency would be inconsequential if our web server never required restarting. Unfortunately, this is rarely the case in the real world. As a technology company, our primary focus revolves around consistently delivering product enhancements to meet the needs of our users. With a constantly expanding user base, scaling our product by introducing new servers becomes an unavoidable necessity. Consequently, restarting or spawning a new backend (BE) service becomes a routine and integral part of our operations. It is of utmost importance to us that our users do not encounter any performance degradation, particularly during the crucial first request after the service is started.
How did we find this issue?
While investigating high API latency in Datadog, we discovered an anomaly related to the Django built-in middleware, CommonMiddleware: it was taking more than 10 seconds to finish. We also observed the same issue when testing our Health Check API, which should ideally respond in less than 50ms.
To identify the root cause, we initially attempted to use django-silk, a profiling tool. However, since django-silk’s middleware needed to be executed first, it wasn’t suitable for this particular scenario. Instead, we decided to use cProfile and integrated the profiler logic into CommonMiddleware itself. The modified code can be found below:
```python
# site-packages/django/middleware/common.py
class CommonMiddleware(object):
    def process_request(self, request):
        import cProfile as profile
        profiler = profile.Profile()
        # Start profiling the execution.
        profiler.enable()
        # ... the built-in logic
        # Stop profiling.
        profiler.disable()

        import pstats
        # Print the stats, ordered by cumulative time.
        with open('./output.profile', 'w') as output:
            ps = pstats.Stats(profiler, stream=output).sort_stats('cumtime')
            ps.print_stats()
        # Print the functions each function called.
        with open('./output_callees.profile', 'w') as output:
            ps = pstats.Stats(profiler, stream=output)
            ps.print_callees()
        # Print the functions that called each function.
        with open('./output_callers.profile', 'w') as output:
            ps = pstats.Stats(profiler, stream=output)
            ps.print_callers()
```
By analyzing the profiler output, guided by Profiling python with cProfile, we were able to gather valuable insights:
- Duration of each function call (percall, inclusive and exclusive).
- Number of times a function was called (ncalls).
- Actual time taken by a function (tottime, excluding time spent in sub-calls).
- Total time taken by a function (cumtime, including time spent in sub-calls).
- Functions that call a given function (callers).
- Functions called by a given function (callees).
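The fields above can be reproduced with a small self-contained sketch (the function names here are illustrative, not JULO’s actual code):

```python
import cProfile
import io
import pstats

def helper(n):
    # A busy loop so the profiler has something measurable to record.
    return sum(i * i for i in range(n))

def handler():
    # Stand-in for a request handler that calls a helper twice.
    return helper(50_000) + helper(50_000)

profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumtime")
stats.print_stats(10)  # ncalls / tottime / percall / cumtime per function
report = buffer.getvalue()
print(report)
```

Sorting by cumtime surfaces the functions whose entire call trees are expensive, which is usually the right first view when hunting an end-to-end latency problem.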
From the output.profile file, we identified that our logging library, JuloLog, took 16s to finish.
To further investigate, we examined the output_callees.profile file and discovered that the usage of Python’s inspect library was causing the slowdown.
To verify this, we removed the inspect library usage and re-executed the test, which resulted in a significant decrease in execution time, from 16 seconds to 0.5 seconds.
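To illustrate why inspect can be so costly in a logger’s hot path, here is a hedged sketch (this is not JuloLog’s actual code): inspect.stack() materializes a FrameInfo for the entire call stack, reading source files along the way, while sys._getframe() grabs a single frame object with no file I/O.

```python
import inspect
import sys
import timeit

def caller_via_inspect():
    # inspect.stack() builds FrameInfo objects for the whole stack,
    # including source-file reads -- expensive on every log call.
    return inspect.stack()[1].function

def caller_via_getframe():
    # sys._getframe() returns one frame object with no source I/O.
    return sys._getframe(1).f_code.co_name

def log_site():
    # Both approaches should report this function as the caller.
    return caller_via_inspect(), caller_via_getframe()

slow = timeit.timeit(caller_via_inspect, number=1000)
fast = timeit.timeit(caller_via_getframe, number=1000)
print(f"inspect.stack(): {slow:.3f}s  sys._getframe(): {fast:.3f}s")
```

Note that sys._getframe() is a CPython implementation detail; if caller information is truly needed, it is still far cheaper than walking the full stack on every log statement.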
Based on the above result, we determined that another issue was the import of the juloserver.urls module. To resolve this, we moved the import statement to after the initialization of the app and ran another test.
While the above steps improved the performance of the first request, we noticed that the API call itself still took 400ms. To investigate further, we profiled Django’s application handler using the same approach as in CommonMiddleware:
```python
# /lib/python3.7/site-packages/django/core/handlers/wsgi.py
class WSGIHandler(base.BaseHandler):
    initLock = Lock()
    request_class = WSGIRequest

    def __call__(self, environ, start_response):
        import cProfile as profile
        profiler = profile.Profile()
        # Start profiling the execution.
        profiler.enable()
        # ... the built-in logic
        profiler.disable()
        # Print the results as in the previous snippet.
```
Our tests revealed that the load_middleware function took 0.353s. We hypothesized that loading all middleware during the first request was the cause of the delay. The second request produced significantly faster API responses, supporting our assumption.
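The reason only the first request pays this cost is the lazy-initialization pattern that older Django WSGI handlers use (note the initLock in the snippet above). A minimal sketch of that pattern, with an illustrative class and a sleep standing in for the real middleware imports:

```python
import threading
import time

class LazyHandler:
    """Mimics a WSGI handler that defers heavy setup to the first call."""
    _init_lock = threading.Lock()
    _middleware = None

    def _load_middleware(self):
        time.sleep(0.05)  # stand-in for importing 17 middleware modules
        return ["middleware"] * 17

    def __call__(self, request):
        if self._middleware is None:          # only the first request pays
            with self._init_lock:             # double-checked locking
                if self._middleware is None:
                    self._middleware = self._load_middleware()
        return "ok"

handler = LazyHandler()

t0 = time.perf_counter()
handler("first")
first = time.perf_counter() - t0

t0 = time.perf_counter()
handler("second")
second = time.perf_counter() - t0
print(f"first: {first:.3f}s  second: {second:.6f}s")
```

Every request after the first skips the guarded branch entirely, which is exactly the latency cliff we observed between the first and second API calls.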
In conclusion, our investigation led us to identify three root causes contributing to the high latency issue:
- The usage of Python’s inspect library.
- Loading all of our monolith’s URL routes, which happens only on the first request.
- Loading all middleware registered in the MIDDLEWARE_CLASSES setting, which also happens on the first request. We have 17 middlewares that need to be loaded.
By addressing these root causes, we were able to improve the overall performance and responsiveness of our API consistently.
The Solution Summary
We implemented two solutions to address the issue. Firstly, we refactored our code to remove the usage of the inspect library. The inspect library was initially introduced to enhance the functionality and ease of use of our functions. Fortunately, this refactoring did not impact any business logic, allowing us to deploy the fix promptly. As a result, the maximum latency decreased from 16 seconds to 2 seconds.
The second solution involved loading all URL routes during Gunicorn’s master spawn process. Since we had enabled the preload_app option in Gunicorn, the spawned worker processes reuse the modules already loaded by the master process. To implement this solution, we modified the wsgi.py file to import the juloserver.urls module and load the middleware there:

```python
import juloserver.urls
application.load_middleware()
```
The updated content of our final wsgi.py file is as follows:

```python
import os, sys
# ....

if settings.DEBUG:
    application = StaticFilesHandler(get_wsgi_application())
else:
    application = get_wsgi_application()

# Preload the necessary resources to reduce the worker load on the first request.
import juloserver.urls  # The Django main URI route definitions
application.load_middleware()
```
After deploying these fixes, the maximum latency was further reduced from 2 seconds to 400 milliseconds. Additionally, we observed a reduction in memory footprint of approximately 500MB thanks to the preload_app configuration.
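For reference, preload_app is a one-line Gunicorn setting. A minimal, illustrative config sketch (the bind address and worker count here are hypothetical, not our production values):

```python
# gunicorn.conf.py -- illustrative sketch, not JULO's production config
bind = "0.0.0.0:8000"
workers = 4
# Load the application in the master process before forking workers,
# so each worker inherits the already-imported modules and shares
# memory pages copy-on-write instead of importing everything itself.
preload_app = True
```

The memory saving comes from copy-on-write: modules imported once in the master are shared across workers until a worker writes to those pages.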
These solutions successfully resolved the latency issue and improved the overall performance of our system. However, a max response time of 400ms in a production environment is still considered suboptimal. Therefore, it is necessary to conduct further investigation and optimization efforts to achieve more desirable response times and enhance the overall performance of our system.
Key Takeaways
The key takeaways from this experience are:
- The usage of Python’s inspect library was the most significant contributor to this latency issue. Assess whether it is truly necessary in production code; otherwise, avoiding it is recommended for optimal performance.
- cProfile has proven to be an excellent tool for identifying and isolating bottlenecks in application logic. By incorporating cProfile into your profiling toolkit, you can gain valuable insights and optimize your application’s performance with ease.
- Ensuring the necessary imports are loaded in wsgi.py is essential for reducing first request latency.