Our Tango with Django!
A short tale on how we scaled our Django backend to serve the next generation of learners
Often known as “the web framework for perfectionists with deadlines”, Django is a high-level web framework created for swift web project development. Built with Python, it is not the outcome of academic research, nor the brainchild of a developer chasing theoretical purity. Django was created in a newsroom environment, where today matters much more than clever. It has its rough edges, for sure, but its pragmatic approach to getting things done is why we chose it when we started building Unacademy. Rapid innovation is one of our core principles.
Primarily hosted on AWS, we deploy a wide range of technologies to enable optimum resource utilisation. At the forefront, an NGINX server handles the incoming request traffic, serving static content and passing dynamic requests on to the Django application. However, NGINX cannot talk to the application directly. This is where Gunicorn enters the scene: a Python WSGI HTTP server that runs application workers, feeds them requests and returns their responses. Gunicorn creates a Unix socket and serves responses to NGINX via the WSGI protocol, with bidirectional data flow through the socket.
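The WSGI contract that Gunicorn speaks is quite small. As a minimal sketch (this is a toy callable, not our actual application), a WSGI app is just a function taking the request environment and a response starter:

```python
def application(environ, start_response):
    """Minimal WSGI app: Gunicorn workers invoke this once per request.

    `environ` carries the request (method, path, headers); `start_response`
    is how the app hands the status line and headers back to the server.
    """
    body = b"Hello from a WSGI worker\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]  # an iterable of bytes forms the response body
```

Django generates an equivalent callable in its `wsgi.py`, which Gunicorn loads with something like `gunicorn myproject.wsgi:application`.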
Our tech stack had solved most of our problems, and we were able to focus squarely on product development. However, things began to change as we ventured into new categories, marketing campaigns took off and Unacademy became a household name.
With great popularity comes a greater need for infrastructural stability
Our user base nearly doubled within a short span, from 5.1M in Oct’18 to 10M+ and growing as of today. We started recording 1M+ daily views on a consistent basis, and API traffic rose to around 90K requests/min. This sudden ascent caught us off guard, and 5xx responses seemed to be at an all-time high. We were running more EC2 instances than ever, and costs were skyrocketing. Something had to be done, and fast.
By now we already had a few microservices up and running on a different stack with Go at its core, free of any performance issues. The looming question was whether to keep inventing ways to optimize Django, or to migrate the majority of our Python code out of the application into separate, independent services, probably in Go. The latter would require significant developer effort, since around 70% of our codebase resided in that Django repository, and it meant that we would not be able to innovate on the product front in the meantime.
At some other time and place, we might have gone ahead with it. But in a landscape where you either iterate or you die, it was a no-brainer.
The Django docs provided some useful insights on improving performance. We had already employed most of them to good effect, but the rest weren’t enough to move the needle for us. However, our research led us to some fruitful solutions. Here are the ones that made a difference:
The advantage with content platforms is that effective layers of caching can produce tremendous results, since a good portion of response fields are user-independent. Note that Django, by default, caches querysets and prevents continuous query bombardment of the database. However, it is the subsequent operations on the queryset, specifically serialization of the objects to compute inter-model dependent fields, that take up a significant amount of time. We used Cacheops to cache the serialized results of functions, with a timeout for eviction.
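Cacheops wires this pattern into Redis for us; the core idea is a function-level cache with timed eviction. A stdlib-only sketch of that idea (an in-process toy, not the Cacheops API) looks like:

```python
import functools
import time


def cached(timeout):
    """Cache a function's result per argument tuple, evicting after `timeout` seconds.

    A toy, in-process stand-in for what Cacheops does against Redis.
    """
    def decorator(fn):
        store = {}  # args -> (value, stored_at)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[1] < timeout:
                return hit[0]           # still fresh: skip recomputation
            value = fn(*args)
            store[args] = (value, now)  # (re)populate with a new timestamp
            return value
        return wrapper
    return decorator
```

In our setup the equivalent decorator wraps the expensive serializer helpers, so repeated requests within the eviction window never hit the database.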
Inheriting from the Python rest_framework's serializers.ModelSerializer class, we wrote our own set of CacheBasedSerializers to develop and support enhancements at will.
- Group serializer methods: To serialize a queryset or list of objects, we can pass the many=True flag to the serializer. The default behaviour is to iterate over the list, treat each instance as a single entity and serialize it. Serialization often required making expensive database calls to compute field values, and in a majority of cases these calls could be grouped together if the serializer had knowledge of the other objects in the list. And that’s exactly what we did: our custom serializers made a single database query for all the object_ids in the list, fetched the resulting values into a map keyed by object_id, and looked up each object’s value from that map during serialization.
- Cache, no-cache fields: With the exception of personalised content, most of the serialized data can be cached across users for a decent period of time. Caching the complete data often doesn’t work well, and not caching leaves us with redundant calculations. As a hybrid approach, we introduced the concept of no_cache_fields in our custom serializers, listing the fields that must be computed fresh on every request.
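Framework details aside, the grouping idea can be sketched without DRF. In this illustrative example, `fetch_likes_for` stands in for any expensive per-object database call that accepts a batch of ids and returns a map:

```python
def serialize_naive(object_ids, fetch_likes_for):
    # N objects -> N database calls: the pattern we moved away from
    return [{"id": oid, "likes": fetch_likes_for([oid])[oid]} for oid in object_ids]


def serialize_grouped(object_ids, fetch_likes_for):
    # One database call for all object_ids, results keyed by id
    likes_by_id = fetch_likes_for(list(object_ids))
    return [{"id": oid, "likes": likes_by_id[oid]} for oid in object_ids]
```

With many=True over a list of N objects, the grouped version issues one query where the naive version issues N.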
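A rough shape of the hybrid cache/no-cache split, with illustrative field names and a dict standing in for the shared cache (our real serializers inherit from DRF and use Redis):

```python
class HybridSerializer:
    """Serve user-independent fields from a shared cache; recompute
    no_cache_fields fresh on every request."""

    no_cache_fields = {"is_enrolled"}  # personalised: never shared across users

    def __init__(self, cache):
        self.cache = cache  # dict-like store, e.g. Redis-backed in production

    def compute_shared_fields(self, obj_id):
        raise NotImplementedError  # expensive, user-independent computation

    def compute_field(self, field, obj_id, user_id):
        raise NotImplementedError  # cheap-ish per-user computation

    def serialize(self, obj_id, user_id):
        data = self.cache.get(obj_id)
        if data is None:
            data = self.compute_shared_fields(obj_id)
            self.cache[obj_id] = data           # cacheable part, shared by all users
        result = dict(data)
        for field in self.no_cache_fields:      # always computed fresh
            result[field] = self.compute_field(field, obj_id, user_id)
        return result
```

Two different users hitting the same object share the cached portion while each gets their own personalised fields.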
Profiling our utilities with the cProfile module helped us diagnose bottlenecks in our code and some poorly written queries (the n+1 query problem). This opened up decent room for improvement, and we made good use of Django’s select_related and prefetch_related to cut down redundant queries.
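cProfile ships with the standard library, so a profiling pass needs no extra dependencies. A minimal example (the function being profiled here is a made-up stand-in for one of our utilities):

```python
import cProfile
import io
import pstats


def build_feed():
    # Stand-in for an expensive utility worth profiling
    return sorted(str(i) for i in range(1000))


profiler = cProfile.Profile()
profiler.enable()
feed = build_feed()
profiler.disable()

# Dump the top 5 functions by cumulative time into a string report
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

Sorting by cumulative time surfaces the call chains, like a serializer fanning out into per-object queries, where most of the wall-clock time actually goes.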
After a few iterations, we could see the results. Our P90 response time, averaged across major requests, was down by 47%, and all of our API calls were now under 700ms.
The next job was to increase the utilisation of our EC2 instances, which were then running at an average CPU usage of 30–40%.
Performant Gunicorn Config for Growth
- Python is, at its core, a single-threaded language because of the Global Interpreter Lock (GIL), which prevents two threads from executing Python code simultaneously in the same process. (Wait, what? I thought there is a threading module in Python for multithreading?!) Yes, there is. And if that seems confusing at first, you might want to check out this article.
- However, concurrency can still be achieved by using “pseudo-threads” implemented with coroutines. We use the Gevent library with Gunicorn to specify the number of connections (or pseudo-threads) per worker process that execute concurrently.
- Our application is predominantly I/O-bound, so using gevent threads had a major upside on performance. While one connection waits on I/O, it passes control on to the next.
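The cooperative handoff that gevent gives us can be illustrated with the standard library's asyncio (an analogy for the same coroutine model, not gevent itself; the "requests" and delays below are made up). Three simulated I/O waits overlap instead of running back-to-back:

```python
import asyncio
import time


async def fetch(name, delay):
    # While this "request" waits on I/O, control passes to the other coroutines
    await asyncio.sleep(delay)
    return name


async def main():
    # Launch three I/O-bound tasks; their waits overlap
    return await asyncio.gather(fetch("a", 0.05), fetch("b", 0.05), fetch("c", 0.05))


start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start  # close to 0.05s, not the 0.15s serial total
```

This is why coroutine workers shine on I/O-bound workloads: the waiting, not the computing, is what gets parallelised.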
- The two vital numbers driving performance here are:
— Number of Gunicorn workers per server (say wg), and
— Number of active connections per worker (say active_con)
According to popular notion, wg should be set to (2 * num_of_cores) + 1. However, this can vary; it’s more about trial and error, benchmarking and fine-tuning these numbers to find what works best for your scenario.
- Our CPU usage stats on an EC2 c5.4xlarge instance (16 cores), with config:
wg = (2 * 16) + 1 = 33
active_con = 10
As you can observe, the cores were not being used properly, and as a result CPU utilization remained low. We continued fine-tuning these values, replaying live traffic from production instances to benchmark our test server. And this is how our CPU stats look now:
The AWS EC2 service contributed 35% of our billing costs, and the aforementioned server- and code-level optimizations enabled us to reduce those costs significantly. We employed various other techniques for cost reduction across other services as well, but that’s a tale for another time!
Engineering at Unacademy involves more than just technical challenges: vaguely defined problems where you have to step into the shoes of the end user to figure out apt solutions. Unacademy is transforming the way people learn and perceive education. And this is just the beginning. We are merely flapping our wings, preparing to take flight. Want to join? Head over to our careers page for further details.