Leverage Caching To Cushion Our Database

Anthony Ricardo Thiotanry · Published in julotech · 8 min read · Nov 3, 2022

How a proper caching implementation can alleviate strain on the database. Co-authored with Fathur Rohman.

Fetching and serving information around the web demands resources in the form of processing power and processing time. The bigger the data, the more it costs to serve, and the cost grows as more users flock to your services. As a service attracts more users, we must scale the system while maintaining its normal operation. For many startups, an expanding user base means that logic and queries that once looked passable must be reworked to be more optimal. But there is only so much you can do in code before teams find themselves worrying over micro-optimisation and racing into premature optimisation, and it takes an enormous amount of effort just to evaluate where optimisation should begin among the many services already released.

Facing similar problems, our engineers at JULO concluded that a caching strategy would let us alleviate strain on our database while simultaneously serving data faster. In this post, we detail our journey of introducing data caching to increase capacity while reducing load on the database.

What is caching? Suppose there is a book in the library that you want to read. Rather than going to the library every time you want to read it, you would probably borrow the book and keep it on your shelf. Now you no longer waste transportation fees and commuting time revisiting the library. In this analogy, the library is the database and the shelf is the cache. Simply put, caching stores frequently requested items closer to the people who want them, increasing access speed and reducing expended resources.

Motivation

A brief overview of our tech stack to get you on board: JULO's API layer consists of Python, Django, and Django Rest Framework, while our database layer relies on Django ORM and PostgreSQL. For this initiative, we rely on two monitoring tools: Datadog and Grafana.

At JULO, we develop services actively used by millions of users, including internal teams such as customer service and collections. As our application grew, we began experiencing spiking latency and throttled database pools, to the point where even our own team felt the slowness. Using Datadog as our monitoring tool, we can see that roughly over 3 trillion requests are served every month. One table consistently hit around 266 million fetch requests, accounting for almost 9% of all SQL queries in our database layer. Further investigation revealed around 32 million update requests on top of that, bringing the table to 12% of all requests in a given month.

Now, here's an idea: what if we refactor the entire codebase? We could filter our services and sort them by highest monthly request count in Datadog, analyse which services cost the most, dissect the business flows and map out the interconnected functions and processes, and finally optimise every piece of code slowly and methodically, starting from the top. Sure, we could do that. Or… we could start with something simpler and more cost-efficient.

Caching

To solve our issue, we needed a way to increase capacity by processing data faster. Upgrading the database is a band-aid solution that costs more than it's worth and doesn't address the underlying issue. Dissecting all the code and business flows is theoretically sound, but it costs too much, takes too long, and is prone to creating new bugs. We could also incrementally improve existing code, which sounds more doable but still takes forever. We were looking for a simpler, cheaper solution, and we needed it fast. So we turned to our old friend: caching.

Caching has long been part of our ecosystem but was never fully harnessed, so we set our sights on its untapped potential. The idea was to see if we could relieve the pressure that the high number of queries put on our database. Many of the queries were accessing the same data, so surely we could avoid recomputing the same results. We knew we needed an in-memory data store for this, so we had to choose between Memcached and Redis. While both systems store data in memory for fast retrieval, we favoured Redis because it stores data as specific data types while Memcached stores everything as strings. With how often data updates in our application, the reduced network overhead was very welcome.

Next, we had to decide how to integrate the caching system with our project. Django added Redis support to its caching framework in version 4.0, but our Django runs on an older version, so that was not available to us. Reinventing the wheel didn't make sense, so we looked into open-source libraries. One of our engineers initiated the use of the django-cacheops library. The library, also known as Cacheops, fits us like a glove, as it was built for Redis and caches Django's ORM queries.
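
For reference, a minimal Cacheops setup lives in Django settings. The sketch below shows the general shape; the Redis URL and values are illustrative, not our production configuration:

```python
# settings.py: a minimal django-cacheops setup (illustrative values)
INSTALLED_APPS = [
    # ... your apps ...
    "cacheops",
]

# Point Cacheops at the Redis instance that backs the cache.
CACHEOPS_REDIS = "redis://localhost:6379/1"
```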

Now that all the pieces were in place, we could start working on the implementation. Being on the cautious side, we divided the process into two phases, each of which had to go through a monitoring period in the staging and UAT environments:

  1. In the first phase, we chose a heavily queried table that was mostly safe from complications (writes or updates). The table was widely used across our services but remained quite isolated from complications thanks to its simplicity. The TTL targeted for this table was 2 weeks.
  2. The second phase would then involve our most query-intensive table, which serves over 12% of queries in a month. This table is interconnected with many parts of our application and sees hundreds of thousands of updates a day, so we settled on a 30-second TTL. (A sketch of what such a configuration might look like follows this list.)
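
To make the phases concrete, per-model rules in Cacheops take a set of operations and a timeout. A hypothetical configuration mirroring the two phases might look like this (the app and model names are made up, not our actual tables):

```python
# settings.py: hypothetical per-model TTLs mirroring the two phases
CACHEOPS = {
    # Phase 1: a widely read, rarely written table; long TTL (2 weeks).
    "app_one.stabletable": {"ops": "all", "timeout": 60 * 60 * 24 * 14},
    # Phase 2: a query-intensive, frequently updated table; short TTL (30 s).
    "app_two.hottable": {"ops": "all", "timeout": 30},
}
```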

Each phase went through a query check on Datadog and in our codebase, making sure no stone was left unturned (or in this case, no query).

Cache Invalidation

Caching is easy, but what about cache invalidation? That happens to be one of the hardest problems in computer science. It is important that data in the cache is invalidated as soon as the underlying storage changes. The concern was that, if not handled well, we could end up retrieving stale data from our cache layer and potentially breaking our business processes.

When you forget to import a module you're using in Python, at least the interpreter will throw an error. In caching, failing to invalidate stale data accumulates issues silently. It's difficult to track down because the program still works as intended, only on the wrong data. When invalidation is done wrong, even the most trivial bug becomes fiendishly difficult to discover.

Django provides two methods to update data (contrasted in the sketch after this list):

  • The .save() method writes a new record or updates an existing record of a model instance.
  • The .update() method is called on a queryset and can update multiple records at once.
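
To make the distinction concrete, here is a minimal sketch; the Loan model and status values are hypothetical, purely for illustration:

```python
from myapp.models import Loan  # hypothetical model

# .save(): updates a single instance and emits pre_save/post_save signals.
loan = Loan.objects.get(pk=42)
loan.status = "PAID_OFF"
loan.save()

# .update(): a single bulk SQL UPDATE; save() is never called
# and no signals fire.
Loan.objects.filter(status="OVERDUE").update(status="COLLECTION")
```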

The package worked as we wanted with the .save() method; however, it behaved unexpectedly with the .update() method. Quoting Django's documentation itself:

the update() method is converted directly to an SQL statement. It is a bulk operation for direct updates. It doesn’t run any save() methods on your models, or emit the pre_save or post_save signals.

Django has a signal dispatcher that notifies applications when an action occurs elsewhere within the framework. Signals let us hook into an event and run code before or after it happens. For example, if you want to create a user profile as soon as a user registers, you can write a post_save receiver that executes whenever the user model is saved.
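
As a sketch of that profile example (the Profile model and app path are assumptions for illustration):

```python
from django.contrib.auth.models import User
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Profile  # hypothetical model

@receiver(post_save, sender=User)
def create_profile(sender, instance, created, **kwargs):
    # Runs after every User save; create the profile only on first insert.
    if created:
        Profile.objects.create(user=instance)
```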

Cacheops' event-based invalidation relies on these signals. Do you see the problem? There is no way of knowing that data has been updated through the .update() method, since it emits no signal. We either had to ban the method entirely for every table with caching turned on, or invalidate the updated data ourselves. The good news is that the package provides an invalidated_update() method to replace Django's update(). Yay! That said, this method comes at the cost of one additional query, a cheap price to pay (+1 query is nothing considering the percentage of queries we'd shave off).
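
In practice the swap is mechanical; Cacheops also exposes helpers such as invalidate_model() for manual invalidation. A rough sketch, reusing the hypothetical Loan model from above:

```python
from cacheops import invalidate_model

from myapp.models import Loan  # hypothetical model

# Plain .update() leaves cached querysets stale; one blunt fix is to
# drop every cached queryset for the model afterwards.
Loan.objects.filter(status="OVERDUE").update(status="COLLECTION")
invalidate_model(Loan)

# invalidated_update() is finer-grained: it spends one extra query to
# find the affected rows, then invalidates just their cache entries.
Loan.objects.filter(status="OVERDUE").invalidated_update(status="COLLECTION")
```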

The Result?

The first phase was deployed to production on 29th September 2022. Using Grafana, we can compare our database CPU load between the week before caching and the week after: CPU usage decreased by a whopping 4.7% on average. At the same time, Datadog showed a reduction from around 100,000 SQL queries per hour to fewer than 5,000. Quite an improvement, isn't it?

[Figure: Database CPU load from 21–09–2022 to 29–09–2022 (before phase 1 deployment)]
[Figure: Database CPU load from 29–09–2022 to 07–10–2022 (after phase 1 deployment)]
[Figure: Query requests drop sharply after 20:00 on 29th September]

Finally, we deployed the second phase to production on the evening of 27th October 2022. Within three days of deployment, caching had further reduced the average database CPU load to 8.89%. We also saw the frequency of SQL queries drop by roughly 50%.

[Figure: Database CPU load from 27–10–2022 to 31–10–2022 (after phase 2 deployment)]
[Figure: Query requests reduction after the evening of 27th October 2022]

Takeaways

As you can see, caching with a fast system like Redis or Memcached can be a simple and effective strategy for scaling your database, but it should be approached carefully:

  • Choosing what to cache: if the cache hit ratio is not high enough (e.g. fast-changing data), caching may not be effective for that data.
  • Choosing an appropriate TTL: the time-to-live must be chosen carefully to reduce the chances of applications processing stale data. This can be mitigated if your framework or code ensures invalidation as soon as the underlying storage is updated (as pointed out above, this is easier said than done in some edge cases. Hint: “distributed systems are hard”).
  • Cache invalidation: if possible, ensure your code invalidates the cache on every data update. Invalidation is simply dropping the corresponding cache key, so that the next query for that data misses the cache, hits the DB, and is recached. If not handled, you are dependent on cache expiry (TTL) for invalidation, which can lead to nasty, hard-to-debug issues.

Hope you find this useful. Please share your own endeavours, learnings, or suggestions on caching.
