Improving database performance with Redis

The problem

Scaling relational databases is a big issue. Because of the need for joins, it is hard to partition data across nodes and run a cluster of database servers, so we are in some sense forced to run the relational database on a single node. Most web applications today are backed by Postgres or MySQL, which are excellent databases and can support a large number of concurrent requests. However, these databases have their limits and cannot be scaled to handle a very large load, like that of Facebook or Google.

There are some optimizations that can be done at the query level to improve database performance. For instance, a query can be framed to avoid unnecessary lookups and joins that would otherwise load the database. This won't really help in scaling the application, but it will certainly improve performance, which means the application can support some more users (I'd call that a performance improvement rather than scaling).
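For example, Django's select_related is one such query-level optimization. The sketch below assumes hypothetical Book and Author models, where Book has a ForeignKey named author:

# N+1 pattern: one query for the books, plus one extra query per author
for book in Book.objects.all():
    print(book.author.name)

# select_related fetches the authors in the same query with a JOIN
for book in Book.objects.select_related('author'):
    print(book.author.name)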

There is an important concept which lies at the heart of computer science: caching. Caching has always been helpful in improving performance, and it happens nearly everywhere

  1. RAM is in some sense a cache for hard disk/SSD
  2. L3 cache is a cache for RAM
  3. Your browser caches the web pages that you visit
  4. DNS servers cache the DNS entries for faster resolution
  5. Web servers cache the web pages that they serve
  6. Dynamic programming is essentially caching (see the sketch below)
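As a tiny illustration of that last point, memoization in Python is just caching of function results:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # each subproblem is computed once, then served from the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # fast, because intermediate results are cached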

There are a lot more examples where caching has been a great way to improve performance (note that we are talking about improving performance and not scaling). We can do the same for web applications.

From the original docs: "Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker." Here we will talk about Redis as a cache.

Previously, Memcached was fairly common (even Quora uses Memcached). But now a large number of web applications use Redis as a cache, and the Stack Overflow community recommends Redis for anything new.

How to use it?

Downloading and starting a Redis server is described nicely in the official docs. Redis listens on port 6379 by default.
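Once the server is running, you can sanity-check the connection from Python; a minimal sketch assuming the redis-py client (pip install redis) is installed:

import redis

# connect to the default local Redis instance
r = redis.Redis(host='127.0.0.1', port=6379)
print(r.ping())  # True if the server is reachable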

Web applications have cache configuration settings that can be changed to use Redis as a cache. Here is how you would do that in Django (in the settings.py file), using the django-redis package:

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        }
    }
}

Once the cache has been set up, using Redis to cache data items is fairly simple. Redis is a key-value store: to store a value, we associate it with a key and store it in Redis.
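With the configuration above, Django's standard cache API transparently talks to Redis; a quick sketch:

from django.core.cache import cache

cache.set('greeting', 'hello', timeout=300)  # keep the key for 5 minutes
value = cache.get('greeting')                # 'hello', or None on a cache miss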

For instance, if we have a model named TestModel and we want to cache a particular query, we can do it as follows (of course inside some view):

from django.core.cache import cache

# ... inside the view
objects = cache.get('all')        # one round trip instead of two
if objects is None:               # cache miss: query the database
    objects = TestModel.objects.all()
    cache.set('all', objects)     # populate the cache for next time

We first try to find the key named `all` in the cache (which is backed by Redis). If we find it, we use the cached value; otherwise we query the database and populate the cache.

This can help in improving the server performance significantly when the cache hit ratio is good.

There is a shortcut for the above case:

objects = cache.get_or_set('all', TestModel.objects.all())

As the name suggests, the `get_or_set` method tries to find the key named 'all' in the cache and, if it's not found, stores the key with the value `TestModel.objects.all()`.
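Note that `get_or_set` also accepts an optional timeout, and the default may be a callable that is invoked only on a cache miss; a sketch with an assumed 5-minute timeout:

objects = cache.get_or_set('all', lambda: list(TestModel.objects.all()), timeout=300)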

Redis is a lot more powerful and can store lists as well as sets, which makes life really easy. A detailed list of the data types supported by Redis can be found in the official documentation.
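For instance, with the redis-py client from earlier, lists and sets are first-class operations (the key names here are just illustrative):

import redis

r = redis.Redis(host='127.0.0.1', port=6379)
r.lpush('recent_views', 42, 7)    # push onto a Redis list
r.sadd('tags', 'redis', 'cache')  # add members to a Redis set
print(r.lrange('recent_views', 0, -1), r.smembers('tags'))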

Problems

Problems arise when writes are issued. For instance, suppose we insert a new entry into our TestModel table. The Postgres database (on hard disk/SSD) will contain one extra row (the newly inserted entry); however, the Redis cache (in RAM) won't get updated with this new row, so we now have a stale cache. Note that our database of record is Postgres or MySQL; we are using Redis simply as a cache.

Solution

The solution is fairly simple: whenever a write is issued, all the caches associated with that particular model (table) are invalidated (marked as stale). The following simple solution works in Django for our case:

from django.core.cache import cache
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

@receiver(post_save, sender=TestModel, dispatch_uid="Write issued")
@receiver(post_delete, sender=TestModel, dispatch_uid="Write issued")
def invalidate_cache(sender, instance, **kwargs):
    cache.delete('all')

The above code is very Django-specific, so it needs some explanation. We are using Django signals to achieve our goal: invalidating the cache when a write is issued on a model. When `save` is called on a model instance (a row of the table is updated or a new row is added), or when `delete` is called on a model instance (a row of the table is deleted), a signal is sent and the `invalidate_cache` function is called, deleting the key 'all' from the cache. Further lookups will therefore result in a cache miss (effectively sending a query to Postgres).
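For these receivers to fire, the module containing them must be imported at startup; the usual place is the app's AppConfig.ready() (the app and module names below are hypothetical):

# yourapp/apps.py
from django.apps import AppConfig

class YourAppConfig(AppConfig):
    name = 'yourapp'

    def ready(self):
        # importing the module registers the @receiver handlers
        from . import signals  # noqa: F401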

Numbers

I performed some experiments with Redis to get some numbers. A Postgres server was set up alongside a Redis server, and a table with 100,000 rows was created in Postgres. The loadtest tool was then used to send several queries to the server that effectively resulted in a full table scan. The numbers were impressive, and Redis did help in improving performance by a good factor:

Without Redis: 0.98 seconds
With Redis: 0.005 seconds

Conclusion

Besides query optimization, Redis caching can be a good way to reduce the database server's load and improve application performance.