Using Azure Cosmos DB as your persistent, geo-replicated, distributed cache for ASP.NET Core

A typical pattern in a highly scalable distributed API is to keep your cache servers as close as possible to your API boxes, in order to minimise read latency. If you are on Azure and you are using Azure Redis Cache, I’m sure you’ve adopted a design similar to the following one, in which every API consumes a local Redis instance that is kept up to date by a Cache Pre-Loader.

A distributed scenario with individual read caches and their pre-loaders

However, keeping all those cache servers in sync is anything but trivial: pre-loaders might fail, the database might not always be available, and you could end up with individual caches holding different versions of the data, which might result in inconsistencies for your users.

The simplest option is to have a single read/write instance and leverage a geo-replication process to extend the data to a number of read-only regions.

Azure Redis Cache does have this capability in its premium tier; however, it comes with a few limitations: you can only link 2 regions, and it doesn’t work together with data persistence.

The good news is that we might have a good alternative :)

Azure Cosmos DB as your caching service

Azure Cosmos DB is the new serverless, geo-distributed, multi-model database available on Azure. One of its good aspects is that it has all the technical features that we lack in Redis for the previously described scenario:

  1. It’s a database, and therefore it’s persistent. It seems trivial to say, but it can’t be taken for granted in a cache server such as Redis.
  2. It has a global presence. You can think about Azure Cosmos DB as a de-centralised server, which leverages geo-replication to have an endpoint close to your API server’s door. More specifically, it works with a single read/write region and multiple read regions.
  3. It supports automatic eviction of items, given a TTL.
  4. It’s friggin’ fast :) with read times typically <10ms (more on this later!)
  5. It’s very scalable, as it can easily handle thousands of requests per second.

So, while working with a client who had such issues, I decided to consider Azure Cosmos DB as a replacement for Redis. Thanks to that, the diagram we’ve seen before becomes much simpler:

A distributed scenario with individual read databases in replica and only one Read/Write region

In this configuration, we only have the burden of populating the primary region, which is the only one that accepts writes. However, we still have one local read instance in each region, for performance reasons. The geo-replication functionality will make sure new data is seamlessly copied across all of them.

In other words, just one pre-loader to manage, monitor and control, and we use an out-of-the-box functionality of Cosmos DB to make sure that all the read instances are in sync. It does sound good, doesn’t it?

Great, let’s set it up!

So, the first step is obviously to create an instance of Azure Cosmos DB, a database within it and a collection. There are plenty of tutorials on how to do it in the official Azure documentation, so I won’t repeat these concepts here.

Once that’s done, the next step is to make sure that our collection (I’ve called it cacheItems in this example) has the TTL functionality enabled:

TTL is disabled by default, go to Scale & Settings to enable it

In the above image, I’ve set it as “no default”: this enables the functionality, but documents won’t expire by default. This lets us set the TTL on each individual document, which is what we would expect as standard cache behaviour.

Note: it’s fair to say that, as explained in the documentation, enabling the TTL functionality doesn’t incur additional costs, as the eviction of expired items is free of charge.

How does it work? As you might know, every document in Cosmos DB has a _ts field, which is the last modified timestamp. Once the TTL has been enabled, the engine will look for a ttl field, which represents the Time to Live for the document, in seconds.

If that field is present, and the current timestamp is greater than the value of _ts + ttl, then the item is automatically deleted. You can easily give it a try from the portal: just create a very simple document, like the one below, and watch it disappear automatically after 10 seconds!

This document will automatically be evicted in 10 seconds from its last change
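
The eviction rule above can be sketched in a few lines of C#. This is only an illustrative model of the check described in the text, not the Cosmos DB engine's code; the TtlRule class and its parameter names are made up for this example.

```csharp
// Illustrative model of the TTL rule described above.
// TtlRule and its parameter names are hypothetical, not part of any Cosmos DB SDK.
public static class TtlRule
{
    // lastModifiedTs: value of the document's _ts field (seconds since the Unix epoch).
    // ttlSeconds: value of the document's ttl field, or null when the field is absent.
    // nowTs: current time, in seconds since the Unix epoch.
    public static bool IsExpired(long lastModifiedTs, int? ttlSeconds, long nowTs)
    {
        // With the collection set to "no default", a document without ttl never expires.
        if (ttlSeconds == null)
        {
            return false;
        }

        // Expired once the current timestamp is greater than _ts + ttl.
        return nowTs > lastModifiedTs + ttlSeconds.Value;
    }
}
```

For a document with ttl set to 10 that was last modified at _ts = 1000, TtlRule.IsExpired(1000, 10, 1010) is false and TtlRule.IsExpired(1000, 10, 1011) is true. Because _ts is the last modified timestamp, every update to the document restarts the countdown.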

Distributed cache and ASP.NET Core

ASP.NET Core comes with built-in support for distributed cache, via its IDistributedCache interface. There are a couple of implementations already available out-of-the-box, for Redis and SQL Server.
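
To give an idea of what consuming code looks like, here is a minimal sketch of a class that writes and reads through IDistributedCache. The CacheItemStore class and its method names are hypothetical; GetStringAsync, SetStringAsync and DistributedCacheEntryOptions come from the Microsoft.Extensions.Caching.Distributed namespace. The calling code stays the same whichever provider is registered.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Hypothetical wrapper around IDistributedCache, for illustration only.
public class CacheItemStore
{
    private readonly IDistributedCache _cache;

    public CacheItemStore(IDistributedCache cache)
    {
        _cache = cache;
    }

    // Used by a pre-loader: stores a value that expires after the given time to live.
    public Task StoreAsync(string key, string value, TimeSpan timeToLive)
    {
        var options = new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = timeToLive
        };
        return _cache.SetStringAsync(key, value, options);
    }

    // Used by the API: returns the cached value, or null when it is missing or expired.
    public Task<string> ReadAsync(string key)
    {
        return _cache.GetStringAsync(key);
    }
}
```

A Cosmos DB provider could translate that relative expiration into the ttl field of the stored document; whether the proof of concept discussed in this article does exactly that is best checked in its source code.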

In order to ease the integration with Cosmos DB, I’ve created a proof of concept of what such a provider would look like. It’s available on GitHub: feel free to use it, clone it and improve it (I might even make a NuGet package out of it, one day), but please remember that it’s not meant for production use. Don’t blame me if it doesn’t work :)

Its implementation is quite trivial: the core logic is within the CosmosDbCache object, which implements IDistributedCache and is responsible for storing and retrieving the items from Cosmos DB. Please go and have a look at the code.

In order to plug it into your ASP.NET Core application, I’ve created a configuration method called AddDistributedCosmosDbCache. You can invoke it from your Startup class in this way:

As you can see, among the various parameters, we can even specify the preferred read locations. This is a crucial aspect. Let’s assume, for example, that your application is deployed in 4 regions worldwide. You can easily set up Cosmos DB to be replicated in the same data centres, as in the screenshot below:

If we go back to the snippet we saw previously, we can see that we have configured US West as a preferred read location.

config.PreferredLocations.Add("US West");

Therefore, the client will always look for that instance, if available, when retrieving data. However, any write operation will occur against the UK South region, which is the only one configured as Write Region.
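
The routing behaviour just described can be modelled as follows. This is a deliberately simplified sketch, not the Cosmos DB SDK's code: the RegionRouting class is hypothetical, and falling back to the write region when no preferred location is available is an assumption of this example.

```csharp
using System.Collections.Generic;

// Simplified model of the routing described above; not the Cosmos DB SDK's code.
public static class RegionRouting
{
    // Reads go to the first preferred location that is currently available.
    // Assumption of this sketch: fall back to the write region when none of them is.
    public static string PickReadRegion(
        IEnumerable<string> preferredLocations,
        ISet<string> availableRegions,
        string writeRegion)
    {
        foreach (string region in preferredLocations)
        {
            if (availableRegions.Contains(region))
            {
                return region;
            }
        }
        return writeRegion;
    }

    // Writes always go to the single write region, whatever the preferences are.
    public static string PickWriteRegion(string writeRegion)
    {
        return writeRegion;
    }
}
```

With US West as the only preferred location and UK South as the write region, reads are served by US West for as long as it is available, while every write is sent to UK South.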

We’ve achieved what we were aiming for at the beginning of the article: we populate the cache once, and leverage its built-in replication to make sure that data changes eventually propagate to all the other regions involved. At the same time, we get the lowest possible latency because we are reading from the local instance.

Some thoughts about costs and performance

The system we have designed is obviously eventually consistent. Replication doesn’t happen instantaneously and there’s always some latency to take into account. We can monitor it through the Consistency section of the Metrics blade.

Some replication metrics between UK South and East US

As you can see, the replication latency is in the order of tens of milliseconds, which is probably acceptable for any system whose data is static enough to be suitable for caching.

What about the read performance, though?

Well, my personal experience with Azure Cosmos DB is that its read latency is comparable to Redis’s performance: 2–3ms for a document of ~6KB.

Moreover, it’s guaranteed by an SLA of <10ms per lookup read (a.k.a. read by Id, which is exactly the case we’re dealing with here) for a document of less than 1KB.

In terms of cost and capacity, it all depends on a number of factors: the number of requests per second we want to handle, as well as the size of the documents we are going to store.

At the end of the day, it’s all about the quota of Request Units per second (RU/sec) that we are willing to purchase. As a rough estimate, let’s consider that:

  • If a document is less than 1KB, retrieving it by its ID costs 1RU. However, the cost doesn’t scale linearly with size: reading a 64KB document costs only 10RUs.
  • There’s a minimum of 400RU/sec for a simple collection, which brings us to ~350reads/sec for small documents, if we keep some margin for the writes. This costs ~£17.28/month. Obviously this cost is per replica.
  • A Redis P1 instance costs the equivalent of ~6900RU/sec.
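
The arithmetic behind these figures can be sketched as follows. RuEstimate is a hypothetical helper, and the 50RU/sec set aside for writes is an assumption inferred from the gap between the 400RU/sec minimum and the ~350 reads/sec quoted above.

```csharp
// Back-of-the-envelope capacity and cost arithmetic; RuEstimate is a hypothetical helper.
public static class RuEstimate
{
    // Reads per second that fit in the provisioned throughput once some RU/sec
    // have been set aside for writes (integer division rounds down).
    public static int ReadsPerSecond(int provisionedRuPerSec, int reservedForWritesRuPerSec, int ruPerRead)
    {
        return (provisionedRuPerSec - reservedForWritesRuPerSec) / ruPerRead;
    }

    // Total monthly cost, given that the throughput is paid for on every replica.
    public static decimal MonthlyCost(decimal costPerReplica, int replicas)
    {
        return costPerReplica * replicas;
    }
}
```

With these numbers, RuEstimate.ReadsPerSecond(400, 50, 1) gives 350 reads/sec for documents under 1KB, and RuEstimate.ReadsPerSecond(400, 50, 10) gives 35 reads/sec for 64KB documents. For a four-region deployment, RuEstimate.MonthlyCost(17.28m, 4) comes to 69.12, i.e. roughly £69 per month.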

Every context is different, but I’ve started considering it for a number of use cases and the cost is mostly comparable to Redis premium, with the advantage of fewer headaches in a geo-distributed scenario. Microsoft provides a request unit calculator to help you determine the optimal capacity for your collection.

Wrapping up

When working in a geographically distributed system, keeping all the caches in sync poses several challenges. Unfortunately Redis — even in its premium tier — has limited replication capabilities.

In this article we’ve explored the possibility of using Azure Cosmos DB instead. This new multi-purpose database has way more flexibility when it comes to low-latency geo-replication, providing the chance of selecting multiple read-regions which are seamlessly kept in sync.

We’ve provided a proof of concept of a provider which integrates with ASP.NET Core and its support for distributed cache.

Read latency is generally below 10ms, which is in line with the expectations for a cache layer. Costs are affected by a number of factors, but are generally in line with, if not cheaper than, a Redis Premium tier.

Marco De Sanctis is a tech entrepreneur and technology lover, based in London.

He’s a freelance consultant in IT, with 15 years of experience, and he’s been awarded as Microsoft Most Valuable Professional for the last 8 years. He’s also a book author, trainer, mentor and customary speaker at tech conferences.

His skills range from a very strong tech background in C#, ASP.NET, the whole Microsoft stack and Microsoft Azure Cloud infrastructure, to Solution Architecture, project and business management. His fields of interest currently include Docker, Kubernetes, and highly scalable microservices architectures.

In 2015 he founded Cloud Consult London Ltd, a technology firm specialised in solution architecture for cloud-based systems.