Caching Strategies in High Throughput Systems
Caching is a very powerful tool. In high throughput distributed systems, caching is frequently imperative. But adding a cache to your services comes with a cost and introduces a whole new set of problems. What's more, choosing the wrong caching strategy can cause serious problems in your app. In this post, I will review four different caching strategies, including the pros and cons of each, and hopefully give you some tools to decide which caching strategy is the most suitable for your service.
The Problem
You have implemented a service that, given a stock name, returns the stock price. At the beginning, you had around 100 requests a minute. For every request, you went to your SQL DB and queried the relevant stock. Luckily, you did a great job and your service became very popular. But now, some of your users are complaining about timeouts and very long loading times. Looking at your metrics, you realize you have 10,000,000 requests a minute and that querying the DB is no longer an option (it takes too much time no matter how good the indexes are). You decide to speed things up a bit by storing the stock data in your application memory. Congratulations! By doing so, you have just added a local cache to your service.
First Caching Strategy: Scheduled Preloaded Cache
Personally, I love this caching strategy, mainly because of how simple it is. In this strategy, we load all values into the cache once every X minutes. Obviously, getting the data from memory is much faster than querying the DB. So, problem solved! Now your stocks service can serve 10,000,000 requests a minute. Yes… but after a few weeks you realize that your service uses 10GB of memory just to store the stocks table. Users have also started complaining about the data freshness of your service: the prices they see are old and irrelevant.
The story above gives us a hint of the problems with this caching strategy, but I would like to start with its advantages. The first is its simplicity. This caching strategy is very easy to understand and implement. Personally, I think that in an environment like ours, where most things are so complicated, simplicity is to be prized. Another big advantage is the hit rate. Hit rate is an important measure to look at when implementing a cache. It is calculated as [requests answered from cache] / [total requests]. In a scheduled preloaded cache, the hit rate is 100%, as all requests get their result from the cache. (The next caching strategy I will present won’t have a hit rate of 100%.)
The disadvantages are quite significant, though, which makes this caching strategy suitable for only a handful of cases. The first issue we mentioned is the size of the data. We usually don’t want to hold 10GB of data in memory. Another issue is the startup time of the service. In a microservices environment, service startup time matters a lot. This caching strategy increases startup time drastically, as you have to load all the data in advance. On top of that, this strategy is irrelevant when you don’t know all possible values in advance.
The most important flaw of this caching strategy, though, is data freshness. The data in our cache is not fully synced with our DB, as changes are loaded into the cache only after X minutes. In systems like the stocks service, where prices change every few seconds, this caching strategy will frequently return old and irrelevant prices.
All in all, this strategy is great for data that does not change frequently, where data freshness is not that important, and where the size of the data is relatively small. I often use this cache for small tables with a finite number of values. For example, a “Country” table is a good candidate for this caching strategy.
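To make this concrete, here is a minimal Python sketch of a scheduled preloaded cache. The load_all_fn loader and the 5-minute interval are assumptions for illustration; in a real service the loader would run your actual DB query.

```python
import threading
import time

REFRESH_INTERVAL_SECONDS = 5 * 60  # "X minutes"; pick whatever staleness you can tolerate

class ScheduledPreloadedCache:
    """Holds an entire table in memory and refreshes it on a schedule."""

    def __init__(self, load_all_fn):
        self._load_all = load_all_fn     # e.g. a function running "SELECT name, price FROM stocks"
        self._lock = threading.Lock()
        self._cache = {}
        self.refresh()                   # startup blocks until the whole table is loaded
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def refresh(self):
        snapshot = self._load_all()      # one big DB query
        with self._lock:
            self._cache = snapshot       # swap in the fresh snapshot

    def _refresh_loop(self):
        while True:
            time.sleep(REFRESH_INTERVAL_SECONDS)
            self.refresh()

    def get(self, stock_name):
        with self._lock:
            return self._cache.get(stock_name)   # 100% hit rate: reads never touch the DB
```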
Pros:
- Simple to implement and understand.
- 100% hit rate.
Cons:
- Can consume a lot of memory from your service.
- Data freshness: DB and cache are synced only every X minutes.
- Increases your service's startup time, as you have to load all values before the service starts.
- If you don’t know all values in advance, you can’t use this caching strategy.
Second Caching Strategy: Read Through Cache
The read through caching strategy is the most common one. In this caching strategy, our service first searches for the value in the local cache; if the value exists, we return it to the user. When the value does not exist in the cache, we query it from the DB, return the result to the user, and save it to the cache. That way, the next user asking for that value will get it from the cache in no time.
Now, entries can’t live forever in our cache. There are two main reasons for this. First, we want to refresh an entry in case it has changed. Second, we don’t want to keep values that are no longer being requested in our cache. In order to address these issues, we need to define an eviction strategy. A very common eviction strategy is to choose a period of time after which a value is removed from the cache. This period of time is usually called TTL (Time To Live).
The read through caching strategy has many advantages over the previous strategy, the scheduled preloaded cache. First of all, read through caching is very versatile and can fit a large variety of cases, mainly because you can choose from many different strategies for evicting and populating entries. Let’s take our stocks cache, for instance: we probably won’t have the 10GB memory issue anymore, as we only cache stocks that are actually requested. (Unlike the previous cache, where we cached even stocks that were never requested.)
But let’s assume our cache takes 5GB of memory right now and we still think it’s too much. We can change our eviction strategy and limit our cache to 1GB. Now our eviction strategy becomes: a 5-minute TTL, and if we exceed 1GB, we remove the least requested entry to make room for the new key. There are no free lunches, of course; this change will impact our latency, as we will have a much lower hit rate. But the fact that you can find the optimal balance between memory consumption and latency is a great advantage. Another advantage of this caching strategy is the service startup time. Unlike the previous cache, we don’t increase the startup time, as our cache is empty on startup. Finally, another advantage is that we don’t have to know all the values in advance, again unlike the previous caching strategy.
This caching strategy, though, has two major flaws. The first is the hit rate. Some of our users will still have to go to the DB to get the data (when a value is requested for the first time or after its TTL has expired). The other major issue is data freshness. This type of cache, just like the previous one, does not solve our major issue of data freshness. If, for example, our TTL is 5 minutes, then in the worst case, some users will get a stock price that is almost 5 minutes old.
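As a sketch (assuming a simple in-process dict; query_stock_from_db is a hypothetical DB helper), a read through cache with a TTL and a size limit could look like this:

```python
import time
from collections import OrderedDict

class ReadThroughCache:
    """Looks in the cache first, falls back to the DB, and evicts by TTL and size."""

    def __init__(self, load_fn, ttl_seconds=300, max_entries=100_000):
        self._load = load_fn                  # e.g. query_stock_from_db
        self._ttl = ttl_seconds
        self._max_entries = max_entries
        self._entries = OrderedDict()         # key -> (value, expires_at), ordered by recency

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:
                self._entries.move_to_end(key)        # mark as recently used
                return value                          # cache hit
            del self._entries[key]                    # expired entry: drop it
        value = self._load(key)                       # cache miss: query the DB
        self._entries[key] = (value, time.time() + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)         # over the size limit: evict the least recently used key
        return value
```

With max_entries tuned down, you trade hit rate for memory, which is exactly the balance discussed above.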
Pros:
- Very versatile cache, can fit many cases.
- Enables control of the cache size.
- Will not impact the service uptime. (Service will start with empty caches.)
- You don’t have to know all the values in advance.
Cons:
- A bit harder to implement.
- Hit rate will always be lower than 100%. When not implemented well, the hit rate can drop significantly, which impacts response time.
- Data freshness: same problem as the previous cache; the freshness of the data is limited by the TTL.
Third Caching Strategy: Write Through Cache
This third caching strategy is a very strong one, and unfortunately it isn’t as well known as it deserves to be.
The write through caching strategy is best for cases where data freshness is very important. In this strategy, we change the way we write data in our system: in addition to writing the new stock price to the DB, we also write it to our cache. We still have the read through mechanism. We look for a value in the cache, and if it exists we return it. If it does not exist, we get it from the DB, return it to the user, and save it to the cache. But now, when a new stock price is written to our system, we also check whether it exists in our cache. If it does, we update the value in our cache as well. That way, our cache is always synced and looks exactly like our DB.
Although it sounds like we can drop the eviction strategy since our data stays updated, this is not so: we still need to define a TTL or another eviction strategy. After all, we don’t want to hold values that are no longer used in memory. Consider what would happen if your service runs for a year. With no eviction strategy, your cache would hold prices for stocks that don’t even exist anymore.
Just like the read through cache strategy, we get all the advantages of versatility, fast startup time and the ability to control the cache size. Unlike the read through cache, though, this caching strategy solves the data freshness problem beautifully. Our cache always looks exactly like our DB, thanks to the fact that we update the cached value every time we write data to our DB.
However, this superpower comes at a price. First of all, save and update actions in our system now take longer. It may sound like a small issue, but in services where writing to the DB is very frequent, this extra step costs CPU and memory, and leaves you with multiple edge cases to handle in order to keep the DB and cache synchronized. Besides that, we still have the lower hit rate disadvantage, just like the read through cache strategy.
Despite these shortcomings, I still think that this is the most suitable strategy for our stocks service — assuming data freshness is the most important feature in such a service and assuming that we have 10GB worth of stock data, which we can’t keep in memory in its entirety.
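A minimal sketch of the write path, assuming the same in-process dict approach and hypothetical query_stock_from_db / save_stock_to_db helpers standing in for the real DB calls:

```python
import time

class WriteThroughCache:
    """Read through reads, plus writes that update the DB and the cache together."""

    def __init__(self, load_fn, save_fn, ttl_seconds=300):
        self._load = load_fn                  # e.g. query_stock_from_db
        self._save = save_fn                  # e.g. save_stock_to_db
        self._ttl = ttl_seconds
        self._entries = {}                    # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and time.time() < entry[1]:
            return entry[0]                               # fresh hit
        value = self._load(key)                           # miss or expired: query the DB
        self._entries[key] = (value, time.time() + self._ttl)
        return value

    def put(self, key, value):
        self._save(key, value)                            # write to the DB first
        if key in self._entries:                          # keep already-cached keys in sync with the DB
            self._entries[key] = (value, time.time() + self._ttl)
```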
Pros:
- Data freshness: Local Cache and DB are always synchronized.
- Very versatile cache, can fit many cases.
- Enables control of cache size.
- Will not impact the service startup time. (Service will start with an empty cache.)
- You don’t have to know all the values in advance.
Cons:
- Hit rate will always be lower than 100%.
- Write actions in your service take longer and consume more CPU.
- Can be complicated to implement, and keeping the cache and DB synchronized adds maintenance burden.
Fourth Caching Strategy: Preloaded Write Through Cache
The preloaded write through caching strategy, just like the write through caching strategy, is best for cases where data freshness is very important. But in order to use this strategy, we also need to make sure we can hold all our values in memory. Just as with the regular write through cache, with this strategy we also change the way we write data in our system: in addition to writing the new stock price to the DB, we also write it to our cache.
There is another important step, of course: on service startup, we load all existing data from the DB. After that, we keep the cache synchronized with the DB by updating the cache every time we delete, create or update a stock. With this mechanism, we can drop the scheduled refresh every X minutes, as we are already keeping the cache updated and synced with the DB.
Just like the scheduled preloaded cache strategy, one of the major advantages is the hit rate. We have a 100% hit rate because we hold the whole table in the cache. We also solve the data freshness problem, just like in the write through caching strategy.
The disadvantages, however, are quite significant. Here we also pay the price in CPU, memory and complexity of the write through mechanism. Besides that, this caching strategy has the same issues of high memory consumption and long service startup time, as we hold the whole table in memory. For that reason, this cache is useful only in very specific cases.
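A minimal sketch, again with hypothetical DB helpers (load_all_stock_prices, save_stock_to_db, delete_stock_from_db) standing in for the real queries:

```python
class PreloadedWriteThroughCache:
    """Loads the whole table on startup and mirrors every write and delete into the cache."""

    def __init__(self, load_all_fn, save_fn, delete_fn):
        self._save = save_fn              # e.g. save_stock_to_db
        self._delete = delete_fn          # e.g. delete_stock_from_db
        self._cache = load_all_fn()       # startup blocks until the whole table is in memory

    def get(self, key):
        return self._cache.get(key)       # 100% hit rate: reads never touch the DB

    def put(self, key, value):
        self._save(key, value)            # write to the DB first
        self._cache[key] = value          # then keep the cache in sync

    def delete(self, key):
        self._delete(key)
        self._cache.pop(key, None)        # deleted stocks don't linger in memory
```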
Pros:
- 100% hit rate.
- Data freshness: Local Cache and DB are always synchronized.
Cons:
- Can consume a lot of memory from your service.
- Increases your service's startup time, as you have to load all values before the service starts.
- Write, update and delete actions in your service take longer and consume more CPU.
- Complicates the simple mechanism of the scheduled preloaded cache.
Distributed Cache vs Local Cache
I will start by saying that this topic could fill a blog post all on its own. But I feel it is important to mention here, because so far we have only discussed local caches, and things are a little different with a distributed cache.
First of all, it’s important to understand that all the caching strategies mentioned above can be used with both a local cache and a distributed cache. The main difference is the cache’s location. With a local cache, the machine that runs your service is the same one that holds the cache (often inside your program’s own allocated RAM). A distributed cache, by contrast, is usually another machine with a lot of memory, some CPU, and a very small program that is only responsible for saving data to memory and reading it back. Side note: the next evolution of the distributed cache was the in-memory DB, such as Redis, which stores richer data structures and answers queries much like a regular DB.
Now, it may seem a bit odd to keep the cache on another machine. It will surely be slower than keeping it on your local machine, as we have now added network overhead: we have to send a request over some communication protocol (UDP/TCP) to another machine and wait for its response.
Distributed cache really does not make a lot of sense if your service is running on a single instance. But in reality, most services don’t have such a simple architecture.
Usually our services have multiple instances on multiple machines, with a load balancer that routes traffic between them. In such cases, assuming we have 20 instances, saving 10GB of stocks data on each instance is a big problem. A good solution is creating a single instance with, say, 30GB of RAM that takes on the role of a distributed cache and serves all our instances. Although getting data from a distributed cache is slower than from a local cache, it’s much faster than getting it from the DB.
Another great advantage of a distributed cache compared to a local cache is that the cache stays warm across service restarts. Because the cache lives on another machine, it remains warm even if you deploy a new version of your service or add new instances due to a spike in traffic.
You can even combine a distributed cache and a local cache in your systems. For example, create a distributed cache with a preloaded write through caching strategy that holds the whole stocks table. On your service instances, create a local cache with a read through caching strategy that holds up to 1GB of the most requested stocks. That way, the most requested stocks are served from your local cache, and cache misses go to the distributed cache instead of your DB, which also saves you some precious milliseconds.
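As a rough sketch of that combination (assuming the redis-py client and a hypothetical "stock:<name>" key convention; the distributed cache itself is assumed to be kept in sync elsewhere with a preloaded write through strategy):

```python
import time
import redis   # assumes the standard redis-py client

class TwoLevelStockCache:
    """A small local read through cache in front of a shared Redis distributed cache."""

    def __init__(self, redis_host="localhost", local_ttl_seconds=60):
        self._remote = redis.Redis(host=redis_host, decode_responses=True)
        self._local = {}                                   # stock name -> (price, expires_at)
        self._local_ttl = local_ttl_seconds

    def get_price(self, name):
        entry = self._local.get(name)
        if entry is not None and time.time() < entry[1]:
            return entry[0]                                # level 1: in-process memory, no network hop
        raw = self._remote.get(f"stock:{name}")            # level 2: the distributed cache
        if raw is None:
            return None                                    # the preloaded distributed cache holds everything, so a miss means an unknown stock
        price = float(raw)
        self._local[name] = (price, time.time() + self._local_ttl)
        return price
```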
In Conclusion
I know it’s a bit of a cliché, but there is no single caching strategy that is the best one. You should always design the cache that best suits your service’s needs. Most of the time, that’s a lot harder than it sounds, as you will have conflicting needs. The art is to find the ‘good-enough’ balance. I hope this blog post has given you some deeper knowledge about caching strategies, and the tools you need to properly compare and design the best caching strategy for your specific service.