The A-Z guide to Distributed Caching

Haasita Pinnepu
Feb 7, 2023

Distributed caching is a technique widely used across industries and applications, from e-commerce websites to social media platforms. So what is it, and how does it work? Let’s find out!

Before getting into distributed caching, let’s get a quick insight into caching itself and why we even use it in the first place.

1. What is Caching? How does it work?

Caching helps systems work more efficiently by storing frequently used data in a fast-access location (such as RAM), which reduces the time and resources required to retrieve it compared to fetching it from slower storage, such as a network server or a database.

How caching works

The cache is usually managed by a caching engine or software, which determines what data should be stored in the cache, when it should be removed, and how it should be retrieved. The caching engine also typically uses algorithms to determine which data is the most frequently used and should be given priority in the cache.
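The idea can be illustrated with Python’s built-in functools.lru_cache, which keeps the most recently used results in memory and evicts the least recently used ones once the cache is full. This is a minimal, illustrative sketch; the function and its cost are hypothetical.

```python
from functools import lru_cache
import time

@lru_cache(maxsize=1024)  # keep at most 1024 results; least recently used entries are evicted first
def load_user_profile(user_id: int) -> dict:
    # Hypothetical slow lookup standing in for a database or network call.
    time.sleep(0.1)
    return {"id": user_id, "name": f"user-{user_id}"}

# The first call pays the full cost; repeated calls are served from memory.
load_user_profile(42)
load_user_profile(42)
print(load_user_profile.cache_info())  # hits=1, misses=1, currsize=1
```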

And why do we use caching in web applications?

First, it reduces application latency significantly. Simply put, the application doesn’t need to read from the hard drive when the user requests data, since the frequently accessed data is already stored in RAM. This makes application response times faster.

Second, it intercepts user data requests before they reach the database, so the database is hit with far fewer requests, making the application as a whole more performant. Third, this also helps bring down application running costs, since database read/write operations are expensive.

Now that we have an idea of what caching is and how it works, let us jump into distributed caching.

2. What is Distributed Caching?

Distributed caching is a technique for improving the performance of access to data by storing frequently accessed data in memory across multiple nodes in a network. It basically maintains multiple copies across different systems in the network.

For a small, predictable number of preferably immutable objects that have to be read multiple times, an in-process cache is a good solution. However, for cases in which the number of objects that can be or should be cached is unpredictable and large, and consistency of reads is a must-have, a distributed cache is perhaps a better solution. It goes without saying that an application can use both schemes for different types of objects depending on what suits the scenario best.

Distributed caching is widely used in industry today because it can scale on demand and remain highly available.

The big internet services of today depend on scalability, high availability, and fault tolerance; businesses cannot afford to have their services go offline. A distributed cache spreads its data across multiple nodes, so other nodes can keep serving the data if one goes down.

3. How does Distributed Caching work? — The Architecture

A distributed caching architecture typically consists of multiple cache nodes that are connected to each other and store a portion of the cached data. These nodes communicate with each other to keep the data consistent and to handle data replication and distribution.

The internal workings of a distributed cache can vary depending on the implementation, but the general process can be described as follows:

  1. Client requests: A client sends a request for data to the cache.
  2. Cache lookup: The cache node receiving the request first checks its local cache to see if the requested data is available.
  3. Data retrieval: If the data is available in the cache, it is returned to the client. If not, the cache node retrieves the data from the database and stores it in the local cache.
  4. Data replication: The cache node then replicates the newly acquired data to other cache nodes in the system to ensure consistency.
  5. Cache invalidation: The cache periodically checks for stale or outdated data and removes it from the cache. This ensures that the cache remains up-to-date with the latest data.
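The flow above can be sketched in a few lines of Python. The cache, database, and replication calls here are stand-ins rather than a real client library; the point is the order of operations: local lookup first, then the database, then replication to peers.

```python
def get(key, local_cache, database, peer_nodes):
    # 1-2. Client request and local cache lookup.
    if key in local_cache:
        return local_cache[key]

    # 3. Cache miss: fall back to the database and populate the local cache.
    value = database.fetch(key)          # hypothetical database client
    local_cache[key] = value

    # 4. Replicate the newly cached entry to the other cache nodes.
    for node in peer_nodes:
        node.replicate(key, value)       # hypothetical replication call

    # 5. Invalidation of stale entries happens separately, e.g. on a timer.
    return value
```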

There are only two hard things in Computer Science: cache invalidation and naming things.

— an unfunny famous saying

Cache invalidation in distributed systems is difficult for several reasons:

  1. Data Consistency: Maintaining consistency between cache and data store can be challenging, especially when there are multiple instances of cache and data store running in a distributed environment.
  2. Network Latency: Invalidating cached data across a large number of cache instances can take time and may be impacted by network latency.
  3. Concurrent Access: Cached data may be accessed and updated by multiple clients concurrently, making it difficult to determine when it is safe to invalidate the cache.
  4. Scalability: As the number of cache instances grows, it becomes increasingly difficult to manage invalidation messages and ensure that all instances have the latest version of the data.
  5. Complexity: Implementing cache invalidation in a distributed system requires a deep understanding of the system architecture, network topology, and the algorithms used for cache management.

Handling cached data

To overcome these challenges, sophisticated algorithms and protocols are used to manage cache invalidation in distributed systems.

Cache invalidation algorithms or cache eviction policies are used to determine when cached data should be removed from the cache. Some common cache invalidation algorithms are:

  1. Time-based Invalidation: This algorithm invalidates the cache after a certain period of time has elapsed since the data was cached.
  2. Version-based Invalidation: This algorithm assigns a version number to each cached item and invalidates the cache whenever the version number changes.
  3. Delta-based Invalidation: This algorithm compares the current version of an item in the cache with the latest version available on the server, and invalidates the cache if the delta between the two versions exceeds a certain threshold.
  4. Least Recently Used (LRU) Invalidation: This algorithm invalidates the cache based on the last time an item was accessed. The item that has not been accessed for the longest period of time is evicted from the cache when it reaches its maximum size.
  5. Most Recently Used (MRU) Invalidation: This algorithm invalidates the cache based on the last time an item was accessed. The item that was most recently accessed is evicted from the cache when it reaches its maximum size.
  6. Least Frequently Used (LFU) Invalidation: This algorithm invalidates the cache based on how frequently an item is used. The item that is used the least is evicted from the cache when it reaches its maximum size.
  7. Hybrid Invalidation: This algorithm combines the advantages of multiple invalidation algorithms to provide better performance. For example, it might use time-based invalidation for items that change infrequently, and LRU invalidation for items that change frequently.
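As a concrete example, an LRU cache with time-based expiry (items 1 and 4 above) can be built on Python’s OrderedDict. This is an illustrative single-node sketch, not a distributed implementation.

```python
import time
from collections import OrderedDict

class LRUCacheWithTTL:
    def __init__(self, max_size=128, ttl_seconds=60):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.store = OrderedDict()  # key -> (value, expiry timestamp)

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.time() > expires_at:         # time-based invalidation
            del self.store[key]
            return None
        self.store.move_to_end(key)          # mark as most recently used
        return value

    def put(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)
        self.store.move_to_end(key)
        if len(self.store) > self.max_size:  # LRU eviction when full
            self.store.popitem(last=False)   # drop the least recently used entry
```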

The choice of cache invalidation algorithm depends on the specific requirements of the system, such as the rate of change of the data, the size of the cache, and the trade-off between cache hit rate and network overhead.

Let us look into the applications of these algorithms and how they are modified according to requirements.

4. Distributed Caching Systems — The Applications

Distributed caching systems are widely used in large scale web applications to improve performance and scalability.

distributed cache

Some real-world applications of distributed cache systems include:

  1. E-commerce websites: Large e-commerce websites use distributed cache systems to store product catalogs, user sessions, and shopping carts to reduce the load on their databases and improve page load times.
  2. Social media platforms: Social media platforms use distributed cache systems to store user profiles, timelines, and feeds to handle high traffic spikes and ensure low latency for users.
  3. Gaming platforms: Gaming platforms use distributed cache systems to store game states, leaderboards, and in-game items to handle high-concurrency game play and ensure low latency for users.
  4. Financial services: Financial services use distributed cache systems to store financial data, such as stock prices, exchange rates, and portfolio information to handle high-frequency data updates and provide real-time data access to users.

There are several open-source distributed caching systems available for free, such as Memcached and Redis, which can be easily integrated into an application.

Now, let us look into how real-world large scale applications use these systems.

4.1. Scaling MEMCACHE at Facebook

Memcached is a well-known, simple, in-memory key-value store that is widely used to cache data in web applications.

Data is stored in Memcached as key-value pairs in memory. The key is used to identify the data and the value is the actual data being stored. The key-value pairs are stored in a hash table, allowing for fast access to data based on the key.

memcached architecture

When data is stored in Memcached, it is stored in memory, providing fast access to data and high performance. If the cache reaches its maximum size, the least recently used (LRU) eviction policy is used to determine which data to remove from memory.

The workload Memcached serves is read-intensive: cached data is typically read many times, which lets Memcached act as a layer of abstraction between the database and the front end of the platform.
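In practice, an application talks to Memcached through a client library. The sketch below uses pymemcache, one common Python client; the host, port, key name, and expiry are placeholder values.

```python
from pymemcache.client.base import Client

# Connect to a Memcached server (placeholder address).
client = Client(("localhost", 11211))

# Store a value with a 300-second expiry, then read it back by key.
client.set("user:42:profile", b'{"id": 42, "name": "user-42"}', expire=300)
profile = client.get("user:42:profile")   # returns None if the key is missing or expired
client.delete("user:42:profile")          # explicit invalidation
```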

The simplicity and efficiency of this storage mechanism make Memcached a popular choice for large-scale web applications looking to improve performance and scalability. One such example is Facebook.

Let us see how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world’s largest social network.

Here’s a high-level overview of how Memcached works at Facebook:

  1. Request: When a user makes a request on the Facebook platform, the request is first sent to the cache layer.
  2. Cache lookup: The cache layer checks if the requested data is stored in Memcached. If the data is found, it is returned to the user, otherwise, the request is passed to the database.
  3. Database query: If the data is not found in Memcached, it is retrieved from the database and stored in Memcached for future use.
  4. Response: The data is returned to the user as a response to their request.
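A simplified sketch of this read path is shown below, again using pymemcache and a hypothetical database query function; the cache key, TTL, and data format are placeholders, not Facebook’s actual schema.

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_profile(user_id, db):
    key = f"profile:{user_id}"
    cached = cache.get(key)                          # 1-2. check the cache layer first
    if cached is not None:
        return json.loads(cached)

    profile = db.query_profile(user_id)              # 3. miss: hit the database (hypothetical call)
    cache.set(key, json.dumps(profile), expire=600)  # store the result for future requests
    return profile                                   # 4. respond to the user
```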

By serving as a cache layer, Memcached helps to reduce the load on the database and improve the performance of the Facebook platform.

layer of abstraction

Facebook has made several modifications to the standard Memcached architecture to better suit its needs:

  1. Custom network protocol: Facebook has developed a custom network protocol to communicate with Memcached, allowing it to better handle high volumes of traffic and improve performance.
  2. Large cluster size: Facebook operates one of the largest Memcached clusters in the world, with thousands of cache nodes to handle the high volume of traffic on its platform.
  3. Custom eviction policies: Facebook has implemented custom eviction policies, such as the Adaptive Replacement Cache (ARC) policy, to improve the efficiency of its cache and reduce evictions.
  4. Consistent hashing: Facebook uses consistent hashing to distribute data across its Memcached nodes, allowing it to scale horizontally and handle increased traffic.
  5. Monitoring and automation: Facebook has built a robust monitoring and automation system to manage its Memcached cluster, allowing it to quickly detect and resolve issues.
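Consistent hashing (point 4 above) can be illustrated with a minimal hash ring: each node is hashed onto a ring, and a key is assigned to the first node clockwise from the key’s hash, so adding or removing a node only remaps a small fraction of keys. This is a generic illustration, not Facebook’s implementation; the node names are placeholders.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        # Each physical node gets several virtual points on the ring to smooth the distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Find the first ring position clockwise from the key's hash.
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42:profile"))  # the cache node responsible for this key
```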

By making these modifications, Facebook has been able to optimize its use of Memcached to handle the high volume of traffic on its platform and provide low latency experiences for its users. The modifications have also allowed Facebook to scale its caching infrastructure to meet the demands of its rapidly growing user base.

4.2. Redis — A case study

Redis is another open-source, in-memory data store with a simple yet flexible architecture that consists of the following components:

  1. In-memory data store: Redis stores all data in memory, providing fast access to data and high performance. Redis can also persist data to disk for durability.
  2. Data structures: Redis supports multiple data structures, including strings, hashes, lists, sets, and sorted sets, allowing it to be used for a wide range of use cases.
  3. Pub/Sub messaging: Redis provides a pub/sub messaging system that allows clients to subscribe to channels and receive messages in real-time.
  4. Lua scripting: Redis supports Lua scripting, allowing users to execute custom logic within Redis and perform complex operations on data.
  5. Clustering: Redis supports clustering, allowing users to distribute data across multiple nodes for increased scalability and reliability.

Redis is capable of supporting billions of key-value pairs with low latency and high throughput. It is a highly scalable and reliable in-memory storage system suitable for large-scale web applications.
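A few of these features are shown below through redis-py, the standard Python client; the keys and values are placeholders, and a local Redis server is assumed.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Strings with expiry: a classic cache entry.
r.set("page:/home", "<html>...</html>", ex=60)

# Hashes: store a user profile as a field/value map.
r.hset("user:42", mapping={"name": "user-42", "followers": "1200"})

# Sorted sets: a leaderboard ordered by score.
r.zadd("leaderboard", {"alice": 3100, "bob": 2800})
top = r.zrevrange("leaderboard", 0, 9, withscores=True)

# Pub/Sub: push a real-time message to subscribers of a channel.
r.publish("notifications", "new follower for user:42")
```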

One example of a large scale application that uses Redis is Twitter.

twitter system design

Twitter uses Redis in several ways to improve its performance and scalability:

  1. Caching user timelines: Twitter caches the timelines of its users in Redis to reduce the load on its databases and ensure low latency for users.
  2. Storing real-time data: Twitter uses Redis to store real-time data, such as trending topics, in-memory to provide real-time insights to its users.
  3. Session management: Twitter uses Redis to store user sessions, allowing it to scale horizontally and ensure high availability for its users.
  4. Caching API responses: Twitter caches API responses in Redis to reduce the number of database queries and improve the performance of its APIs.
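As an illustration of the timeline-caching idea (point 1 above), a user’s most recent tweet IDs could be kept in a capped Redis list; the key names, sizes, and TTL here are assumptions for the sketch, not Twitter’s actual schema.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def push_to_timeline(user_id, tweet_id, max_len=800, ttl_seconds=3600):
    key = f"timeline:{user_id}"
    r.lpush(key, tweet_id)          # newest tweet id goes to the front
    r.ltrim(key, 0, max_len - 1)    # cap the cached timeline length
    r.expire(key, ttl_seconds)      # time-based invalidation for idle timelines

def read_timeline(user_id, count=50):
    return r.lrange(f"timeline:{user_id}", 0, count - 1)
```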

By using Redis, Twitter is able to handle high volumes of traffic and provide low latency experiences for its users, even during peak times. The in-memory nature of Redis also allows Twitter to process real-time data quickly and provide real-time insights to its users.

You might be thinking: which one should you choose, Memcached or Redis?

It depends on the use case. Memcached is simple and is often used to cache frequently accessed data in web applications to reduce the load on databases.

Redis, on the other hand, is more versatile: it supports a wider range of data structures, such as strings, hashes, lists, sets, sorted sets, and bitmaps, and it also offers persistence options and advanced features.

In general, if you need a simple cache for infrequently changing data, Memcached might be sufficient, but if you need more complex data structures, persistence, or advanced features, Redis would be a better choice.

However, each has flaws of its own, such as Memcached’s limited scalability options and Redis’s higher memory requirements. To address them and boost the overall performance and scalability of these systems, new techniques and algorithms are constantly being proposed.


Let us look into some recently proposed systems that we found interesting.

4.3. CACHE: A Scalable and Dynamic Distributed Cache System

This work addresses the challenge of designing cache systems that can adapt to changing performance requirements and system configurations. The proposed system, CACHE, is designed to reduce the number of accesses to slow storage, increase the hit rate in cache memory, and adapt to changing access patterns over time.

The system is based on a hierarchy of caches, where each cache level has a different size and speed, starting with small, fast levels and ending with large, slow levels. Data is dynamically partitioned into the different cache levels based on its popularity and recency of access. The system is designed to scale to large numbers of nodes and large amounts of data, and it uses techniques such as load balancing and data partitioning to ensure that the system remains balanced and efficient as it grows.
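As a rough illustration of this hierarchy (not the paper’s actual algorithm), a multi-level lookup walks the levels from fastest to slowest and promotes whatever it finds back into the faster levels; the levels and backing store here are simple stand-ins.

```python
def hierarchical_get(key, levels, database):
    """levels is ordered from the smallest/fastest cache to the largest/slowest one."""
    for i, level in enumerate(levels):
        if key in level:
            value = level[key]
            for faster in levels[:i]:    # promote the hit into the faster levels
                faster[key] = value
            return value

    value = database.fetch(key)          # hypothetical backing store
    for level in levels:                 # populate every level on a miss
        level[key] = value
    return value
```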

One of the key features of the proposed mechanism is its ability to handle invalidations efficiently. The mechanism uses a combination of invalidation algorithms, such as push-based and pull-based invalidation, to ensure that data is updated efficiently and consistently across all cache levels.

In summary, CACHE is an adaptable cache system comprised of multiple cache servers connected via a high-speed network, and it uses dynamic partitioning to keep the cached data evenly distributed across the servers. Its evaluations show that it outperforms other cache systems on several important metrics, including cache hit ratio, response time, and scalability, across the environments tested.

4.4. Scalable Distributed Caching with Dynamic Resource Allocation (SDC-DRA)

This work addresses the challenge of managing the trade-off between cache hit rate and cache resource utilization in large-scale distributed systems. The proposed system, called SDC-DRA, aims to do so by dynamically allocating cache resources based on the workload and cache utilization.

The mechanism is based on a cluster of caching nodes, each of which is equipped with its own cache memory.

One of the key features of the mechanism is its use of a prediction model to dynamically allocate cache resources. The prediction model takes into account factors such as the popularity and recency of data, as well as the current load on the system, to determine the best cache allocation for a given data item. This allows the system to make effective use of cache resources and to improve cache hit rates.

Overall, the mechanism proposed in “Scalable Distributed Caching with Dynamic Resource Allocation” provides a comprehensive solution for improving the performance of large-scale, data-intensive distributed systems. The detailed evaluation of the system’s performance demonstrates its ability to deliver high cache hit rates while minimizing the cache resource utilization and maintaining high scalability.

What’s the difference between Memcached or Redis and the systems proposed above?

One of the key differences between the proposed mechanisms and open-source caching solutions like Memcached or Redis is dynamic resource allocation, which lets the system make more effective use of cache resources and improve hit rates. In contrast, Memcached and Redis typically use static cache allocation, which may not adapt as well to changing access patterns.

Another difference is scalability. The proposed mechanisms are designed to handle large amounts of data and large numbers of nodes, whereas Memcached and Redis may not be as scalable or efficient in large-scale, data-intensive distributed systems.

Both CACHE and SDC-DRA utilize dynamic allocation, right? What’s the difference?

Both mechanisms are designed to be scalable and efficient in large-scale, data-intensive distributed systems. However, the key differences would be:

  1. Approach to dynamic resource allocation: The mechanism proposed in “CACHE” uses a load balancing algorithm, whereas the mechanism proposed in “Scalable Distributed Caching with Dynamic Resource Allocation” uses a prediction model.
  2. Invalidation Mechanism: “CACHE” uses a combination of push-based and pull-based invalidation algorithms, whereas “SDC-DRA” uses a pull-based invalidation mechanism that periodically checks the cache data for consistency.

In summary, both mechanisms have their own strengths and weaknesses, and the choice between them will depend on the specific requirements of a particular system.

Conclusion

Distributed caching is a technique for improving the performance of access to data by storing frequently accessed data in memory across multiple nodes in a network.

To manage cache invalidation in distributed systems, sophisticated algorithms and protocols are used to determine when cached data should be removed from the cache. The choice of cache invalidation algorithm depends on the specific requirements of the system, such as the rate of change of the data, the size of the cache, and the trade-off between cache hit rate and network overhead.

Large-scale web applications frequently use distributed caching systems to improve performance and scalability. There are numerous open-source caching systems available, including Memcached and Redis. We saw how Facebook utilizes Memcached and how Twitter uses Redis. Every system has flaws, and finding ways to fix them is a constant task. CACHE and SDC-DRA, two recently proposed systems that make use of dynamic resource allocation, produced notable results in their respective evaluations.

Finally,

This blog is co-authored by Trinibhaskar, Abhilash Datta, Rohit Raj, Sunanda Mandal, Matta Varun and Haasita Pinnepu.

References

  1. https://research.facebook.com/file/839620310074473/scaling-memcache-at-facebook.pdf
  2. Chen, Shanshan & Tang, Xiaoxin & Wang, Hongwei & Zhao, Han & Guo, Minyi. (2016). Towards Scalable and Reliable In-Memory Storage System: A Case Study with Redis. 1660–1667. 10.1109/TrustCom.2016.0255.
  3. “CACHE: A Scalable and Dynamic Distributed Cache System” (2007) by Lei Shi, Youcef Rahal, Dan Feng, Thomas F. Wenisch.
  4. “Scalable Distributed Caching with Dynamic Resource Allocation” (2011) by T. Su, J. Lu, Y. He, Z. Li, J. Li.
  5. “Distributed Cache Invalidation Algorithms for Web Caching Systems” by D. Kim et al. (2003)
  6. “Performance and Scalability of Distributed Caching Systems: A Comparative Study” (2018) by A. Othman, M. S. Qureshi, and M. Alghamdi
