Page Cache Everywhere: From OS to Cloud Storage

Vladimir Rodionov
Carrot Data Engineering Blog
7 min read · Nov 6, 2024

Quick intro to page caching

Page caching is the OS’s way of speeding things up by storing copies of recently accessed data in memory (RAM) instead of going back to the slower disk each time. When a program needs to read a file, the OS first checks the page cache to see if the data is already there. If it is, great — it serves it right from memory, which is much faster than reading from disk. Why is it called Page Cache? Because the OS stores data in small chunks called “pages” (typically 4KB in size), rather than caching entire files. This approach allows the OS to efficiently manage memory by only caching the parts of a file that are actively needed, rather than loading the whole file into memory.
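To make the read path concrete, here is a toy sketch in Java: data is looked up page by page, and only misses go to disk. The class and the map-based cache are purely illustrative assumptions; the real kernel page cache is far more elaborate.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the read path: data is cached in fixed-size pages
// keyed by (file, page index). Names and structure are illustrative only;
// the real kernel page cache is far more sophisticated.
public class ToyPageCache {
    static final int PAGE_SIZE = 4096; // typical OS page size

    private final Map<String, byte[]> cache = new HashMap<>(); // "file:pageIndex" -> page bytes

    public byte[] readPage(String file, long offset) {
        long pageIndex = offset / PAGE_SIZE;
        String key = file + ":" + pageIndex;
        byte[] page = cache.get(key);
        if (page != null) {
            return page;                              // cache hit: served from RAM
        }
        page = readPageFromDisk(file, pageIndex);     // cache miss: go to disk
        cache.put(key, page);                         // keep it for subsequent reads
        return page;
    }

    private byte[] readPageFromDisk(String file, long pageIndex) {
        return new byte[PAGE_SIZE];                   // placeholder for a real disk read
    }
}
```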

The OS uses any extra RAM to hold onto recently or frequently accessed data, making everything feel snappier. If memory starts getting tight, the OS simply clears out some of the cached data to make space for other applications.

Page caching isn’t just for reading files; it also helps with writing. When a program writes data, it can be temporarily held in the cache before being saved to disk, bundling up writes to make them more efficient. This feature is widely used in operating systems to improve everyday tasks like opening files, launching apps, or managing databases, giving you a faster experience without needing extra hardware.
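The write side can be sketched the same way: writes land in memory as dirty pages and are flushed to disk in batches. The flush threshold and structure below are illustrative assumptions, not how any particular OS actually schedules writeback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of write-back caching: writes are acknowledged from memory
// and flushed to disk in batches, bundling many small writes into one pass.
public class ToyWriteBackCache {
    private final Map<Long, byte[]> dirtyPages = new LinkedHashMap<>(); // pageIndex -> data
    private final int flushThreshold = 64;                              // flush every 64 dirty pages

    public void write(long pageIndex, byte[] data) {
        dirtyPages.put(pageIndex, data);   // the write completes from the caller's point of view
        if (dirtyPages.size() >= flushThreshold) {
            flush();                       // persist a whole batch at once
        }
    }

    public void flush() {
        for (Map.Entry<Long, byte[]> e : dirtyPages.entrySet()) {
            writePageToDisk(e.getKey(), e.getValue());
        }
        dirtyPages.clear();
    }

    private void writePageToDisk(long pageIndex, byte[] data) {
        // placeholder for a real disk write
    }
}
```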

Page cache, originally designed to speed up file access within an operating system (OS), has expanded to support performance improvements across various computing domains. Here are several notable uses beyond OS-level file caching:

- Remote file storage caches
- Databases
- Web servers and content delivery networks (CDNs)
- Distributed file systems and object storage
- Big data processing and distributed computing frameworks

Page cache has extended far beyond its original role in the OS, becoming a critical performance enhancer across many domains, from remote storage to distributed computing frameworks. Each use case adapts the fundamental concept of page caching to suit specific data access patterns and workload needs.

The Rise of Disaggregated Compute And Storage Systems

You’ve probably heard of them already — they’re everywhere in the major clouds: Amazon Aurora, Snowflake, Databricks, data warehouses, data lakes, lakehouses, and distributed SQL query engines like Presto, Trino, StarRocks, and Dremio are just a few examples. Different systems, but a common design: disaggregated compute and storage. Storage is typically cloud-based (e.g., S3, Azure Blob Storage, Google Cloud Storage), which introduces unpredictable latency, imposes API call limits, and incurs a cost per access — all of which impact system performance and drive up your cloud bill.

Picture 1. Caching solution for disaggregated compute and storage systems (www.alluxio.io)

In disaggregated compute and storage systems, page caching is essential for minimizing latency, reducing costs, and providing consistent, high-speed access to frequently used data. It helps bridge the gap between compute and storage resources, allowing applications to maintain high performance and scalability even when storage is accessed over a network. Without page caching, the added latency and variable performance of networked storage could hinder the effectiveness of disaggregated architectures, especially in cloud environments where speed and efficiency are paramount.

Page caching for cloud stores

Many vendors provide caching solutions tailored for cloud and remote file storage. Some offer dedicated, general-purpose caching layers, while others integrate caching as a key component within their products — examples include Snowflake and Databricks data warehouses, DataStax Astra DB, and Amazon Redshift.

While certain solutions cache entire files, most adopt a page caching approach, where files are divided into non-overlapping, adjacent regions or “pages” that are cached locally. This method allows for efficient storage and retrieval of only the necessary portions of data, optimizing performance and reducing latency. By caching data at the page level, these systems improve access times for frequently used data while minimizing storage and network overhead.
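To illustrate, here is the arithmetic such a cache typically performs to map a byte-range request onto page boundaries (the 1MB page size matches the defaults discussed below; the request in main is a hypothetical example):

```java
// Which cache pages does a byte-range request touch? A minimal sketch of the
// page-boundary arithmetic used when files are split into fixed-size pages.
public class PageRange {
    static final long PAGE_SIZE = 1L << 20; // 1MB

    public static void main(String[] args) {
        long offset = 6_285_000;   // hypothetical request: 10KB starting near a page boundary
        long length = 10_240;

        long firstPage = offset / PAGE_SIZE;
        long lastPage  = (offset + length - 1) / PAGE_SIZE;

        System.out.printf("Request [%d, %d) touches pages %d..%d%n",
                offset, offset + length, firstPage, lastPage);
        // This particular 10KB read straddles a page boundary, so two full 1MB
        // pages end up being fetched and cached.
    }
}
```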

Page-as-a-Single-File (PaaSF) Design Approach

In most, if not all, remote storage page caching systems, each data page is stored as a separate file (e.g., in the Alluxio Data Platform, DataStax Astra DB, and the CelerData StarRocks SQL engine). Supporting metadata is kept in memory to track and manage these pages. This metadata is essential for functions like page eviction (tracking page popularity) and basic indexing (such as mapping pages back to specific files). However, this metadata overhead can be significant in some systems; for instance, in Alluxio, each data page requires a minimum of 250 bytes of memory for tracking and indexing.
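The sketch below shows what a page-as-a-single-file layout might look like: one local file per page, plus a small in-memory record per page for eviction and indexing. The path scheme and record fields are assumptions made for illustration, not the actual design of Alluxio or any other vendor.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Sketch of a Page-as-a-Single-File layout: each page becomes one file on
// local disk, plus an in-memory metadata record per page.
public class PaasfStore {
    static class PageMeta {            // per-page bookkeeping kept in RAM
        final String remoteFileId;     // which remote object the page belongs to
        final long pageIndex;          // which region of that object
        long lastAccessMillis;         // used by the eviction policy
        PageMeta(String remoteFileId, long pageIndex) {
            this.remoteFileId = remoteFileId;
            this.pageIndex = pageIndex;
        }
    }

    private final Path cacheDir = Paths.get("/tmp/page-cache"); // hypothetical cache root
    private final Map<String, PageMeta> metadata = new HashMap<>();

    Path pagePath(String remoteFileId, long pageIndex) {
        // one file per page, e.g. /tmp/page-cache/<fileId>/<pageIndex>.page
        return cacheDir.resolve(remoteFileId).resolve(pageIndex + ".page");
    }

    void register(String remoteFileId, long pageIndex) {
        PageMeta meta = new PageMeta(remoteFileId, pageIndex);
        meta.lastAccessMillis = System.currentTimeMillis();
        metadata.put(remoteFileId + ":" + pageIndex, meta);
        // Even a compact record like this, plus map and object overhead, easily
        // reaches a few hundred bytes per page on the JVM.
    }
}
```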

This memory overhead is manageable as long as data pages are reasonably large — this is why many vendors set the default page size to 1MB or larger. With a 1MB page size, the system only needs to manage hundreds of thousands to a few million pages in the local cache, which is feasible within typical system memory limits.

However, there are cases where this standard 1MB page size creates issues. For example, in OLAP applications that work with columnar data formats like Parquet and ORC, approximately 50% of remote storage requests are for data chunks under 10KB. In these situations, not only do different files vary in popularity, but so do specific regions within files. When systems like Spark, Presto, or Trino request a small 10KB section of data, the cache system fetches a full 1MB page, resulting in read amplification.
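A quick back-of-the-envelope calculation makes the cost explicit:

```java
// Read amplification when a 10KB request is served by fetching a full 1MB page.
// Numbers follow the example in the text.
public class ReadAmplification {
    public static void main(String[] args) {
        long requested = 10L * 1024;        // 10KB actually needed
        long fetched   = 1L * 1024 * 1024;  // 1MB page pulled from remote storage
        double amplification = (double) fetched / requested;
        System.out.printf("Read amplification: ~%.0fx (%d bytes fetched for %d bytes used)%n",
                amplification, fetched, requested);
        // ~100x: for every useful 10KB, roughly 1MB of network traffic and cache
        // space is spent.
    }
}
```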

Read amplification has two negative consequences:
1. Cache Inefficiency: Loading unnecessary data pollutes the cache, consuming valuable space with data that may never be accessed again. This lowers the cache hit ratio and increases the number of requests to remote storage.
2. Performance Degradation: Loading 1MB from remote storage takes significantly longer than loading only the requested 10KB, increasing latency and impacting application performance.

So, what’s the solution? It turns out that a universal 1MB page size may not be optimal for all use cases. For OLAP applications, especially those accessing small data chunks, an adaptive or smaller page size may be more efficient, potentially well below 1MB, to minimize read amplification and optimize cache usage.

Can PaaSF Systems Efficiently Handle 10KB Pages?

To some extent, yes, but managing millions of small files in a file system presents several potential performance and operational challenges:

1. Memory Overhead for File System Metadata. Storing millions of files significantly increases the metadata footprint in RAM. In modern Linux systems, metadata includes inode and dentry caches — about 128 bytes per file for inodes and 128–256 bytes per directory entry. This overhead grows quickly as file count rises, consuming substantial memory resources.

2. Degraded Lookup Performance. With millions of small files, it becomes challenging to keep all file system metadata cached in RAM. When inodes or directory entries are missing from cache, additional I/O operations are needed to read data from disk, causing lookup latency to increase. Each uncached lookup requires multiple I/O operations (for inodes and directory entries), further slowing down the system.

3. File System Maintenance and Practical Limits. Maintaining a file system with hundreds of millions of files is complex and resource-intensive. For instance, running fsck (file system consistency check) on a massive file system can take hours, or even days, to complete. This type of maintenance not only impacts system availability but can also require substantial compute and memory resources.

4. Application Metadata Overhead. Application-level caching solutions also incur metadata overhead. For example, Alluxio requires at least 250 bytes of RAM per data page to store necessary metadata (for page eviction and indexing). This means that caching 100 million 10KB pages (1TB total) requires 25GB of RAM just for metadata, a significant resource commitment; a combined estimate that also counts the file system metadata from item 1 is sketched right after this list.

5. Prolonged Startup and Shutdown Times. As the cache size grows, the time required for startup and shutdown also increases. For a cache with 100 million pages, startup or shutdown could take minutes, or even tens of minutes.

6. SSD Wear and Longevity Concerns. Large numbers of small files with random write I/O patterns are detrimental to SSD longevity. SSDs are optimized for sequential access, both for reads and writes. Frequent, small, random writes increase write amplification, accelerating SSD wear. For PaaSF systems using SSD storage, this can lead to faster degradation and higher replacement costs.
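Putting the figures from items 1 and 4 together for the 100 million page scenario gives a rough sense of the total bookkeeping cost (per-entry sizes are the approximations quoted above, not measurements):

```java
// Rough metadata cost of caching 1TB as 100 million 10KB page files, combining
// the OS-level and application-level figures discussed in the list above.
public class SmallPageMetadataCost {
    public static void main(String[] args) {
        long pages = 100_000_000L;

        long inodeBytes  = 128;   // per-file inode cache entry (approx.)
        long dentryBytes = 192;   // per-file directory entry, midpoint of 128-256 bytes
        long appBytes    = 250;   // per-page application metadata (Alluxio figure)

        long fsTotal  = pages * (inodeBytes + dentryBytes);
        long appTotal = pages * appBytes;

        System.out.printf("File system metadata: ~%.1f GB%n", fsTotal  / 1e9);
        System.out.printf("Application metadata: ~%.1f GB%n", appTotal / 1e9);
        System.out.printf("Total RAM just for bookkeeping: ~%.1f GB%n", (fsTotal + appTotal) / 1e9);
    }
}
```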

Conclusion
From a practical perspective, 10KB page sizes are well outside the comfort zone for most PaaSF systems. While it’s feasible, the overheads — in terms of memory, I/O latency, system maintenance, and SSD wear — make it an inefficient choice. For these systems, larger page sizes (closer to 1MB) are generally more manageable, providing a better balance between performance and resource efficiency.

Is There a Better Solution?

Yes, there is: Carrot Cache. This powerful caching technology drives our flagship product, Memcarrot (all relevant links are in the References section). We conducted a benchmark with Carrot Cache on an AWS i3.metal instance handling 1 billion pages, and the results speak for themselves:

- Total Dataset Size: ~10TB (page size = 10KB)
- Number of Files Created: ~29K
- Total Metadata RAM Overhead: ~17GB (including disk dataset index and eviction support)
- Cache Server Startup/Shutdown Time: ~12 seconds. The cache is persistent.
- Write Pattern: No random writes — all writes are sequential, which is ideal for SSD longevity
- Page Read Throughput: ~3.5GB/second (~350K pages/second)

In essence, Carrot Cache addresses 5 out of the 6 challenges we discussed previously, and partially addresses the last one as well. For handling the full range of metadata requirements, including efficient data page indexing, we developed Lodis — an embeddable Redis for Java. Lodis is incredibly efficient at managing large collections and maps in memory, making it ideal for applications with extensive metadata. But that’s a topic for a future post — stay tuned for more on Lodis!

References

  1. Carrot Cache — https://www.github.com/carrotdata/carrot-cache
  2. Memcarrot — https://www.github.com/carrotdata/memcarrot
  3. Carrot Data — https://trycarrots.io
  4. Our Engineering Blog — https://www.medium.com/carrotdata
