Scaling High-Performance In-Memory Caches with ChronicleMap

DV Engineering (DoubleVerify Engineering) · Jun 26, 2024

Written By: Naved Khan

The Challenge of Scaling In-Memory Caches

One of the applications we use to integrate with various programmatic and demand-side platforms (DSPs) is a low-latency, high-throughput JVM-based application. It is the core component of DV’s pre-bid verification solution. Since successfully launching this solution many years ago, we have continued to add critical features while maintaining strict latency SLAs and processing hundreds of billions of requests daily.

The size of measurement data — fraud, brand safety, etc. — used for pre-bid evaluation has grown tremendously in recent years. As datasets grew, so did the size of our in-memory caches. At one point, we faced the critical challenge of scaling up our largest in-memory cache to support a new feature launch that could quadruple our cache size. We had strict low-latency response SLAs, various deployment options to support, and a JVM heap that had to stay below 32GB so we could keep using compressed object pointers. After evaluating several options, we decided to utilize Chronicle Software’s open-source ChronicleMap as our solution.

In this blog post, I will discuss our journey of evaluating various open-source and enterprise solutions and why we ultimately chose ChronicleMap. I will also share our best practices and learnings from our experience. This post does not get into the technical implementation details, which are readily available in the online documentation for each solution mentioned.

The Problem at Hand

Our existing caching solution could not handle the increased workload, so we needed a new solution that could scale up while still meeting the strict latency and deployment requirements. The application serves two main types of pre-bid integrations: (1) cached integrations, where the DSP caches our responses on their end and therefore does not impose strict latency requirements, and (2) live in-bid integrations, which, in contrast, require end-to-end response latency of less than 10 milliseconds.

To make the requirements more stringent, this application needed to be hosted on-premises at DV, in the cloud, and even on-premises at some of our partners’ locations. We could have maintained multiple versions of the application but decided on a single version to keep our release pipelines simple and reduce deployment and maintenance complexity. So, any solution we chose needed to comply with all of the above requirements. We evaluated several options to find the best one for our needs.

Evaluating Various Open-Source and Enterprise Solutions

To find a solution that met our requirements, we evaluated several open-source and enterprise options, including:

  • Redis — a popular open-source, in-memory key-value data store.
  • EhCache — a widely used, open-source Java cache.
  • Aerospike — an enterprise-grade distributed caching solution that supports high-throughput, low-latency caching.
  • Hazelcast — an open-source solution for distributed computing and in-memory caching.
  • ChronicleMap — an in-memory, key-value store designed for low-latency and/or multi-process applications.

Why We Decided to Utilize ChronicleMap

We researched and evaluated the advantages and limitations of the above options. While each was great in its own way, we needed something very specific for our unique combination of requirements.

We ruled out any solution that required out-of-process lookups: compared to the speed of accessing local RAM, the additional round-trip time would risk breaking our overall end-to-end latency requirement of 10 milliseconds. Furthermore, creating server clusters in each region for these solutions would add significant maintenance and infrastructure cost overhead for us and, more importantly, for on-premises partners hosting our application. This ruled out Redis and Aerospike.

While Terracotta’s EhCache seemed to provide good performance for smaller caches, it was not suitable for our needs due to its limitations in scaling large caches and its lack of support for off-heap storage in the open-source version. Terracotta’s off-heap storage solution, BigMemory, was only available as a commercial solution.

Hazelcast showed good performance and scalability but required significant configuration and tuning to achieve optimal performance. Moreover, Hazelcast’s off-heap solution, High-Density Store, was only available commercially.

In contrast, ChronicleMap stood out as the best fit for our needs as it met our core requirements.

  • It was optimized for low garbage collection and provided high-performance storage, retrieval and iteration of data.
  • It could be implemented as part of our application, allowing us to keep deployments simple without the need to maintain a separate cluster of servers for an external cache.
  • It was implemented as an in-memory, off-heap solution that could scale further using the disk. Since it was off-heap, we could scale the in-memory caches while keeping our application’s on-heap memory allocation under 32GB.
  • The features available in the open-source version seemed sufficient for our requirements — mainly in-memory, off-heap map, and persistence to disk for fast application restarts.
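
To give a flavor of the API, here is a minimal sketch of creating a persisted, off-heap map with the open-source builder (the cache name, key/value types, and sizing numbers are illustrative, not our production configuration):

```java
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;

public class OffHeapCacheExample {
    public static void main(String[] args) throws IOException {
        // An off-heap map persisted to a file so the data survives restarts.
        try (ChronicleMap<CharSequence, CharSequence> cache = ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .name("page-categories")          // hypothetical cache name
                .entries(10_000_000)              // pre-allocated entry count
                .averageKeySize(64)               // estimated average key bytes
                .averageValueSize(128)            // estimated average value bytes
                .createPersistedTo(new File("page-categories.dat"))) {

            cache.put("example.com/article", "news,technology");
            System.out.println(cache.get("example.com/article"));
        }
    }
}
```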

We also found that ChronicleMap was widely used in production at banks and hedge funds globally, which operate at a scale similar to DV’s. We implemented a proof of concept (POC), which showed promising results. At that point, we felt confident choosing ChronicleMap to scale our in-memory caches.

Our Experience and Best Practices

We have been able to scale our in-memory caches as our datasets have grown without sacrificing low-latency response times.

The following are the notable improvements we have seen in our application from utilizing some of the features offered by ChronicleMap:

Ability to scale the datasets

After we successfully moved the page classification categories to the off-heap cache and launched the feature that depended on scaling it, we started migrating other datasets. Lookups against the ChronicleMap caches added only negligible extra time (nanoseconds) compared to lookups against the original HashMaps. We were finally able to scale each of our datasets by multiple factors if needed, which was a huge benefit.

Improved garbage collection

We noticed a considerable reduction in garbage collection overhead after moving many of our datasets to ChronicleMap. Since a significant portion of memory moved from the heap to off-heap, the reduced JVM memory footprint resulted in less frequent and faster garbage collection cycles.

Improved code maintainability

Prior to using ChronicleMap, we had to maintain two copies of each in-memory cache. One was the “active” cache used to serve API requests, while the other was the “passive” cache used for data updates. Since the application uses lock-free algorithms for low latency, we needed to keep separate caches and swap them in an atomic operation after every data update cycle. One of the benefits of ChronicleMap is its ability to handle concurrent access, without blocking, at high performance. This allowed us to keep a single copy of the off-heap cache and serve high-throughput requests while maintaining low latency. We were able to refactor away all the redundant code around maintaining the two copies and swapping them after each data update.
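
For context, here is a simplified sketch of the double-buffer pattern we removed (class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Simplified version of the old pattern: readers always see a consistent
// snapshot because updates build a fresh map and swap it in atomically.
public class DoubleBufferedCache<K, V> {
    private final AtomicReference<Map<K, V>> active =
            new AtomicReference<>(new HashMap<>());

    // Lock-free reads against whichever map is currently active.
    public V get(K key) {
        return active.get().get(key);
    }

    // Each data update cycle rebuilds the passive copy, then swaps it in.
    public void reload(Map<K, V> freshData) {
        active.set(new HashMap<>(freshData));
    }
}
```

With ChronicleMap’s concurrent, non-blocking access, a single off-heap map replaced both copies and the swap logic.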

Reduction in code complexity

Before using ChronicleMap, we were forced to adopt several techniques to optimize memory utilization, even at the cost of extra lookups. One example was the layout of the page categories cache. Since every classified URL had a varying number of categories assigned, to save memory we created buckets with preallocated memory based on the distribution of URLs and their number of assigned categories — e.g., 5 million URLs had fewer than 10 categories and belonged to one bucket, while fewer than 1 million had up to 50 categories and belonged to another.

Each bucket was a HashMap, where the key was the URL hash and the value was an index into a contiguous array divided into fixed-size category slots. The large contiguous array for each bucket drastically reduced the number of object references compared to using a separate array per URL; millions of references would have significantly prolonged any full garbage collection. There was also another HashMap that mapped each URL to its bucket.
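
A rough sketch of that layout, with hypothetical names and sizes, looks like this:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative reconstruction of one bucket in the old layout: a HashMap
// from URL hash to a slot index into one large, contiguous category array.
public class CategoryBucket {
    private final int categoriesPerUrl;              // fixed slot width, e.g. 10 or 50
    private final int[] categories;                  // one shared array per bucket
    private final Map<Long, Integer> slotByUrlHash = new HashMap<>();
    private int nextSlot = 0;

    public CategoryBucket(int maxUrls, int categoriesPerUrl) {
        this.categoriesPerUrl = categoriesPerUrl;
        this.categories = new int[maxUrls * categoriesPerUrl];
    }

    public void put(long urlHash, int[] cats) {
        int slot = slotByUrlHash.computeIfAbsent(urlHash, h -> nextSlot++);
        System.arraycopy(cats, 0, categories, slot * categoriesPerUrl,
                Math.min(cats.length, categoriesPerUrl));
    }

    public int[] get(long urlHash) {
        Integer slot = slotByUrlHash.get(urlHash);
        if (slot == null) return null;
        int base = slot * categoriesPerUrl;
        return Arrays.copyOfRange(categories, base, base + categoriesPerUrl);
    }
}
```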

In addition, we had to maintain two copies, as mentioned above. We could remove all this code complexity with a much simpler ChronicleMap-based design because we no longer had to worry about on-heap memory.

Faster startups

ChronicleMap supports persisting the in-memory cache to a file. This allows the data to outlive the process that created it — for example, to support hot application redeployment. We used this for fast application restarts without needing API calls to our caching layer to load the initial data.
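
As a hedged sketch, reattaching to the persisted file at startup looks roughly like this (the builder parameters must match those used to create the file; names and sizes are illustrative):

```java
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;

public class CacheRestart {
    // Reattach to the persisted off-heap store on startup. The "OrRecover"
    // variant can also repair the file after an unclean shutdown.
    static ChronicleMap<CharSequence, CharSequence> reopen(File file) throws IOException {
        return ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .name("page-categories")   // hypothetical; must match the original config
                .entries(10_000_000)
                .averageKeySize(64)
                .averageValueSize(128)
                .createOrRecoverPersistedTo(file);
    }
}
```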

While the code to use ChronicleMap was simple, arriving at an optimized configuration and building out benchmarking and instrumentation was challenging.
The following are the notable challenges we had to overcome:

Pre-allocation based on estimation
The version we evaluated had no feature to resize the cache dynamically, so we needed to anticipate growth in the datasets and preallocate the right amount of memory. For the page categories cache mentioned above, we also needed to estimate the average number of categories per page and use that to allocate memory. We had to monitor this continuously to ensure our assumptions held; if the distribution of page categories changed drastically, it would require a new configuration and a redeployment of the application.
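
The sizing decisions boil down to a handful of builder parameters; the sketch below is illustrative (the numbers are made up, not our production values):

```java
import net.openhft.chronicle.map.ChronicleMap;
import net.openhft.chronicle.map.ChronicleMapBuilder;

public class SizingExample {
    // entries() fixes the expected entry count up front, averageValueSize()
    // encodes the estimated categories-per-page, and maxBloatFactor() grants
    // limited headroom beyond entries() before inserts start failing.
    static ChronicleMapBuilder<CharSequence, CharSequence> sizedBuilder() {
        return ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .entries(50_000_000)    // anticipated URL count, with growth margin
                .averageKeySize(64)     // estimated average key bytes
                .averageValueSize(40)   // driven by the avg categories-per-page estimate
                .maxBloatFactor(2.0);   // allow limited growth beyond entries()
    }
}
```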

Benchmarking
ChronicleMap performs best when the data it holds fits in RAM; it can also spill to disk to hold datasets larger than RAM, but lookups are slower when the data does not fit. Benchmarking the various configurations was not easy, and we had to build a custom test application to run tests for the different scenarios. Published latency figures for different scenarios would have simplified our work.
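
For illustration, the kind of probe we ran looked roughly like the following (a real benchmark needs warm-up and a proper harness such as JMH; this is only a sketch):

```java
import java.util.Arrays;
import java.util.Map;

public class LookupProbe {
    // Measures per-lookup wall time against any Map (ChronicleMap implements
    // ConcurrentMap) and prints rough latency percentiles.
    static void probe(Map<CharSequence, CharSequence> cache, CharSequence[] keys) {
        long[] samples = new long[1_000_000];
        for (int i = 0; i < samples.length; i++) {
            long t0 = System.nanoTime();
            cache.get(keys[i % keys.length]);
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        System.out.printf("p50=%dns p99=%dns p99.9=%dns%n",
                samples[samples.length / 2],
                samples[(int) (samples.length * 0.99)],
                samples[(int) (samples.length * 0.999)]);
    }
}
```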

Documentation
The public product documentation did not cover all the details. We found multiple useful configuration settings only in the source code, from which the Javadoc documentation is autogenerated. The source code was also divided into many modules and projects, which we found challenging to navigate.

Handling custom values
One of our requirements was to support multimaps — an associative container where more than one value may be associated with, and returned for, a given key. We had to write code to support this, including a custom serializer that we implemented ourselves. Later, we found an existing class in the source code that we could extend, and we switched to using it. Not knowing it existed, and how to use it, delayed our implementation a bit.
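
To show the multimap semantics only (not the serializer-based approach we actually shipped), here is a hypothetical shim that encodes multiple integer values per key into a single string value:

```java
import net.openhft.chronicle.map.ChronicleMap;

// Illustrative only: a multimap facade over a plain ChronicleMap, encoding
// many category IDs per URL key as one comma-separated value. Our production
// code used ChronicleMap's serializer hooks instead of string encoding.
public class MultimapShim {
    private final ChronicleMap<CharSequence, CharSequence> map;

    public MultimapShim(ChronicleMap<CharSequence, CharSequence> map) {
        this.map = map;
    }

    public void add(String key, int value) {
        CharSequence existing = map.get(key);
        map.put(key, existing == null
                ? Integer.toString(value)
                : existing + "," + value);
    }

    public int[] getAll(String key) {
        CharSequence joined = map.get(key);
        if (joined == null) return new int[0];
        String[] parts = joined.toString().split(",");
        int[] values = new int[parts.length];
        for (int i = 0; i < parts.length; i++) values[i] = Integer.parseInt(parts[i]);
        return values;
    }
}
```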

Current State and Next Steps

We have been using ChronicleMap for over a year in production, and it has proven to be a scalable, reliable and performant solution for our needs. In the future, we will evaluate some more options to further improve our application:

Migrating more datasets

We may look into migrating the remaining datasets to ChronicleMap and continue utilizing persisted caches for faster restarts.

Shared caches

In our current design, each API backend server maintains its own ChronicleMap cache. ChronicleMap supports sharing the same cache across different JVM processes. Creating an external service to move the source data into a single co-located ChronicleMap cache that can be shared across multiple backend servers would help us reduce network utilization.
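
Since ChronicleMap’s multi-process sharing works by having each co-located JVM open the same persisted file, the change could be as simple as the following sketch (file path, name, and sizing are illustrative):

```java
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;

public class SharedCache {
    // Each JVM process on the same host opens the same persisted file and
    // gets a view of one shared off-heap map.
    static ChronicleMap<CharSequence, CharSequence> open() throws IOException {
        return ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .name("shared-cache")       // hypothetical name
                .entries(10_000_000)
                .averageKeySize(64)
                .averageValueSize(128)
                .createPersistedTo(new File("/dev/shm/shared-cache.dat")); // illustrative path
    }
}
```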

Auto-resize

We recently discovered a new feature that can automatically resize the ChronicleMap caches. We intend to explore this further and see how we can integrate it.
