Shenandoah National Park. Photo by skeeze from Pixabay

Leveraging Shenandoah to cut Cassandra’s tail latency

Amir Hadadi
Outbrain Engineering
6 min read · Jun 8, 2020

At Outbrain, we use Cassandra extensively. Until recently, our Cassandra clusters were configured with G1 and frequently had pauses of 100–300ms. We used Shenandoah in some of our microservices and got great results, so we decided to check what Shenandoah could do for Cassandra.
Eventually, we migrated all our clusters to Shenandoah.

Why Shenandoah?

Shenandoah is a garbage collector that decouples pause duration from the live set size and aims to keep stop-the-world pauses very short. The flip side can be lower throughput than with G1, for two main reasons:

  1. It requires more GC barriers than G1 to allow for concurrent relocation of objects.
  2. It’s non-generational, so instead of usually collecting just the young generation like G1, it has to collect the entire live set in each GC cycle.

We used Shenandoah in some of our microservices, which gave us the confidence to test it with Cassandra. As Cassandra 3.11 does not support Java versions beyond 8, we had to use a build of JDK 8 that contains a backport of Shenandoah. By the way, if you’re wondering about ZGC: it has no JDK 8 backport, so it’s a non-starter for this use case.
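As an aside, a quick way to check whether a given JDK 8 build actually contains the Shenandoah backport (a generic sanity check on our part, not a step from the original rollout) is to ask the JVM to start with it:

java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -version

If the backport is missing, the JVM refuses to start with an “Unrecognized VM option” error.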

Initial results

We started by configuring one Cassandra node with Shenandoah. Our configuration included (see the flag sketch after the list):

  1. A server with 2 CPU sockets and 24 hyper-threads
  2. A 16GB heap (Xmx)
  3. -XX:ConcGCThreads=6, a quarter of the number of hyper-threads
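A minimal sketch of the corresponding GC flags, assuming they go into Cassandra’s jvm.options (depending on the JDK 8 build, -XX:+UnlockExperimentalVMOptions may or may not be required):

-Xmx16g
-XX:+UnlockExperimentalVMOptions
-XX:+UseShenandoahGC
-XX:ConcGCThreads=6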

Looking at the concurrent collection time, we noticed that 20 seconds out of each minute (roughly a third of wall-clock time) were devoted to concurrent GC. As this was happening under light load, it was clear that higher loads would drive Shenandoah into the ground.
To tackle this we could either devote more resources to GC (larger heap, more concurrent threads) or reduce the work the GC has to do (smaller live set, lower allocation rate).
We chose to reduce the live set by changing the on-heap memtables size from 4GB to 1GB, compensating to some extent by changing the memtable allocation type from offheap_buffers to offheap_objects. We also reduced the key cache size from 1GB to 10MB. With these changes in place, concurrent GC time dropped to around 3%.
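For illustration, this is roughly what those changes look like in cassandra.yaml (a sketch showing only the settings discussed above, not our full configuration):

memtable_heap_space_in_mb: 1024           # was 4GB
memtable_allocation_type: offheap_objects # was offheap_buffers
key_cache_size_in_mb: 10                  # was 1GB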
To compare G1 and Shenandoah fairly, we applied the above changes to the G1 configuration as well. We based the comparison on the total application stopped time as reported in the GC log, which includes TTSP (time to safepoint), GC, and additional runtime operations¹. We generated a pause time distribution by repeatedly choosing a uniformly distributed point in time and checking in the GC log whether that point fell inside a GC pause, and if it did, how long it took for that pause to finish.
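To make the sampling method concrete, here is a small Python sketch (names are ours, not from the original tooling; it assumes the pauses were already parsed from the GC log into (start, duration) pairs in seconds, sorted by start time):

import bisect
import random

def sample_residual_pauses(pauses, total_time, n_samples=100000):
    # pauses: [(start, duration), ...] sorted by start, both in seconds.
    # For each uniformly random instant, record how much pause time is left
    # until the enclosing pause ends, or 0 if the instant is outside any pause.
    starts = [start for start, _ in pauses]
    residuals = []
    for _ in range(n_samples):
        t = random.uniform(0.0, total_time)
        i = bisect.bisect_right(starts, t) - 1  # last pause starting at or before t
        if i >= 0:
            start, duration = pauses[i]
            residuals.append(max(0.0, start + duration - t))
        else:
            residuals.append(0.0)
    return residuals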
The resulting pause times for G1 vs Shenandoah were:

To check whether reducing the memtable size had increased the cluster’s write amplification, we monitored the ratio of total compacted bytes to flushed bytes, but didn’t observe any change. Note that the single table we have in this cluster uses leveled compaction.

Write Amplification

After deploying to an entire cluster, we monitored client-side latency. The sharp drop you see below is exactly what we expected.

Cassandra 99.9th percentile latency viewed from the client

We wanted to check whether the node’s throughput decreased, as theoretically the JVM should have higher CPU usage when running with Shenandoah than with G1.
Below is the CPU usage normalized per request to account for fluctuations in traffic. The node used G1 until 17:29, then it was switched to Shenandoah and restarted. After the metrics stabilized, no increase in CPU usage was seen.

The plot thickens

After a few weeks of Cassandra running smoothly, we deployed to the next cluster, which had the row cache enabled.
The nodes were up for a few days and then started crashing one by one due to segmentation faults. Looking at the crash log, the offending stack trace pointed to the OHC library, which manages the off-heap row cache in Cassandra. Specifically, we saw this method being called:

org.caffinitas.ohc.linked.OffHeapMap.removeEldest()

So the crash was related to row cache evictions. This was also evident in the cache usage graphs, as the time the row cache reached capacity coincided with the time of the crash. As the crashes were not happening with G1, we decided to approach the Shenandoah developers over at the Shenandoah dev mailing list. We supplied a reproducer using Cassandra stress with the row cache enabled and handed it to Aleksey Shipilëv, one of the lead developers of Shenandoah. After an investigation, it turned out that the bug causing this crash had already been fixed upstream but was never backported. Once the fix was backported, we deployed it and, indeed, the crashes were gone.

Making it even better

After a few weeks in production, we noticed that pause times kept growing longer. This is a chart of the time it took to scan the string table (the native hash table that holds interned strings) over a period of a few days.

String table scan time increases over time

Consulting with Aleksey Shipilëv, we figured out that by default, Shenandoah does not run class unloading (which includes string table purging) during the regular GC cycle. That’s because, until JDK 14, class unloading takes place during a safepoint and can increase pause time. However, this workload required occasional class unloading, as otherwise the regular pauses grew longer and longer.
To solve that, we tuned class unloading to run every 100 GC cycles:

-XX:+ClassUnloadingWithConcurrentMark
-XX:ShenandoahUnloadClassesFrequency=100

Additional improvements came from adding the -XX:+MonitorInUseList flag, which reduces monitor deflation² time, and from trimming the size of Cassandra thread pools (sketched below), which lowers thread stack scanning time and helps mitigate native method marking time³.
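For reference, the thread pool sizes in question are mostly controlled from cassandra.yaml; the values below are the stock defaults, shown for illustration rather than as our production settings:

concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
native_transport_max_threads: 128

Smaller pools mean fewer thread stacks for the GC to scan during a pause.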
After applying all these optimizations, we managed to get the 99.99th percentile pause time down to 10ms, effectively eliminating our GC issues.

Conclusion

Shenandoah can dramatically lower Cassandra's pause times without noticeably affecting throughput. However, you need to make sure Shenandoah keeps up with the allocation rate. One way to achieve that is to lower the on-heap memtable size from the default of ¼ of the heap and to configure the key cache size appropriately (the default is fine). For even lower pause times, add the following flags and make sure Cassandra thread pools are reasonably sized.

-XX:+ClassUnloadingWithConcurrentMark
-XX:ShenandoahUnloadClassesFrequency=100
-XX:+MonitorInUseList

If you are trying out Shenandoah, share your experiences at the Shenandoah dev mailing list. You will find an amazing level of engagement and willingness to help.

[1] See part II in Aleksey Shipilëv’s presentation for a good analysis of the different operations comprising a pause.

[2] -XX:+MonitorInUseList is the default since JDK 9, and there’s an RFE to move monitor deflation outside of safepoints.

[3] See this bug for details; native method marking time improved in JDK 12.
