Using Flame Graphs to Analyze & Determine Cassandra Performance Under Stress (Part 2 of 3)

by Mark Price

In the first post in this series, I walked through the steps involved in setting up a load test of the Apache Cassandra database. The load test was extremely simple, comprising a single-server, single-client configuration.

At the end of the article, I showed that (unsurprisingly) as throughput plateaued towards a maximum, the average request latency remained fairly constant while the long tail of the latency profile started to increase.

At 60,000 writes per second, the 99.9th percentile latency was around 100 milliseconds, meaning that around 60 requests per second were taking 100ms or longer. The average latency at this request rate was a respectable 10ms, so it is clear that when measuring system responsiveness, it is necessary to consider the outliers rather than relying on the average as an indicator of performance.
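As a sanity check, that outlier rate falls straight out of the percentile definition (a minimal sketch, assuming a steady request rate):

```python
def outliers_per_second(request_rate: float, percentile: float) -> float:
    """Requests per second whose latency lies beyond the given percentile."""
    return request_rate * (1.0 - percentile / 100.0)

# At 60,000 writes/sec, the slowest 0.1% (those at or beyond the
# 99.9th percentile, i.e. >= ~100ms in this test) amount to:
print(round(outliers_per_second(60_000, 99.9)))  # 60
```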

Observing the system under stress

The last article ended with somewhat of a cliff-hanger; it was clear that some component in the system was struggling to keep up with the demands of the load test, but whether this was attributable to the server or the load-generator client was unclear.

In this installment, I’ll look at how to use performance profiling tools to inspect the load-generator and server, without having any specific knowledge about how they operate. This ability to analyze and determine system limits without a detailed understanding of the system in question is a valuable tool in the Performance Engineer’s toolbox.

Flame Graphs

One tool I’ve found immensely useful over the past few years is the Flame Graph — initially developed by Brendan Gregg, and further improved to work with Java, Node.js, and other runtimes by Brendan and other open-source contributors.

Flame Graphs’ primary utility is to give insight into where CPU time is being spent in an application. Under the hood, Flame Graphs are powered by the perf performance-monitoring framework built into the Linux kernel and available on virtually all Linux distributions.

Simply put, Flame Graphs are visualisations of the stack traces recorded by a sampling profiler that runs at the operating system layer. This means their use tends to have lower overhead than a traditional Java sampling profiler, which records stack traces by inducing JVM-wide safepoints.

A JVM safepoint is a mechanism used by the runtime to perform various operations on behalf of an application (e.g., garbage collection); unnecessary safepoint stalls will harm application throughput and latency, and much effort is put into low-latency systems to avoid these stalls.

To generate a Flame Graph for a Java process, I’ll be using a tool from the grav project, which contains wrappers to make working with Flame Graphs a little easier. When the script is run, it uses the perf tool to record application stack traces for the target process (specified by PID) at a set frequency. The stack traces are then aggregated to provide the visualisation.

# generate a flamegraph annotated with thread names

mark@server:~/grav$ ./bin/perf-thread-flames $PID
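Under the hood, the aggregation step "folds" the sampled stacks before rendering: identical call chains are merged and counted, one line per unique stack (this is what Brendan Gregg's stackcollapse scripts do). A minimal sketch of that folding step, using invented stack samples:

```python
from collections import Counter

# Invented stack samples, one per profiler tick; each tuple is the
# call chain from the thread's root frame down to the function on-CPU.
samples = [
    ("epollEventLoopGroup", "epollWait", "processSelectedKeys"),
    ("epollEventLoopGroup", "epollWait", "processSelectedKeys"),
    ("MutationStage", "StorageProxy.mutate", "Memtable.put"),
]

# Fold identical stacks into "frame;frame;... count" lines, the
# input format consumed by flamegraph.pl.
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(stack, count)
```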

Though it’s not necessary to understand exactly what is happening on the server, some context will be valuable. So first, let’s take a look at the threads that are active in the server by running jstack against the process. I’ve elided some threads for clarity, and the remaining interesting thread names are:

  • GC threads
  • JIT Compiler threads
  • Native-Transport-Requests-*
  • PerDiskMemtableFlushWriter_0*
  • epollEventLoopGroup-*
  • MutationStage-*

There are a number of groups of similarly named threads. Inferring from the open-source libraries in use in the system, I can speculate that their purposes are likely to be:

  • Native-Transport-Requests: handler threads that service incoming requests
  • PerDiskMemtableFlushWriter: something that periodically persists data to disk
  • epollEventLoopGroup: netty library threads handling network requests
  • MutationStage: possibly something to do with mutating requests (i.e., insert/update)

In total, the server process has 191 threads, some of which are grouped into thread-pools that have a common purpose (e.g., the epoll event loop workers). The initial Flame Graph, with its 191 separate ‘flames’ is not really useful, since the interesting information is hidden by the reduced width (see below).

191 Flames of initial Flame Graph

Luckily, the grav tools can aggregate flames together using the prefix of thread names. Using this functionality, it is possible to produce a Flame Graph that is still separated by function:

Aggregated Flames separated by function

With this view, it is much easier to see what the various thread-pools are doing.
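Conceptually, the prefix aggregation just sums per-thread sample counts under a common pool name (grav's actual implementation may differ; the thread names and counts below are invented):

```python
import re
from collections import defaultdict

# Invented per-thread sample counts, keyed by full thread name.
thread_samples = {
    "epollEventLoopGroup-2-1": 120,
    "epollEventLoopGroup-2-2": 95,
    "MutationStage-1": 80,
    "MutationStage-2": 75,
    "Native-Transport-Requests-3": 60,
}

def thread_prefix(name: str) -> str:
    """Strip trailing '-<digits>' segments to recover the pool name."""
    return re.sub(r"(-\d+)+$", "", name)

# Sum samples per pool, so each pool becomes one aggregated 'flame'.
pools = defaultdict(int)
for name, count in thread_samples.items():
    pools[thread_prefix(name)] += count

print(dict(pools))
# {'epollEventLoopGroup': 215, 'MutationStage': 155, 'Native-Transport-Requests': 60}
```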

Interpretation

Now that the data is available, some initial observations can be made. The original flamegraph can be found here (warning: 4MB download).

The majority of CPU time is spent in three thread-pools: epollEventLoopGroup, MutationStage, and Native-Transport-Requests.

As previously suspected, epollEventLoopGroup is mainly concerned with network reads/writes as shown when zooming in to the relevant stacks:

epollEventLoopGroup Flame Graph (above and below)

The other two busy thread-pools seem to end up doing some of the same work, updating the datastore via the StorageProxy. The Native-Transport-Requests pool reads requests off the netty queues and applies updates; it seems to spend about 25 percent of its time waiting for work in SEPWorker.doWaitSpin. This is a hint that this particular thread-pool is not CPU-bound, and is sometimes idle waiting for work (though this may not be the whole story). The MutationStage thread-pool has a very similar profile.

Zooming out to look at everything else, there is a small amount of JIT compiler activity (this is to be expected — the Flame Graph was generated when the application was well warmed up). A similarly small proportion of time is spent on garbage collection, and there are some scheduled tasks that are periodically run (e.g., metrics handling).

Given what has been shown so far, it doesn’t seem that the system throughput is limited by the server being CPU-bound. This can be further investigated by looking at system-level statistics from the machine hosting the Cassandra application.

Running mpstat 1 will emit a per-second summary of CPU usage on the host machine:

Linux 4.9.15-x86_64-linode81   05/07/17   _x86_64_   (8 CPU)

18:53:16  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %idle
18:53:17  all  0.00  0.00   0.00  0.00     0.00  0.00   0.00    100.00

Plotting the interesting values over time will reveal to what extent the application uses the host CPU. In the chart below, the different phases of the load test can be seen where the percentage of user CPU time increases for a brief interval. During these busy periods, the percentage of idle CPU time drops to hover around 10 percent, so the system is definitely close to capacity.

CPU use during different load test phases
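Producing such a chart only requires pulling the %usr and %idle columns out of each per-second "all"-CPU sample. A minimal parser, assuming the column layout shown in the mpstat output above (other mpstat versions or locales may format lines differently):

```python
# Extract (timestamp, %usr, %idle) from the "all"-CPU lines of
# `mpstat 1` output; returns None for headers and per-CPU lines.
def parse_mpstat_line(line: str):
    fields = line.split()
    if len(fields) < 10 or fields[1] != "all":
        return None  # header, blank, or per-CPU line: skip it
    return fields[0], float(fields[2]), float(fields[-1])

line = "18:53:17 all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00"
print(parse_mpstat_line(line))  # ('18:53:17', 0.0, 100.0)
```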

The fact that the idle metric never reaches zero could be attributed to the doWaitSpin method encountered earlier. Presumably, when there is no work available for a particular thread, it parks itself for a short while, incurring a context switch.

In low-latency systems, this is usually dealt with by having a thread-per-CPU and using a busy-spin to ensure that processing time is not wasted in context switching. This, however, makes it much harder to shard a workload over a thread-pool.

After looking at possible causes of throughput limits, let’s take a quick look at latency. As the throughput increases, tail latency worsens to the point where there are several requests per second taking hundreds of milliseconds.

This could be attributed to a number of factors, but a likely culprit with any JVM-based application is garbage collection. The JVM garbage collector needs to perform regular (depending on workload) stop-the-world pauses in order to reclaim heap memory from objects that are no longer referenced.

One way to determine whether GC is responsible for latency outliers is to record exactly when an outlier occurred, and then see if there is a correlated entry in the server’s garbage-collection log. Unfortunately, since the cassandra-stress tool only reports aggregates, the exact timestamps of the latency outliers are not available.

Instead, it is possible to make an educated guess. If the magnitude of the outliers is very similar to a GC pause seen within the test phase, then it is likely that the outlier was caused by garbage collection.
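That guess can be mechanized: compare each outlier's magnitude against the stop-the-world pauses recorded during the same phase. A rough sketch with invented pause durations (the 50-percent tolerance is an arbitrary choice):

```python
# Invented stop-the-world pause durations (ms) pulled from a GC log
# for one test phase.
gc_pauses_ms = [12.0, 110.0, 145.0, 18.0, 180.0]

def plausibly_gc(outlier_ms: float, pauses_ms, tolerance: float = 0.5) -> bool:
    """True if some GC pause is within `tolerance` (fractional) of the outlier."""
    return any(abs(p - outlier_ms) <= tolerance * outlier_ms for p in pauses_ms)

print(plausibly_gc(150.0, gc_pauses_ms))  # True: 110-180ms pauses are close
print(plausibly_gc(500.0, gc_pauses_ms))  # False: no pause near 500ms
```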

Charting the server’s GC pause times shows that there are plenty of stop-the-world pauses in the 100–200 millisecond range:

The 99.9th-percentile latency throughout this load test was between 100ms and 500ms as throughput increased. So, these outliers could be explained by garbage collection within the server. Certainly, as the load increases, more objects will be created, putting more pressure on the collector.

Looking at a similar chart from the load-generator client, it can be seen that the client also suffers from pauses that could explain the latency outliers:

This demonstrates an important point: when attempting to measure system latency, it is vital to ensure that the testing harness is not responsible for measurement error, or producing back-pressure on the system under test.

A likely culprit?

From these initial investigations, the most likely cause of the throughput limit is CPU on the server. Although the CPU was not maxed out during the test, a less-than-10-percent idle level points to a system under stress.

The idle time may indicate that the server is unable to move requests smoothly between the various queues within the system: sometimes a stage in the request-processing pipeline stalls because it has no work to do, even though requests may be waiting at another stage.

In terms of latency outliers, garbage-collection pauses certainly look like a smoking gun. Stop-the-world pauses of the right magnitude are evident throughout the stress test. Given that the client seems to suffer quite a lot of pauses, it’s possible that the server-side latency is actually better than that reported by the client. For more accurate measurements, it is often better to calculate request/response latencies from a packet capture that is unaffected by stalls in the client process.

In the next post, I’ll be looking at other parts of the system that could be experiencing high load, and having an effect on the performance of the server during the testing process.


About the blogger: Mark Price is principal software consultant at Aitu Software, specializing in performance testing, system tuning, and low-latency development. He is a regular presence at renowned IT conferences, like QCon London 2017, and his most recent talk can be seen in this LinkedIn Slideshare. You can find more of his insight and follow his system performance investigations on Twitter (@epickrram) and at his blog, Technical Itch.


Please feel free to share below any comments, questions or insights about your experience with Apache Cassandra and system performance. And if you found this blog useful, consider sharing it through social media.