Testing Cassandra Throughput Performance ( Part 3 of 3)

By Mark Price

Linode
Linode
Sep 20, 2017 · 8 min read
Image for post
Image for post

Over the last two posts in this series, I’ve looked at performing simple load-tests against a Cassandra server, and using Linux monitoring tools to observe how the server process is utilising system CPU resources.

When pushing a client/server architecture towards its throughput limits, there is another possible bottleneck that is worth investigating: the network layer.

The Linux networking stack

Before looking at whether there is a problem in this area, here’s a quick recap of how packet-receipt works in the Linux operating system.

Image for post
Image for post

When a network packet arrives, it is added to the network device’s receive (Rx) ring-buffer, and an interrupt is raised on the CPU associated with that ring-buffer. A network device may have many ring-buffers, and their interrupts can be spread across multiple CPUs.

Modern network devices take advantage of a feature called interrupt coalescing to reduce the number of hardware interrupts raised during high packet rates. This optimisation is important as a hardware interrupt will usurp any running code on a CPU.

Specifically, the code that should not be interrupted is the softirq daemon. This is the mechanism that deals with copying the network data from the device ring-buffer into a kernel socket-buffer. If the device were to continually interrupt the softirq process, then the kernel would fail to move packets to the socket buffer, increasing the possibility of dropped packets. When the softirq process attempts to dequeue a batch of packets the network driver will temporarily disable interrupts. (This is known as NAPI polling and is beyond the scope of this article.)

Image for post
Image for post

Whilst copying the network packets to the socket-buffer, the softirq process also performs filtering, processing and validation along the way. Once the packets are copied to the socket-buffer, they are available to the user-space program that is attempting to read from the socket. If there is a bottleneck somewhere in the network stack, then it will show up in the form of queueing or packet drops.

Queueing in kernel-space would show up as time-squeeze events in the softirq daemon (meaning that the softirq process was descheduled from the CPU while it still had packets to process); these can be observed in the file /proc/net/softnet_stats. (More on this later.)

If the receiving user-space application is not able to process packets from the socket-buffer fast enough, then the amount of data in the socket-buffer will spike and increase. One source of this information is the file /proc/net/tcp (for TCPv4 connections), maintained by the kernel, and containing a row for each active network connection on the system.

The data items of interest are the tx_queue and rx_queue columns, which display the currently queued bytes in the socket-buffer. In the case of a send-side connection, tx_queue reports the number of bytes that the user-space application has sent to the socket, but which have not yet been processed by the kernel for outbound delivery.

In the case of a receive-side connection, rx_queue reports the number of bytes that the kernel has copied into the socket-buffer (through the process described above), but have not yet been processed by the application.

Measuring is better than sampling

Given the two data sources listed above, I could check the time-squeeze counters before and after a load-test to check for queueing on the kernel side, but I would have to poll the tx_queue and rx_queue values since they are gauges — they represent a snapshot of currently queued bytes.

Using a polling method to read gauges is not an ideal solution, as it is possible for spikes in the socket-buffer depth to rise and fall between polls. If possible, it is best to measure these gauges every time data is queued/de-queued and record some aggregate metric.

is the perfect tool for this kind of task. I can write a program that will use a kernel-probe to record the amount of data queued in each socket, and report out a maximum queued amount per-second. Since this probe will run in the kernel context, and just perform a map lookup and integer addition, it will be very cheap and should not negatively affect application performance.

The project has a number of utilities for interacting with eBPF programs using Python. So in order to monitor queueing in the socket-buffer, I can write a small C program that will be inserted into the kernel, and deploy a user-space Python program to report out summary metrics.

Since I’ll need access to the socket-buffer, in order to determine the amount of queued bytes, I’ll insert the kernel-probe at a useful point in the TCP receive code-path. I’ve arbitrarily chosen the function as it has both the socket and incoming packet available.

Tracing packet receipt

In order to capture the required data, the first mirrors the signature of the tcp_v4_do_rcv method, and maintains a mapping of socket identifier to max buffered bytes. In this way, I can monitor how the reading application (here, the Cassandra server) is managing to process incoming packets.

The trace function (run in the kernel context):

int trace_socket_rcv(struct pt_regs *ctx, struct sock *sk, struct sk_buff *skb) {u16 dport = 0, family = sk->__sk_common.skc_family;if (family == AF_INET) {struct ipv4_key_t ipv4_key = {};ipv4_key.saddr = sk->__sk_common.skc_rcv_saddr;ipv4_key.daddr = sk->__sk_common.skc_daddr;ipv4_key.lport = sk->__sk_common.skc_num;dport = sk->__sk_common.skc_dport;ipv4_key.dport = ntohs(dport);u64 zero = 0, *total, *max;int rmem = sk->sk_rmem_alloc.counter;total = total_rcv_mem.lookup_or_init(&ipv4_key, &zero);(*total) += rmem + skb->data_len;max = peak_rcv_mem.lookup_or_init(&ipv4_key, &zero);if (rmem > (*max)) {(*max) = rmem + skb->data_len;}}return 0;};

The reporting program (run in user-space)

First, define the attach point for the tracing function:

bpf = BPF(text=prog)bpf.attach_kprobe(event=”tcp_v4_do_rcv”, fn_name=”trace_socket_rcv”)

Then, loop forever reporting out the max depth per socket:

while True:time.sleep(1)current_time = datetime.datetime.now()total_depth = bpf[“total_rcv_mem”]max_depth = bpf[“peak_rcv_mem”]if len(total_depth) == 0 and len(max_depth) == 0:print “No data captured”else:for socket, total in total_depth.iteritems():t.write(“{0},{1},{2},{3}\n”.format(current_time.strftime(“%H:%M:%S”), current_time.strftime(“%s”), to_socket_key(socket), total.value))for socket, peak in max_depth.iteritems():p.write(“{0},{1},{2},{3}\n”.format(current_time.strftime(“%H:%M:%S”), current_time.strftime(“%s”), to_socket_key(socket), peak.value))total_depth.clear()max_depth.clear()

This generates a CSV file reporting a timestamp, source socket address, target socket address, and socket-buffer depth:

07:24:32,1504333472,192.168.0.5:60852,163.172.101.56:80,290926

Monitoring message flow

The client-server architecture of the load-test means that there could be queueing in both directions (i.e., client->server or server->client), either of which could affect the load-test results. I’m going to assume that there is no queueing on the outbound path (though it would be possible to write a similar tap for the TCP send path), and focus on packet receipt.

While running the same load-test as before, I’ll run the network probe on both the client and server, and see if there is any evidence of queuing.

Image for post
Image for post

The server trace shows that as load increases over time (in terms of requests/sec), so too does the peak amount of data queued in the socket-buffer. The unit of the y-axis is bytes, so as packets are delivered, the trace reports that there is a peak of ~60kB queued in the socket-buffer awaiting processing by the server. To find out how many queued requests this is, I can do some simple arithmetic on the numbers reported by sar for aggregate network bandwidth:

06:04:54 AM IFACE rxpck/s txpck/s rxkB/s txkB/s06:04:55 AM eth0 21816.00 21345.00 18427.29 2269.00

The server was processing ~18MB/sec of inbound data (rxkB/s), while the load-test client reported performing ~60,000 writes/sec, implying that the average request size is around 300 bytes. So the figure of 60kB queued represents about 200 queued requests.

Since this is a write-heavy workload, and any response to a write will be minimal (i.e., success/failure), there should be less inbound network processing to be done on the client machine. The chart of the queued inbound traffic reflects this lower workload, as the amount of queued data barely gets above zero:

Image for post
Image for post

Taking a quick look at /proc/net/softnet_stat reveals that there were some time-squeeze events, suggesting that using dedicated CPU resource for softirq processing could help with throughput.

Wrapping up

In this post, I’ve demonstrated the use of lightweight in-kernel tracing to accurately measure the network processing capability of a program. From the results of these investigations, it is fair to say that the server was able to handle the load from the client, with only a few hundred requests awaiting processing in the kernel socket-buffer. Though on further investigation, it becomes clear that there is a reason for this.

A closer analysis of the load-test client log shows that the client limits the number of requests in-flight in order to stop from overwhelming the server:

Mode:API: JAVA_DRIVER_NATIVEConnection Style: CQL_PREPAREDCQL Version: CQL3Protocol Version: V4Max Pending Per Connection: 128

In this load-test mode, the goal is to push the server to the point where it can still just about deal with the incoming requests, but back off if the server has not responded to previous requests. This explains why the amount of queued data in the socket never spikes dramatically and catastrophically.

If the artificial limit were removed, it’s likely that server CPU usage would increase, and as the load became too great, the amount of queued data would keep increasing until TCP back-pressure made it back to the client process.

Over this series of posts, I’ve covered how to use the latest Linux tracing tools to understand and investigate application behaviour, without needing to introspect or modify the application itself. When investigating performance issues, or just trying to gain a better understanding about how a system performs under stress, these tools are immensely valuable and are worth becoming familiar with.

Linux tracing tools can be used to target almost any system resource, from instruction retirement rate and cache-misses, to disk and page-cache latencies; these posts have barely scratched the surface of what is available.

There is a wealth of information available to those interested in the interaction between software, the operating-system, and the hardware it runs on, hopefully this will serve as a useful introduction — happy investigating!

About the blogger: Mark Price is principal software consultant at, specializing in performance testing, system tuning, and low-latency development. He is a regular presence at renowned IT conferences, like, and his most recent talk can be seen in this. You can find more of his insight and follow his system performance investigations on Twitter () and at his blog,.

Please feel free to share below any comments, questions or insights about your experience with Apache Cassandra and system performance. And if you found this blog useful, consider sharing it through social media.

Linode Cube

We’re covering everything from tech news and industry…

Thanks to Keith Craig

Linode

Written by

Linode

Cloud Hosting for You. Sign up today and take control of your own server! Contact us via ticket or email for all support inquiries: https://www.linode.com/contact

Linode Cube

We’re covering everything from tech news and industry happenings to event recaps and general tips.

Linode

Written by

Linode

Cloud Hosting for You. Sign up today and take control of your own server! Contact us via ticket or email for all support inquiries: https://www.linode.com/contact

Linode Cube

We’re covering everything from tech news and industry happenings to event recaps and general tips.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface.

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox.

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store