How we scaled Graphite to 100,000 writes per second.

Srinivas Singanamalla
Walmart Global Tech Blog
9 min readMar 30, 2020

--

Photo by Luke Chesser on Unsplash

Introduction to Graphite

There are numerous monitoring tools available today, which includes Graphite, Prometheus and, Datadog.

While, interest in Prometheus and similar cloud offerings have picked up in the last five years, Graphite is still among the popular choices. (Source: https://graphiteapp.org/#caseStudies)

This blog talks about, how we vertically scaled our compute to achieve 100,000 writes per second. Before we delve into the details, let’s recap a bit about Graphite architecture.

Graphite Architecture

Graphite consists of 3 components.

  • carbon daemons
  • whisper database
  • graphite-web

The carbon daemons is a fast service, which listens for the time series data and either delegates to other carbon daemons or writes to the whisper storage. The whisper storage consists of the time series formatted .wsp files for each metric.

The following representation shows two ways to write to a graphite server: one without the carbon-relay and the other with the carbon-relay. We have used both approaches in our tests.

Writes: Introduce carbon-relay to scale the writes

In the first approach, without the carbon-relay, the application sends metrics in text format to port 2003, or in pickled format to port 2004. The pickled format is more efficient, as it sends the metrics in batches. The carbon-cache daemon, which is designed for high performance writes, is listening to these time series values, then writes to the whisper storage.

In the second approach, you introduce the carbon-relay daemon(s) in front of the carbon-cache daemons for higher performance. The carbon-relay act as metric forwarders to one or more carbon-cache daemons, which in turn writes to the whisper storage. You could use a round robin mechanism or a consistent hashing mechanism to forward the metrics, from the relay to the carbon-cache(s).

We have used the round robin strategy in our implementation.

During READS, the graphite-web comes into play. The api calls the carbon-cache daemon, which queries the whisper database.

Carbon-relay is not used during reads.

Reads: Relay is not used during reads

The graphite-web encompasses a user interface and a web api for rendering graphs. The graphite-web user interface could be replaced by any other third party providers like Grafana, which we have been using in our projects.

While, it is important to do the load tests of reads and writes together, we only focussed on writes, as writes far exceeds the reads in a monitoring application.

Motivation

Our current applications sends around 100K metrics per minute, which boils down to 1500 metrics per second. All our apps are configured to send metrics at one minute intervals. As we create more applications, our needs would increase. With the intention to scale, I dabbled across blog posts on Graphite, containing examples of infrastructure supporting 1 million metrics per second. We wanted to see, if we could ever scale our existing infrastructure to 1 million metrics per second. Would it even be achievable. With that lofty goal in mind, we started our journey to test the limits of our single node system.

HDD or SSD

Since carbon is an I/O intensive service, we wanted to bring in Solid State Drives (SSD) for our load tests. Because of budget constraints, we decided to hammer our existing Hard Disk Drives (HDD) with metrics.

The biggest ephemeral compute available to us, was an azure 8 core, Standard_D4_v2 with 28 GB RAM. We also decided to attach a 2 TB block storage volume. Block Storage is like a USB storage, and the data is durable to compute replacements. The block storage is plugged back in, once an ephemeral compute is replaced.

Before we started the load tests, we wanted to look at the azure benchmarked IOPS and throughput for the disk. The benchmarked IOPS for our block volume is 500 and the throughput is 60 MB/s. Even though, using SSD was out of the picture, we have put the values for comparison.

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/disks-types

Load Test Setup

In monitoring applications, Writes would always be way higher than the Reads, so we have focussed on Writes in our load tests. We used the haggar script (https://github.com/gorsuch/haggar) to generate the load for our graphite environment. The haggar script was installed on multiple client machines to generate the load and send it to our monitoring system.

Load Test Results

Single Carbon-cache

Our first set of tests was using a single carbon-cache daemon with no carbon-relay.

After a series of tests with this HDD disk configuration and tweaking graphite settings, we peaked at 2.4 million metrics writes per minute, (40,000 metrics per second). Our disk throughput was only 1 Mbps. Our disk IOPS were around 200 (benchmark is 500). Based on the benchmark data, there was definite room for improvement.

Writing 40,000 metrics per second. Using only single carbon-cache. CPU was > 90%

A single carbon-cache was able to provide 40,000 metrics per second, at the cost of high CPU (> 90%). At high CPU, the reads would suffer, so it is prudent to keep this within acceptable range, which I think should be below 65%.

Single Carbon-relay with 4 Carbon-caches

With the hope of relieving some pressure off of the carbon-cache, we increased the number of carbon-cache daemons by introducing a carbon-relay daemon.

Introduced carbon relays for multiple carbon caches

With this configuration, the carbon-cache CPU drastically reduced to 20%, however, it seemed to have transferred the work to the carbon-relay.

During reads, the carbon-relay is not used, so a configuration, where you relieve pressure off of carbon-cache would be a preferred choice.

Carbon-cache CPU is transferred to carbon-relays

As more carbon-cache daemons were writing metrics, we were able to triple the throughput to 3–4 Mbps at 570 IOPS. However, we couldn’t go beyond 40,000 metrics per second.

Max IOPS reached

Increasing IOPS and Throughput

In order to push higher writes per second, on a single machine, we had to increase the throughput and IOPS. One of the ways to achieve this was to cut the block storage into multiple slices, which created 7 additional partitions. It is as if, you are adding more disks, hence, increasing the IOPS and throughput.

Partitioned the Block Storage into 8 slices

You would see the corresponding increase in the writes volume, in the load tests below. With this additional change, we reran some of the earlier load tests.

As expected, the throughput jumped from 3 MBps to 15 MBps, for the same set of previously ran tests. Correspondingly, the IOPS increased from 500 to 3000 as well.

Slicing the disk helped in increasing the Throughput
Slicing the disk helped in increasing the IOPS

But we couldn’t go beyond the 40,000 metrics per second (2.4 million metrics per minute). Our carbon relay had reached the max capacity. Even though, we found a way to increase the disk IOPS and throughput, we were not writing it fast enough. The carbon-relay’s cpu was touching > 90%.

The next option was to alleviate carbon-relay’ work by introducing more of these relays. We needed some kind of load balancer and we chose HAProxy.

Multiple Carbon-relays with HAProxy

Introduced HAProxy

By adding more relays and carbon-caches in front of HAProxy, and tweaking different graphite settings, we were able to increase the writes per second.

By using 4 carbon-relay daemons and 8 carbon-cache daemons, we were able to hit 100000 metrics per second or 6 million metrics per minute.

Reached 100,000 metrics per second with HAProxy

The relay CPU and carbon cache CPU were around 65%, which just meant, that we were utilizing both of them efficiently.

Discussion about Graphite Configuration

Different knobs were tweaked to get the write performance. It would be worthwhile to discuss few of them.

To get the maximum benefit of using a multicore machine, adding more carbon-cache daemons, certainly helps in the performance.

The cpu for carbon-cache is reduced at the expense of carbon-relay cpu. Since carbon-relay is not used during READS, alleviating the load of carbon-cache is important.

MAX_UPATES_PER_SECOND

Capping this value below the writes per second volume, helps in the performance. The graphite batches the request by writing more data points per update and thus helps in better throughput. The default value of 500 turns out to be good enough.

MAX_CREATES_PER_MINUTE

This setting limits the number of files created per minute. We removed the limit, during our load tests. Creates are slower than updates. We didn’t want to have any restrictions on this value, as we wanted to gauge the maximum it can go.

Whisper files are fixed sized, which stores data to a second precision. This means that once the file is created, the size doesn’t change, as more metrics trickle in. When the file gets created, it has all the slots already configured, and hence, the metrics gets filled in faster. So in essence, the updates are always much faster than creates.

Cache Size and Queues

When the carbon-cache is not able to write to the disk faster, it starts storing the metrics in memory, thus increasing the usage and putting more pressure on the CPU. We saw this consistently, as we increased our load. However, the size of the cache was stable through out the load test run.

At 100,000 writes per second, the cache numbers were pretty high, but stable. Since our compute had a RAM of 28 Gb, we had the luxury of keeping 800,000 carbon-cache queues in memory. One thing to keep in mind though is, if the graphite server goes down, you could lose data stored in the memory.

At higher loads, the carbon cache queues were large

I believe these cache numbers would be smaller in case of an SSD, where you would be able to write faster than an HDD.

As we near the end of this write up, I would like to take note that, even though, we were targeting 1 million metrics per second, 100,000 writes per second by itself, is a very large number. To give a perspective, if your application is sending 1000 metrics per minute, you could potentially support the writes of 6000 applications.

Conclusion

Graphite is a very high performing metrics monitoring tool. It provides multiple knobs, that could be tweaked, to achieve the desired results. Slicing the disk helped us in increasing the IOPS, beyond the benchmarked disk values.

In order to scale higher, there are couple of other options to try out.

  • Horizontally add machines with the help of additional relays and carbon-caches
  • Evaluate Solid State Drives instead of the Hard Disk Drives, which has much higher IOPS and throughput
  • Evaluate replacing the default carbon with go-carbon, which seems to have better performance: https://github.com/lomik/go-carbon

References

Acknowledgements

I would like to thank Jay Patel for helping us with the oneops pack for the graphite. I would like to express my appreciation to Greg Favinger, Jyotiswarup Pai Raiturkar and Mike Barlotta for providing feedback to the article.

--

--