Scaling Graphite in a cloud environment

Our tormented journey moving from a physical server to a cloud based, highly available and scalable setup

Vincent Miszczak
Jul 9, 2019

TL;DR: we are now running the full go-graphite stack. If you want to know more about what we do at Teads and who we are, you can have a look here.

Initial setup

At that time, our services were writing 5 million monitoring metrics to Graphite at a rate of 50,000 data points per second. Those metrics came from thousands of AWS instances across 3 regions around the world.

We mainly use Graphite to monitor our stateless autoscaling components, which represent a large number of hosts and metrics. We also use Datadog for our backend systems and CloudWatch for AWS managed infrastructure.

On the read side, we use Grafana to produce dashboards that are displayed in the office. These dashboards are key to detecting regressions during deployments and troubleshooting sessions.

This setup had two main issues:

  • First, it is a single point of failure for all our service monitoring. The rest of our infrastructure is built to be highly available and we couldn't achieve that here.
  • We were also reaching the limits of this hardware and experienced several small outages. We were anticipating a 3-4x increase in data points per second by the end of the year, and this setup would not scale.

Moving from one to many Graphite servers

Looking at possible candidates, BigGraphite quickly drew our attention. This open source project, led by Criteo, supports Cassandra as a back end for both data points and metrics metadata, instead of Whisper, Graphite's classical storage.

We felt confident trying BigGraphite because the team behind it is solid. Using Cassandra was also attractive because we already run large clusters and have all the necessary tooling (mostly Rundeck jobs and Chef cookbooks).

While we steered toward BigGraphite we also assessed other options:

  • Create a cluster with vanilla Graphite: We quickly dismissed this option because of potential performance limitations and the burden of managing many processes (Python services are effectively single-threaded, so you need one process per core). We didn't want to use Whisper files either, as they would be more complex to scale compared to Cassandra.
  • Move to Prometheus: Although this project looked very promising, we didn't want to break compatibility for our consumers. Scalability was also a concern, with the need to shard manually, split the system per region and manage federation. Finally, data retention with Prometheus is short-term, and we wanted to keep a longer retention period with downsampling.

A word on our test setup

I — Starting with BigGraphite

One cluster to serve them all

We started with a single Cassandra cluster handling both workloads: data points and metrics metadata (the index). We quickly figured this wouldn't scale because the two workloads behave very differently. To begin with, the index workload relies on SASI indexes. This feature, contributed by Apple, appeared in Cassandra 3.4 and seemed efficient at the time. However, unlike queries on regular Cassandra tables, SASI index queries get slower as you add nodes.

As we kept adding nodes to handle our growing number of data points, we progressively observed performance degradation on the index side: response times increased and requests on high-cardinality data even timed out. SASI indexing did not prove reliable for high volumes of data.

Separating the indexing and data workloads

After separating the two workloads, we still faced the same limitations on the index side: performance decreased as we added nodes.

We tried many optimizations to make the most out of Cassandra (mostly garbage collector tuning) and Linux (sysctl tuning).

BigGraphite is developed in Python. With the standard Python interpreter (CPython), we observed high CPU usage during data point ingestion. We then tried an alternative interpreter (PyPy), which gave us better results.

However, we were concerned that BigGraphite was complex to manage, with lots of processes (no real multithreading in Python). We had to compensate with configuration tooling (Chef) and also hit memory leak problems related to the interpreter's memory management (fragmentation).

CPU usage with CPython and then with PyPy (from 13:45 on the chart)

After several significant improvement cycles on the storage and application side, we realized this was a dead end. Even on beefy c5d.4xlarge instances, our read requests often timed out on BigGraphite although they could be served in a few seconds by the legacy server. We were not able to scale the indexing workload properly, and the data workload would cost a lot of computing resources (not to mention some crashes).

Note: the project now also supports Elasticsearch for the indexing workload, but this wasn't available at the time of our evaluation.

II — Back to Whisper

Moving this Whisper workload to the cloud, we still wanted to have:

  • Scalability
  • High performance
  • High availability

Scalability

  • Multiple storage nodes behind multiple relay nodes behind AWS Network Load Balancer (NLB) for writes,
  • An Application Load Balancer for reads (we needed TLS, which Network Load Balancers did not support at the time).

We also thought about sharding per region, but then we wouldn't be able to easily aggregate results across regions, which is sometimes needed.

High performance

TL;DR: EBS does not do the job because we need hundreds of thousands of IOPS. So we kept the c5d instances with local ephemeral storage from our previous BigGraphite tests.

High availability

III — The way to Go

We looked at other implementations and found Go ones that looked interesting. With Go there is only one process per service type and, compared to Python, CPU consumption is really low.

More specifically, we looked at the go-graphite stack. Parts of it were originally open sourced by Booking.com:

  • carbon-relay-ng: a fast relay for carbon streams, written in Go,
  • go-carbon: a Go implementation of the Graphite/Carbon server, responsible for receiving metrics and writing them to a storage backend. It also provides an API for reading metrics remotely,
  • carbonapi: responds to Grafana read requests,
  • carbonzipper: provides read high availability by reading from all backends and merging the results.

It's worth mentioning that Go is the official language of the infrastructure team at Teads; being able to patch some of these components ourselves helped us a lot.

The big picture


This is a representation of the physical layout of our testing setup. The red path is for writes and the green one is for reads.

Writes go through an AWS NLB that dispatches them to the relays. Each relay knows all the Carbon storage servers and dispatches data points using the carbon_ch consistent hashing algorithm.
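To give an idea of what this dispatching looks like, here is a minimal Go sketch of a consistent hash ring in the spirit of carbon_ch. It is not the actual carbon_ch implementation (the real one places each server on the ring using its own server/instance key layout), just the general mechanism: hash the metric name, then walk the ring to the first server position at or after that hash.

```go
// Simplified sketch of relay-side server selection with a consistent hash ring.
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
)

type ring struct {
	positions []uint16          // sorted ring positions
	owner     map[uint16]string // position -> server
}

func position(key string) uint16 {
	sum := md5.Sum([]byte(key))
	return binary.BigEndian.Uint16(sum[:2]) // carbon uses the first two MD5 bytes
}

func newRing(servers []string) *ring {
	r := &ring{owner: map[uint16]string{}}
	for _, s := range servers {
		p := position(s)
		r.owner[p] = s
		r.positions = append(r.positions, p)
	}
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
	return r
}

// pick returns the server owning the first ring position at or after the metric's hash.
func (r *ring) pick(metric string) string {
	p := position(metric)
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= p })
	if i == len(r.positions) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.positions[i]]
}

func main() {
	r := newRing([]string{"carbon-1", "carbon-2", "carbon-3", "carbon-4"})
	fmt.Println(r.pick("production.service.api.requests.count"))
}
```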

We also added simple support for deterministic replication, where points for a given server are replicated to the next server in the list (1->2, …, 4->1).

Reads go through an AWS ALB that dispatches them to the API servers. Each API server knows its local Zipper, and each Zipper knows about all the Carbon storage servers.
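As an illustration of what the Zipper brings for read availability, here is a simplified Go sketch of merging the same series fetched from two replicas: when one replica has a gap (for instance because it was rebooting), the other replica's value fills it. This is only the idea; carbonzipper's actual merging logic is more involved.

```go
// Simplified sketch of zipper-style merging of one series fetched from two replicas.
package main

import "fmt"

// merge combines two aligned arrays of points for the same series,
// where nil means "no data" on that replica for that timestamp.
func merge(a, b []*float64) []*float64 {
	out := make([]*float64, len(a))
	for i := range a {
		if a[i] != nil {
			out[i] = a[i]
		} else {
			out[i] = b[i]
		}
	}
	return out
}

func f(v float64) *float64 { return &v }

func main() {
	replicaA := []*float64{f(1), nil, f(3), nil} // gap while this node was rebooting
	replicaB := []*float64{f(1), f(2), nil, nil}
	for _, p := range merge(replicaA, replicaB) {
		if p == nil {
			fmt.Print("nil ")
		} else {
			fmt.Print(*p, " ")
		}
	}
	fmt.Println()
}
```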

Modifying carbon-relay-ng

Out of the box, carbon-relay-ng gives no control over where replicas are placed: they are picked randomly as the next not-already-used node in the hash ring(!). We implemented a deterministic replication logic so we can make sure all replicas end up in different availability zones.
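Below is a minimal sketch of that logic, under the assumption that the configured server list alternates availability zones: the replica for a point is simply the next server in the configured order (1->2, …, 4->1), so the two copies always land in different AZs. This illustrates the idea, not the actual carbon-relay-ng patch.

```go
// Minimal sketch of deterministic next-in-list replication.
package main

import "fmt"

// replicas returns the primary plus the next (replicationFactor-1) servers in
// the configured order. If the list alternates availability zones
// (az-a, az-b, az-a, az-b, ...), consecutive servers are always in different
// AZs, so every point has a copy in at least two AZs.
func replicas(servers []string, primary int, replicationFactor int) []string {
	out := make([]string, 0, replicationFactor)
	for i := 0; i < replicationFactor && i < len(servers); i++ {
		out = append(out, servers[(primary+i)%len(servers)])
	}
	return out
}

func main() {
	servers := []string{"carbon-1a", "carbon-2b", "carbon-3a", "carbon-4b"} // az-a / az-b alternating
	fmt.Println(replicas(servers, 3, 2)) // [carbon-4b carbon-1a]
}
```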

Modifying carbon-relay-ng, again


We use a single DNS service name and rely on Route53 Latency Based Routing to route clients to the closest regional endpoint.

The remote relays do not simply target the main region's NLB: we would end up with a single TCP connection between the remote region and the main one, and a single main relay would take all the remote traffic. Instead, we target a subset of the main region relays so we have multiple connections, spread using the carbon_ch algorithm.

Finally, we modified carbon-relay-ng again so it can fail over when a remote connection goes down. In that situation we don't really care which main relay a point goes to, as long as every point reaches at least one relay.
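As a rough illustration of that behaviour, the sketch below tries the relay chosen by hashing first and falls back to the other relays if the connection cannot be opened. The relay addresses are hypothetical and this is not the actual carbon-relay-ng change.

```go
// Rough sketch of cross-region failover: prefer the hashed relay, fall back to any other.
package main

import (
	"fmt"
	"net"
	"time"
)

// sendPoint writes one carbon line ("metric value timestamp\n") to the preferred
// relay, falling back to the other relays if that connection cannot be opened.
func sendPoint(relays []string, preferred int, line string) error {
	for i := 0; i < len(relays); i++ {
		addr := relays[(preferred+i)%len(relays)]
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			continue // relay unreachable, try the next one
		}
		_, werr := fmt.Fprint(conn, line)
		conn.Close()
		return werr
	}
	return fmt.Errorf("no relay reachable")
}

func main() {
	relays := []string{"relay-1:2003", "relay-2:2003", "relay-3:2003"} // hypothetical addresses
	err := sendPoint(relays, 1, "production.service.api.requests.count 42 1562630400\n")
	fmt.Println(err)
}
```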

Results

We use a replication factor of 2. The cluster is able to write 800k data points per second, spread over tens of millions of series. In comparison, our legacy bare-metal server almost died at 50k points per second.

The primary bottleneck of this setup is IOPS: Carbon/Whisper does not try to batch writes, and downsampling requires reads on the write path.
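The downsampling part is worth illustrating: a coarse archive stores an aggregate of a whole high-resolution bucket, so writing a single point means re-reading that bucket and recomputing the aggregate. The sketch below is a deliberately simplified, in-memory version of this Whisper-style propagation, not go-carbon's actual code.

```go
// Simplified illustration of why downsampling reads on the write path.
package main

import "fmt"

type archive struct {
	step   int64             // seconds per point
	points map[int64]float64 // timestamp (aligned to step) -> value
}

// write stores a point in the high-resolution archive, then propagates it
// to the low-resolution one by re-reading the whole covering bucket.
func write(high, low *archive, ts int64, value float64) {
	high.points[ts-ts%high.step] = value

	// Propagation: read back every high-res point in the low-res bucket...
	bucket := ts - ts%low.step
	var sum float64
	var count int
	for t := bucket; t < bucket+low.step; t += high.step {
		if v, ok := high.points[t]; ok {
			sum += v
			count++
		}
	}
	// ...then write the recomputed aggregate (average, like Whisper's default).
	if count > 0 {
		low.points[bucket] = sum / float64(count)
	}
}

func main() {
	high := &archive{step: 10, points: map[int64]float64{}}
	low := &archive{step: 60, points: map[int64]float64{}}
	for i, v := range []float64{1, 2, 3} {
		write(high, low, int64(i*10), v)
	}
	fmt.Println(low.points[0]) // 2: the average of the three 10s points
}
```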

Living without a distributed datastore

When a node is replaced, we grab the data it should have using our replication logic: each node's points are replicated to the next node in the configuration, so the next node holds a copy of its data, and it in turn holds a copy of the previous node's data.

Illustration with a replication factor of 2

We handle this with some shell scripting around carbonate, the official Graphite tool suite, developed in Python, for manipulating Whisper files and managing failures. Using carbonate, we can filter out the data we need from a node's neighbors.

At the moment, those scripts are launched:

  • automatically when a node reboots: ephemeral storage keeps the data across a simple reboot, but we need to fill the holes that appeared while the node was down,
  • manually when we replace an instance.

When we scale the cluster, we use scripts around carbonate to rebalance the metrics/history following our replication logic.

We encountered a weird bug where carbon-relay-ng sometimes picks a different node than carbonate when multiple nodes collide in the hash ring. We patched it internally and still have an open issue for it upstream.

This is a serious issue when you use carbonate to decide what to copy or delete on nodes: you may end up missing series or deleting active ones.


Further thoughts

One remaining pain point is the way Graphite resolves wildcard queries on top of Whisper files. For example, if we build a request to get the last 4 hours of data for production.service.*, Graphite will look into every directory under production/service/, read everything and return all the results matching the request.

In a high-churn environment, this means you often read and then filter out a large number of series that only hold past data. This issue and possible solutions are explained in this article: Prometheus 2.0: New storage layer.
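To make this concrete, here is a small Go sketch of what a wildcard query translates to on a Whisper file tree: each series is a .wsp file, so the target becomes a filesystem glob and every matching file has to be opened for the requested time range, whether or not the series still receives data. The paths are hypothetical.

```go
// Small illustration of wildcard query expansion over a Whisper file tree.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

func main() {
	dataDir := "/var/lib/graphite/whisper" // hypothetical data directory
	target := "production.service.*"

	// Graphite translates dots into directory separators and expands the wildcard.
	pattern := filepath.Join(dataDir, filepath.FromSlash(strings.ReplaceAll(target, ".", "/"))+".wsp")
	matches, err := filepath.Glob(pattern)
	if err != nil {
		panic(err)
	}
	// Every match is a Whisper file that must be read for the requested time range,
	// including series that only hold old data (high churn makes this expensive).
	for _, m := range matches {
		fmt.Println(m)
	}
}
```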

Many thanks to my team and especially Tristan Spadi and Romain Hardouin for their help on this project.
