Scaling Graphite in a cloud environment
Our tormented journey moving from a physical server to a cloud-based, highly available, and scalable setup
TL;DR we are now using the full Go Graphite stack. If you want to know more about what we do at Teads and who we are you can have a look here.
This project started when we were using a single physical server to manage all our monitoring metrics with Graphite. This bare-metal server had 2 TB of SSD storage behind a hardware RAID controller, 24 cores, and 32 GB of RAM.
At that time, our services were writing 5 million monitoring metrics to Graphite at a rate of 50,000 data points per second. Those metrics came from thousands of AWS instances across 3 regions around the world.
We mainly use Graphite to monitor our stateless autoscaling components, which represent a large number of hosts/metrics. We also use Datadog for our backend systems and CloudWatch for AWS-managed infrastructure.
On the read side, we use Grafana to produce dashboards that are displayed in the office. These dashboards are key to detect any regression during deployments & troubleshooting sessions.
This setup had two main issues:
- First, it is a single point of failure for all our service monitoring. The rest of our infrastructure is built to be highly available, and we couldn’t achieve that here.
- We were also reaching the limits of this hardware and experienced several small outages. We were anticipating a 3–4x increase in data points per second by the end of the year, and this setup would not scale.
Moving from one to many Graphite servers
We started to think about our options to have both high availability and scalability. We also wanted to move Graphite on AWS, where all our services are hosted.
Looking at possible candidates, BigGraphite quickly drew our attention. This open-source project, led by Criteo, supports Cassandra as a back-end for both data points and metric metadata, instead of Whisper, Graphite’s classical storage.
We were confident in trying BigGraphite, as the team behind it is solid. Using Cassandra was also attractive because we already run large clusters and have all the necessary tooling (mostly Rundeck jobs and Chef cookbooks).
While we steered toward BigGraphite we also assessed other options:
- Create a cluster with vanilla Graphite: we quickly dismissed this option because of potential performance limitations and the burden of managing many processes (CPython’s Global Interpreter Lock rules out effective multithreading). We didn’t want to use Whisper files either, as they would be more complex to scale than Cassandra.
- Move to Prometheus: although this project looked very promising, we didn’t want to break compatibility for our consumers. Scalability was also a concern, with the need to shard manually, split the system per region, and manage federation. Finally, Prometheus data retention is short-term, and we wanted to keep a longer retention period with downsampling.
A word on our test setup
We wanted to test real-life workloads during our experiments. We leveraged carbon-c-relay on the production side to have double writes to both production and test environments.
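For reference, a double-write setup like this fits in a few lines of carbon-c-relay configuration. This is a hypothetical sketch (hostnames and ports are made up), showing one match rule forwarding every metric to both clusters:

```
# Every incoming metric goes to both clusters, so the test
# environment receives the exact production write workload.
cluster production
    carbon_ch replication 1
        prod-store-1:2003
        prod-store-2:2003
    ;

cluster testing
    forward
        test-relay-1:2003
    ;

match *
    send to production testing
    stop
    ;
```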
I — Starting with BigGraphite
One cluster to serve them all
We first tried to store indexes and data points into one Cassandra 3.11.2 cluster.
We quickly figured this wouldn’t scale because of the workloads’ particularities. To begin with, the index workload relies on SASI indexes. This feature, contributed by Apple, appeared in Cassandra 3.4 and seemed efficient at the time. However, unlike regular Cassandra tables, SASI index queries get slower the more nodes you add.
As we kept adding nodes to handle our increasing number of data points, we observed progressive performance degradation on the index side: response times kept growing, and requests on high-cardinality data even timed out. SASI did not prove reliable for high volumes of data.
Separate indexing and data workload
Next, we tried to treat the two workloads independently. On the data workload, our benchmarks showed we could handle about 50k data points per second on a 16-core instance (c5d.4xlarge). As we wanted to handle 200k data points per second with replication, this solution would require 8 Cassandra nodes for data alone.
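The sizing above is simple arithmetic; here is the calculation spelled out (the helper name is ours, the figures are the ones measured above):

```go
package main

import "fmt"

// nodesNeeded returns how many storage nodes are required to absorb
// targetPPS data points per second, given a measured perNodePPS per
// node and a replication factor rf (rounding up).
func nodesNeeded(targetPPS, perNodePPS, rf int) int {
	return (targetPPS*rf + perNodePPS - 1) / perNodePPS
}

func main() {
	// 200k points/s, replication factor 2, 50k points/s per c5d.4xlarge
	fmt.Println(nodesNeeded(200_000, 50_000, 2)) // 8
}
```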
On the index side we faced the same limitations as before: performance decreased as we added nodes.
We tried many optimizations to make the most out of Cassandra (mostly garbage collector tuning) and Linux (sysctl tuning).
BigGraphite is developed in Python. With the standard Python interpreter (CPython) we observed high CPU usage for data points ingestion. We also tried an alternative interpreter (PyPy) that gave us better results.
However, we were concerned that BigGraphite was complex to manage, with lots of processes (no effective multithreading in Python because of the GIL). We had to compensate with configuration tooling (Chef) and also hit memory leak problems caused by the runtime (memory fragmentation).
After several significant improvement cycles on the storage and application side, we realized this was a dead end. Even on beefy c5d.4xlarge instances, our read requests often timed out on BigGraphite, although the legacy server could serve them in a few seconds. We were not able to scale the indexing workload properly, and the data workload would cost a lot of computing resources (not to mention some crashes).
Note: the project now also supports Elasticsearch for the indexing workload. But it wasn’t available at the time of evaluation.
II — Back to Whisper
We decided to stick to Whisper as we could not find a better working solution.
Moving this Whisper workload to the cloud, we still wanted to have:
- High performance
- High availability
Our business grows and so must the platform. In order to support this growth, we went for:
- Multiple storage nodes behind multiple relay nodes, behind an AWS Network Load Balancer (NLB), for writes,
- An Application Load Balancer (ALB) for reads (we needed TLS termination, which NLB didn’t support at the time).
We also thought about sharding per region, but then we wouldn’t be able to easily aggregate results across regions, and that is sometimes needed.
The Whisper workload’s biggest constraint is IOPS. On AWS EC2 we no longer had a hardware RAID controller, so we had to choose between Amazon Elastic Block Store (EBS) volumes and ephemeral instance store storage.
TL;DR: EBS does not do the job, because we need hundreds of thousands of IOPS. So we kept the c5d instances with ephemeral storage from our previous BigGraphite tests.
An instance might fail at any time, and we regularly need to update instances anyway; this should not affect the availability of the solution. To handle it, we implemented deterministic replication logic at the relay level, but more on that later.
III — The way to Go
The original Graphite project is written in Python. This is not ideal in a large scale environment because of performance and process management issues.
We looked at other implementations and found interesting ones written in Go. With Go we only run one process per service type and, compared to Python, CPU consumption is really low:
- carbon-relay-ng: a fast relay for carbon streams,
- go-carbon: a Go implementation of the Graphite/Carbon server, responsible for receiving metrics and writing them to a storage backend; it also provides an API for reading metrics from a remote server,
- carbonapi: responds to Grafana read requests,
- carbonzipper: enables read high availability by reading from all backends and merging the results.
It’s worth mentioning that Go is the official language of the infrastructure team at Teads; this helped us a lot when we needed to patch some of these components.
The big picture
This is a representation of the physical layout of our testing setup. The red path is for writes and the green one is for reads.
We also added a simple support for deterministic replication where points for a given server are replicated to the next server in the list (1->2,…,4->1).
Reads go through AWS ALB that dispatches them to API servers. Each API server knows its local Zipper. Each Zipper knows about all Carbon storage servers.
Carbon-relay-ng does not support replication, and that’s needed for high availability. Some other relay implementations support replication (carbon-c-relay, carbon-relay), but the replica servers cannot be chosen from configuration: a replica is simply the next not-yet-used node in the hash ring(!). We implemented deterministic replication logic so we can make sure all replicas land in different availability zones.
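A minimal sketch of that deterministic scheme, assuming the storage node list is ordered so that consecutive entries sit in different availability zones (node names here are hypothetical):

```go
package main

import "fmt"

// replicaOf returns the index of the node holding the replica of node i:
// simply the next node in the configured list, wrapping around
// (1->2, ..., 4->1).
func replicaOf(i, n int) int {
	return (i + 1) % n
}

func main() {
	// List ordered to alternate availability zones (a/b suffixes),
	// so a node and its replica never share an AZ.
	nodes := []string{"store-1a", "store-2b", "store-3a", "store-4b"}
	for i, name := range nodes {
		fmt.Printf("%s -> %s\n", name, nodes[replicaOf(i, len(nodes))])
	}
}
```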
Modifying carbon-relay-ng, again
The previous platform illustration was about the region storing data points. Other regions have local relays and forward writes to the main relays, in the main region. We did this initially because NLB endpoints could not be reached through inter-region VPC peering.
We use a single DNS service name and rely on Route53 Latency Based Routing to route clients to the right regional endpoint.
Relay targets are not just the main region’s NLB: that would leave a single TCP connection between the remote and main regions, and one of the main relays would take all the remote traffic. Instead, we target a subset of the main-region relays so we have multiple connections, spread using the carbon_ch algorithm.
Finally we modified carbon-relay-ng again so it can failover when a remote connection is down. In that situation, we don’t really care about which main relay a point goes to, but we make sure all points go to at least one relay.
The go-graphite setup successfully scaled during our seasonal peak period and handled 400k data points per second using a cluster of 14 c5d.4xlarge instances.
We use a replication factor of 2, that is, the cluster is able to write 800k points per second spread over tens of millions of series. By comparison, our legacy bare metal almost died at 50k points per second.
The primary bottleneck of this setup is IOPS. Carbon/Whisper does not try to batch writes. Also, downsampling requires reading on the write path.
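To illustrate why downsampling hits the write path, here is what a typical Whisper retention policy looks like (these retentions are illustrative, not our actual configuration). When a point is written, Whisper propagates it to the coarser archives, which means reading higher-resolution points back from disk to compute each aggregated bucket:

```
# storage-schemas.conf: 10s resolution for 1 day, downsampled to
# 1 min for 30 days, then 10 min for 2 years.
[default]
pattern = .*
retentions = 10s:1d,1m:30d,10m:2y

# storage-aggregation.conf: how coarser points are computed.
[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```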
Living without a distributed datastore
Even with deterministic replication, we still need to add nodes and replace failed ones. When a node goes down, we lose all the data it held, because the ephemeral storage is local. Replication alone is not sufficient: a replica can also go down later, and we would lose history.
When a node is replaced, we recover the data it should hold using our replication logic: each node replicates to the next node in the configuration, so the next node holds a copy of its data, and each node holds a copy of the previous node’s data.
We handle this with some shell scripting around carbonate, the official Graphite tool suite developed in Python to manipulate Whisper files and manage failures. Using carbonate, we can filter the data we need from the node neighbors.
At the moment, those scripts are launched:
- automatically when a node reboots, ephemeral storage keeps the data if the instance is just rebooted, but we need to fill holes in data,
- manually when we replace an instance.
When we scale the cluster, we use scripts around carbonate to rebalance the metrics/history following our replication logic.
We encountered a weird bug in carbon-relay-ng: it sometimes picked a different node than carbonate when multiple nodes collide in the hash ring. We patched it internally and still have an open issue for this upstream.
This is a serious issue when you use carbonate to decide what to copy or delete on nodes: you may end up missing series or deleting active ones.
One limit we noticed in all Graphite implementations is the lack of data bucketing per time slice (the Carbon search API supports it, but Carbon servers do not implement it).
For example, if we build a request to get the last 4 hours of data for production.service.*, Graphite will look into every directory under production/service/*, read everything and return all results matching the request.
In a high-churn environment, it means you often filter out a large number of series because they only hold past data. This issue and possible solutions are explained in this article: Prometheus 2.0: New storage layer.
Many thanks to my team and especially Tristan Spadi and Romain Hardouin for their help on this project.