Metrics Storage: How We Migrated from Graphite+Whisper to Graphite+ClickHouse
Hi all! In an earlier post on the OLX Group Engineering blog, I wrote about organizing a modular monitoring system for a microservice architecture. Nothing stands still: our project keeps growing, and so does the list of stored metrics. In this post, I will describe how we migrated from Graphite+Whisper to Graphite+ClickHouse under a high workload, what we expected, and how the migration turned out.
Before I go into the details of how we organized the migration from storing the metrics in Graphite+Whisper to Graphite+ClickHouse, I want to give you some background information about the reasons for this decision and the disadvantages of Whisper that we had to put up with for a long time.
The issues of Graphite+Whisper
1. High load on the disk subsystem
At the time of the migration, we were receiving approximately 1.5 million metrics per minute. At that rate, disk utilization on our servers was ~30%. This was quite acceptable: the system was stable, and write and read speeds were high enough... until the day one of the development teams rolled out a new feature and started sending 10 million metrics per minute. The disk subsystem was saturated, and we saw 100% utilization. The problem was quickly resolved, but the aftertaste remained.
2. Absence of replication and consistency
Like most people who use or have used Graphite+Whisper, we routed identical metric flows to several Graphite servers to achieve resilience. This did not cause any particular problems until one of the servers crashed for some reason. Sometimes we managed to restart the crashed server fast enough for carbon-c-relay to restore the metrics from its cache, and sometimes we did not. In the latter case, there was a gap in the metrics, and we patched it with rsync. The procedure was time-consuming; fortunately, it did not happen often. We also periodically took a random sample of metrics and compared them with the same metrics on neighboring nodes of the cluster. In approximately 5% of cases, several values differed, and we were not too happy about it.
3. Large amount of used space
Since we store both infrastructure and business metrics in Graphite (and now Kubernetes metrics as well), we often find ourselves in a situation where a metric contains only a few values, yet its .wsp file is created for the entire retention period and occupies the entire pre-allocated space, ~2 MB in our case. The problem is aggravated by the fact that many such files accumulate over time, and it takes a lot of time and resources to scan the empty datapoints when generating reports.
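To make the pre-allocation concrete, here is a rough sketch of how the size of a .wsp file follows from its retention schema alone. The byte sizes reflect the Whisper on-disk layout as I understand it; the retention schema in the example is purely hypothetical and is not our actual configuration.

```python
# Whisper on-disk layout: a small fixed header plus 12 bytes per datapoint slot.
METADATA_BYTES = 16       # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_BYTES = 12   # offset, seconds per point, number of points
POINT_BYTES = 12          # 4-byte timestamp + 8-byte double value


def whisper_file_size(archives):
    """Return the size in bytes of a .wsp file.

    `archives` is a list of (seconds_per_point, retention_seconds) pairs.
    The size depends only on the schema, never on how many points were written.
    """
    total_points = sum(retention // step for step, retention in archives)
    return (METADATA_BYTES
            + ARCHIVE_INFO_BYTES * len(archives)
            + POINT_BYTES * total_points)


DAY = 86400
# Hypothetical schema: 60 s resolution for 30 days, then 5 min for 1 year.
example_schema = [(60, 30 * DAY), (300, 365 * DAY)]
size = whisper_file_size(example_schema)
print(f"{size} bytes (~{size / 1024 / 1024:.2f} MiB) per metric")
```

With a schema of this kind, a metric that received three values takes exactly as much disk space as one that was written every minute for a year.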
I would like to point out that the problems described above can be dealt with using different methods with varying degrees of effectiveness, but the more data you receive, the more aggravated the problems become.
Considering all of the above (and remembering the earlier post), as well as the steadily growing number of metrics received and the desire to switch all metrics to a storage interval of 30 seconds (if necessary — up to 10 seconds), we decided to try Graphite+ClickHouse as a promising alternative to Whisper.
After visiting a few Yandex meetups, after reading a couple of posts on Habr.com, after studying the documentation and finding the appropriate components for a ClickHouse setup in Graphite, we decided to act.
This is what we wanted to achieve:
• to reduce the disk subsystem utilization from 30% to 5%,
• to reduce the amount of space used from 1 TB to 100 GB,
• to be able to receive 100 million metrics per minute at the server,
• data replication and resilience out of the box,
• to make this project manageable and make the transition within a reasonable time,
• to make the transition without downtime.
Ambitious enough, right?
For receiving the data using the Graphite protocol and then writing it to ClickHouse, we chose carbon-clickhouse (golang).
For time series storage, we initially chose the most recent stable ClickHouse release at the time, 1.1.54253. We encountered some problems with it: the logs were full of errors, and it was not entirely clear what to do about them. Together with Roman Lomonosov (the author of carbon-clickhouse, graphite-clickhouse, and many other things), we settled on the older release 1.1.54236. The errors were gone, and everything ran smoothly.
To read data from ClickHouse, we chose graphite-clickhouse (golang), and as the API for Graphite, carbonapi (golang). To organize replication between the ClickHouse tables, we used ZooKeeper. For the routing of metrics, we kept our favorite carbon-c-relay (see the earlier post).
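For readers unfamiliar with the write path, the sketch below shows what a client has to produce for a Graphite-protocol receiver such as carbon-clickhouse: one plaintext line per datapoint in the form "path value timestamp". The metric name, host, and port here are assumptions for illustration (2003 is the conventional Graphite plaintext port), not our production settings.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one datapoint as a Graphite plaintext protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="127.0.0.1", port=2003):
    """Send already formatted lines over TCP.

    The default host and port are assumptions; point them at your own receiver.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name, fixed timestamp for reproducibility.
line = format_metric("servers.web01.cpu.user", 12.5, 1500000000)
print(line, end="")
```

Because the wire format is the same as for a classic carbon daemon, the senders did not have to change when the storage behind the receiver did.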
Graphite+ClickHouse. Table structure
“graphite” is the database we created for the monitoring tables.
“graphite.metrics” is a table with the engine ReplicatedReplacingMergeTree (the replicated variant of ReplacingMergeTree). This table stores metric names and paths.
“graphite.data” is a table with the engine ReplicatedGraphiteMergeTree (the replicated variant of GraphiteMergeTree). This table stores the metric values.
“graphite.date_metrics” is a table that is populated conditionally, with the engine ReplicatedReplacingMergeTree. This table records the names of all the metrics encountered during the day. Reasons for creating it are described in the section “Issues” further down in this post.
“graphite.data_stat” is a table that is populated conditionally, with the engine ReplicatedAggregatingMergeTree (the replicated variant of AggregatingMergeTree). This table records the number of incoming metrics, grouped by the first four levels of the metric path.
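To illustrate what "broken down to nesting level 4" means, here is a small in-memory sketch that groups incoming metric paths by their first four dot-separated components and counts them. It only models the idea; it is not how the data_stat table is actually populated, and the metric names are made up.

```python
from collections import Counter


def prefix(path, level=4):
    """Return the first `level` dot-separated components of a metric path."""
    return ".".join(path.split(".")[:level])


def count_by_prefix(paths, level=4):
    """Count incoming metrics per path prefix of the given nesting level."""
    return Counter(prefix(p, level) for p in paths)


# Hypothetical incoming metric names.
incoming = [
    "prod.web.api.requests.get.count",
    "prod.web.api.requests.post.count",
    "prod.web.api.latency.p99",
    "prod.db.pg",
]
for name, count in sorted(count_by_prefix(incoming).items()):
    print(name, count)
```

A breakdown like this makes it easy to see which service or team is responsible for a sudden jump in the number of incoming metrics.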
Graphite+ClickHouse. Component interaction
Graphite+ClickHouse. Data migration
As you may remember from the expectations for this project, the transition to ClickHouse had to happen without downtime; accordingly, we had to migrate our entire monitoring system to the new storage as transparently for our users as possible.
This is how we did this.
• In carbon-c-relay, a rule was added to send an additional flow of metrics to carbon-clickhouse of one of the servers participating in the replication of the ClickHouse tables.
• We wrote a small Python script that, using the whisper-dump library, read all the .wsp files from our repository and sent the data in 24 threads to the carbon-clickhouse instance described above. The number of metrics received by carbon-clickhouse reached 125 million/min, and ClickHouse handled this easily.
• We created a separate DataSource in Grafana to debug the functions used in existing dashboards. We put together a list of functions that we used but that were not implemented in carbonapi. We implemented these functions and sent the PRs to the authors of carbonapi (they deserve special thanks).
• To switch the reading load, the endpoints in the balancer settings were reconfigured from graphite-api (API for Graphite+Whisper) to carbonapi.
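The backfill step from the list above can be sketched roughly as follows. This is not our original script: `read_points` and `send_lines` are hypothetical callables that you would supply yourself (for example, a reader built on the whisper Python package and a TCP sender for the Graphite plaintext protocol), and the directory layout is assumed to mirror metric names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def wsp_path_to_metric(root, wsp_path):
    """Turn <root>/servers/web01/cpu.wsp into the metric name servers.web01.cpu."""
    rel = os.path.relpath(wsp_path, root)
    name, _ext = os.path.splitext(rel)
    return name.replace(os.sep, ".")


def points_to_lines(metric, time_info, values):
    """Convert a (start, end, step) window plus a value list into plaintext lines.

    Empty slots (None) are skipped so that only real datapoints are sent.
    """
    start, _end, step = time_info
    return [f"{metric} {value} {start + i * step}\n"
            for i, value in enumerate(values) if value is not None]


def backfill(root, wsp_paths, read_points, send_lines, workers=24):
    """Read every file and send its datapoints, using a pool of worker threads.

    read_points(path) -> ((start, end, step), values) and send_lines(lines)
    are hypothetical helpers injected by the caller.
    Returns the total number of datapoints sent.
    """
    def migrate_one(wsp_path):
        metric = wsp_path_to_metric(root, wsp_path)
        time_info, values = read_points(wsp_path)
        lines = points_to_lines(metric, time_info, values)
        send_lines(lines)
        return len(lines)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(migrate_one, wsp_paths))
```

Threads are a reasonable fit here because the work is dominated by disk reads and network writes rather than by computation.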
This is what we achieved:
• disk subsystem utilization reduced from 30% to 1%,
• the amount of space occupied reduced from 1 TB to 300 GB,
• we can receive 125 million metrics per minute per server (peaks at the time of migration),
• all metrics switched to a 30 sec storage interval,
• data replication and resilience achieved,
• transition completed without downtime,
• the entire project completed within about 7 weeks.
Graphite+ClickHouse. Issues
Our project was not without pitfalls. This is what we encountered after the transition.
1. ClickHouse does not always re-read its configs on the fly; sometimes it needs to be restarted. For example, the description of the ZooKeeper cluster in the ClickHouse configuration was not applied until clickhouse-server was restarted.
2. Large ClickHouse queries did not work, so we use the following ClickHouse connection string in graphite-clickhouse:
3. New stable releases of ClickHouse often become available and may have bugs — be careful.
4. Dynamically created containers in Kubernetes send a large number of metrics with short and random lifetimes. There are few datapoints for such metrics, and we have not observed any problems with storage space. But when building queries, ClickHouse picks a huge number of these metrics from the ‘metrics’ table. In 90% of cases, there is no data for them within the requested slot (24 hours). However, searching for that data in the ‘data’ table takes time and ultimately results in a timeout. To solve this problem, we introduced a separate view with information on the metrics encountered during each 24-hour slot. Thus, when building reports (graphs) on dynamically created containers, we only query those metrics that were encountered within the given slot rather than over the entire time, which made report generation many times faster. To support this solution, we built graphite-clickhouse with an implementation of the interaction with the date_metrics table.
Since version 1.1.0, Graphite officially supports tags. We are looking into what has to be done, and how, to support this initiative in the Graphite+ClickHouse stack.
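The idea behind the date_metrics lookup can be modeled in a few lines. The sketch below is an in-memory illustration only: the per-day index is a plain dictionary standing in for the table, and all names are hypothetical. It narrows the candidate metrics to those actually seen on the days covered by the query before any values are fetched.

```python
from datetime import date, timedelta


def days_in_range(start, end):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def metrics_seen(candidates, seen_by_day, start, end):
    """Keep only the candidates that were encountered on some day in the range.

    `seen_by_day` maps a date to the set of metric names seen on that day.
    """
    active = set()
    for day in days_in_range(start, end):
        active |= seen_by_day.get(day, set())
    return [m for m in candidates if m in active]


# Hypothetical index: two short-lived containers, each seen on a single day.
seen_by_day = {
    date(2018, 1, 1): {"k8s.pod-a1b2.cpu"},
    date(2018, 1, 2): {"k8s.pod-c3d4.cpu"},
}
candidates = ["k8s.pod-a1b2.cpu", "k8s.pod-c3d4.cpu", "k8s.pod-e5f6.cpu"]
print(metrics_seen(candidates, seen_by_day, date(2018, 1, 2), date(2018, 1, 2)))
```

The expensive lookup of values then runs only for the metrics that survive this filter, instead of for every container that ever existed.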
Graphite+ClickHouse. Anomaly detector
Based on the infrastructure described above, we implemented a prototype of an anomaly detector, and it works! I will tell you more about it in my next post.
Subscribe, like, share, and stay tuned!