Making design decisions for ClickHouse as a core storage backend in Jaeger
Overview
ClickHouse database has been used as a remote storage server for Jaeger traces for quite some time, thanks to a gRPC storage plugin built by the community. Lately, we have decided to make ClickHouse one of the core storage backends for Jaeger, besides Cassandra and Elasticsearch. The first step for this integration was figuring out an optimal schema design. Also, since ClickHouse is designed for batch inserts, we also needed to consider how to support that in Jaeger.
There are different approaches for schema and batch insert support. So we decided to do performance benchmarks on these options, to figure out the optimal ones. In this blog, we will share our benchmark setups and results, from which we will recommend design choices.
For tracing data, we found two existing schema design approaches from the jaeger-clickhouse gRPC plugin (or plugin in short) and OpenTelemetry Collector’s ClickHouse exporter (or exporter). Both of these allow batching a configurable number of spans internally and sending batches of spans to ClickHouse. To handle batch inserts, ClickHouse also provides another option of using asynchronous insert feature that we would like to try. So overall, our performance benchmarks were carried out in these steps:
- Step 1: Figure out which schema design approach performs better between the plugin’s and the exporter’s.
- Step 2: For the higher-performance schema, figure out how it can be improved to perform even better.
- Step 3: With the schema improved from step 2, try ClickHouse’s asynchronous insert feature and compare the performance with internal client-side batching approach.
We ran performance benchmarks on different dimensions and configs. In this blog, we will share the setups and results of the benchmarks that showed the most representative results and led to our design choices.
The configs/scripts/code used to carry out the benchmarks can be found in this repo.
Step 1: Figure out the better schema design approach
The plugin and the exporter use different database schemas. For full details of each schema, you can visit the benchmark repo. Here, we will summarize the key differences between them:
To evaluate these schemas, we used these criteria (ordered by priority):
- Search performance
- Insert performance
- Compression ratio
The reasons for this order of priority are:
- Search performance is an important metric that clearly shows the difference between two schemas.
- Insert performance is also an important metric. But it is affected by other factors outside of the schema or the database, such as the number of insert threads from the client side.
- Compression ratio also shows the performance of each schema, that is, how well each schema compress data and saves storage space. However, compression ratio depends very much on the cardinality of data, which is different case by case in reality. Also, a high compression ratio may also indicate high decompression cost in search queries. So there are pros but also cons here.
The ultimate purpose of the benchmarks was to compare two schema approaches, not to examine the performance of setups, so we just aimed at a simple setup. For full technical details of the setup, please visit the benchmark setup directory. Overall, our setup consisted of:
- 1 instance of Jaeger Collector with jaeger-clickhouse plugin with batch size 100,000 for the plugin schema
- 1 instance of OpenTelemetry Collector with ClickHouse exporter with batch size 100,000 for the exporter schema
- 1 instance of ClickHouse single-node server for each collector
- 3 instances of jaeger-tracegen that generated 495 millions spans in 1.2 hours to each collector. To ensure that our data was representative enough, we configured generated traces to have 11 spans per trace (1 parent span and 10 child spans), 11 attributes per child span, 97 distinct attribute keys, and 1000 distinct attribute values. (There was one benchmark where we increased the number of distinct values in attribute keys/values to 1000/10000. But the benchmark results didn’t change noticeably).
For benchmark results, we retrieved data from ClickHouse system tables. The SQL queries to retrieve benchmark results can be found in the benchmark repo.
Here’s the performance results of two benchmarked schemas (important metrics are highlighted):
Regarding inserted_spans_per_sec_single_thread metric, this is the estimated inserted spans per second in a single insert thread, measured by dividing the number of spans saved by the sum of inserts’ durations. We needed to measure by single insert thread because the numbers of insert threads in the plugin and the exporter are different. The real inserted spans per second with multi-threaded inserts would be much higher.
With these results, we can see that:
- Search performance: the exporter schema performed better in most search queries except for searching all distinct services, searching all distinct operations, and retrieving all spans by trace ID. It seems that the search performance of the plugin schema was worse because it had to read more tables, hence took more time. However, the performance of searching all distinct services and searching all distinct operations of the plugin schema was much better, thanks to its jaeger_operations materialized view. This is one thing the exporter schema could learn from.
- Insert performance: Both used roughly equal memory per span insert. And the exporter schema inserted more spans per second per single insert thread than the plugin schema. So the exporter schema performed better here.
- Compression ratio: The plugin schema had better compression ratios.
As stated above, we prioritized search performance and insert performance over compression ratio, so to us the exporter schema was the better approach. The next step was trying different improvement opportunities on this schema.
Step 2: Try improvement opportunities on OpenTelemetry Collector’s ClickHouse exporter’s schema
The first thing we tweaked was ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId). According to ClickHouse’s doc, to ensure optimal columns’ compression ratios and search performance, the rule of thumb is to pick the columns from which data will be filtered and place lower-cardinality columns first in ORDER BY. So we tried adding two other filtering columns: SpanAttributes, Duration, sorting the cardinalities of the picked columns and had this new ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), Duration, SpanAttributes, TraceId).
With this new ORDER BY, the only noticeable change we observed was that the compression ratio of the Duration column doubled from 3.89 to 7.43, so adding Duration column to ORDER BY seemed like a reasonable choice. We didn’t see noticeable changes in the compression or search performance of the SpanAttributes column. So we’ll take more consideration about whether to include this column in ORDER BY. Also, since the cardinalities of filtering columns may be different case by case, we will allow users to configure their ORDER BY for optimal performance.
Next, we tried tweaking the data types of some columns since they showed better compression ratios in equivalent columns in the plugin schema:
With these changes in data types, we saw these changes in related metrics:
From the metrics, we saw that the new data types performed better:
- The compression ratios were improved. Especially with SpanAttributes, both uncompressed size and compressed size were reduced and compression ratio was increased.
- Searching all spans’ information by tag/attribute took less time and memory.
We’ll include these changes into our design choices.
(The setups and modified codebase of the OpenTelemetry Collector’s ClickHouse exporter for these schema changes can be found in the benchmark repo).
Step 3: Try ClickHouse’s asynchronous insert feature on the furthest improved schema
In the previous benchmarks, we tried client-side batching in both Jaeger Collector and OpenTelemetry Collector. In this step, we tried server-side batching with ClickHouse’s asynchronous insert feature. Overall, we ran three benchmarks:
- Server-side batching alone
- Server-side batching combined with small client-side batching of 10,000 spans per batch
- Server-side batching combined with large client-side batching of 100,000 spans per batch
The setups and modified codebase of the OpenTelemetry Collector’s ClickHouse exporter for these benchmarks can be found in the benchmark repo.
Here’s the benchmark results:
From the metrics, asynchronous insert didn’t improve the rate of inserted spans, even though it decreased the memory usage per span insert. In fact, asynchronous insert in ClickHouse has a wide range of settings and we just tried one example of settings in our benchmark. We need more experiments and feedback from users to decide whether we should support this in our storage backend.
Conclusion
From the benchmark results, we concluded our design choices for ClickHouse as a core storage backend for Jaeger will notably consist of:
- A single table for all data fields
- A materialized view for fast services and operations retrieval to Jaeger UI
- A materialized view of trace ID’s time range for fast spans retrieval by trace ID
- ORDER BY lower-cardinality filtering columns first: We will provide a default ORDER BY and also allow users to configure since the cardinalities of filtering columns may be different case by case.
- Span attributes stored with Nested type
- Support for client-side batching
The next step would be implementing all these choices into actual features.
Stay tuned!