How to build a scalable, low-cost observability solution

d0uble
6 min read · Jul 23, 2022

Hi, everyone. I’m Vladimir Borodin, CTO at DoubleCloud. DoubleCloud is a platform for building cost-effective, sub-second analytical applications on tightly integrated, proven open-source technologies such as ClickHouse and Kafka. Today I would like to talk about how we created a scalable and low-cost observability solution. This topic should be interesting for data engineers, software developers, and DevOps engineers: anyone who monitors production systems and needs tools to debug what is really happening.

What is Observability?

All of us want to provide a better experience for our customers. To do this, we need to be sure that our service runs with good uptime and performance: it must be stable and fast. And if something does go wrong, we need a fast way to debug what happened and why, and to fix it proactively, before users even notice. This is where observability helps. Observability is the ability to understand the current state of a system from the data it generates, such as logs, metrics, and traces.

There are a number of observability tools, ranging from open-source tracing backends and metrics or log collection tools such as Jaeger, Zipkin, Prometheus, Loki, and Vector, to paid SaaS systems such as Honeycomb, Datadog, Splunk, or Dynatrace. Both open-source and SaaS solutions have their pros and cons. The latter are ready-to-use products supported by a vendor, so you can start using them immediately without building your own solution or maintaining tools. Such products, however, can cost hundreds of thousands of dollars, whereas open-source tools are mostly free of charge. Open-source tools provide customization and flexibility and give you complete control of your data. They help you avoid vendor lock-in and often have a large community ready to help.

But what about an all-in-one, open-source observability solution? There is one option: OpenTelemetry. Unfortunately, there is no mature implementation right now, and the API is still quite unstable. I can imagine OpenTelemetry becoming the standard in a few years.

When we began building observability for ourselves, we started solving logging, metrics, and tracing as three separate problems, each in a different way. We are building a cloud-agnostic solution, so we mainly use open-source products. Before digging in, we need to cover a couple of important things about how our service is built. It has two main parts, the Control Plane and the Data Plane. The Control Plane is a set of microservices deployed on Kubernetes managed by AWS; each development team has its own k8s cluster in a separate AWS account. The Data Plane is where we actually create virtual machines: the runtime is EC2 instances in AWS, created for each user in their own VPC.

Let’s take a look at our solution.

Step I: Tracing

Tracing is a convenient way to debug the performance of your applications and see where their bottlenecks are. An application is instrumented with a special SDK that sends tracing points for events of interest to the Jaeger collector. The collector gathers the data into batches and sends it to the Jaeger server in a separate Kubernetes cluster. As storage for all this data it uses ClickHouse, which we manage ourselves, so we are dogfooding our own service. That is one of our principles, and we are heavily involved in the open-source project Jaeger over ClickHouse, which gives Jaeger the ability to store data in ClickHouse.
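
To show what “instrumented with a special SDK” looks like in practice, here is a minimal Python sketch using the Jaeger client library. The service name, agent address, and span names are hypothetical examples, not our actual instrumentation.

```python
# Minimal Jaeger instrumentation sketch (hypothetical service and agent names).
import time

from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample every trace; fine for a demo
        "local_agent": {"reporting_host": "jaeger-agent", "reporting_port": 6831},
        "logging": True,
    },
    service_name="billing-api",  # hypothetical service name
    validate=True,
)
tracer = config.initialize_tracer()

# Wrap an interesting operation in a span; nested spans show where the time goes.
with tracer.start_active_span("handle_request") as scope:
    scope.span.set_tag("user_id", "42")
    with tracer.start_active_span("query_database"):
        time.sleep(0.05)  # stand-in for real work

time.sleep(2)   # give the reporter a moment to flush spans to the agent
tracer.close()
```

Each span records its duration and tags, so in the Jaeger UI you can see exactly which part of a request was slow.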

We use ClickHouse in particular because it hits the sweet spot for workloads where roughly a million rows per second are inserted and the data is queried rarely, but when you do query it you want the result in less than a second.

This solution is really low-cost compared to Jaeger over Cassandra (for example, the compression ratio for Jaeger data in ClickHouse is 11x), and it is scalable: you can start with a cluster of four cores and scale it up to hundreds.

Step II: Logging

This is what our logging solution looks like. Again there is an application deployed in Kubernetes, and there is an agent (we use Vector) that collects logs from all the applications and sends them to Kafka, which acts as a distributed queue. Then there is Transfer, which pulls data from Kafka and writes it to ClickHouse. Transfer is not open source, so if you want to build this pipeline yourself without a commercial service, there are a couple of alternatives, such as ClickHouse’s native Kafka table engine or Airbyte. We manage all these components ourselves, meaning we are once again dogfooding our own service. After the logs are in ClickHouse, there are two ways of viewing them. The first is our own web interface, which we provide to end users of our service. For ourselves, there is an interface through Grafana with a plugin and a special query backend.
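
As an illustration of the Kafka-engine alternative mentioned above, here is a minimal sketch that makes ClickHouse consume a Kafka topic itself and land the rows in a MergeTree table via a materialized view. The broker address, topic, and columns are hypothetical, and this is not how Transfer works internally.

```python
# Sketch: ClickHouse's native Kafka engine as an open-source substitute for Transfer.
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # hypothetical ClickHouse host

# Consumer table: ClickHouse reads JSON log lines straight from Kafka.
client.execute("""
CREATE TABLE IF NOT EXISTS logs_queue (
    timestamp DateTime64(3),
    host String,
    message String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'app-logs',
         kafka_group_name = 'clickhouse-logs',
         kafka_format = 'JSONEachRow'
""")

# Target table where the logs actually live.
client.execute("""
CREATE TABLE IF NOT EXISTS logs (
    timestamp DateTime64(3),
    host String,
    message String
) ENGINE = MergeTree
ORDER BY (host, timestamp)
""")

# The materialized view continuously moves rows from the Kafka consumer table into MergeTree.
client.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS logs_mv TO logs AS
SELECT timestamp, host, message FROM logs_queue
""")
```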

We have half a petabyte of uncompressed data, inserted at an average rate of 500 megabytes per second, and we still get sub-second query latency for fresh logs (the last three days), thanks to the ClickHouse-over-S3 feature that our team was heavily engaged in developing in upstream ClickHouse. Of that half petabyte, only eight terabytes are stored on SSD; everything else lives in S3, where it is also compressed with an 8x compression ratio. According to our calculations, at our data volume it comes to $40 per terabyte of logs per month, which is literally ten times cheaper than any other ready-to-go solution on the market.
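
A minimal sketch of such tiered storage is shown below: recent parts stay on SSD and older parts move to an S3-backed volume. It assumes a storage policy named 'tiered' (a hot SSD volume plus a cold S3 volume) is already defined in the ClickHouse server configuration; the table, policy name, and the three-day interval are illustrative, not our production settings.

```python
# Sketch: keep fresh log parts on SSD, move older parts to an S3-backed volume.
# Assumes a 'tiered' storage policy (hot SSD + cold S3 disks) exists in server config.
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # hypothetical ClickHouse host

client.execute("""
CREATE TABLE IF NOT EXISTS logs_tiered (
    timestamp DateTime64(3),
    host String,
    message String
) ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (host, timestamp)
TTL toDateTime(timestamp) + INTERVAL 3 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered'
""")
```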

Step III: Metrics

Let’s switch to metrics. We use Prometheus managed by AWS, the AMP service. For the Control Plane, each of our applications exposes a Prometheus endpoint; a Prometheus instance deployed in the same Kubernetes cluster gathers all the metrics from all our applications and then writes them to AMP with the remote-write API. For the Data Plane, we use Telegraf as the engine for collecting the metrics we actually want. Telegraf sends metrics to a Vector aggregator deployed in the Control Plane. Vector can’t write directly to AMP because it doesn’t support AWS SigV4, so Vector exposes a Prometheus endpoint instead; Prometheus scrapes those metrics and then sends them to AMP. And again we have two entry points: one for users (we show the metrics in our web interface) and one for us (Grafana).
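
For reference, exposing a Prometheus endpoint from an application takes only a few lines. Here is a sketch with the prometheus_client Python library; the metric names and port are hypothetical, and in our setup the in-cluster Prometheus scrapes such an endpoint and remote-writes the series to AMP.

```python
# Sketch: an application exposing metrics for Prometheus to scrape on /metrics.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total handled requests", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request() -> None:
    with LATENCY.time():                        # observe how long the request took
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # serves the /metrics endpoint on port 8000
    while True:
        handle_request()
```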

Our dream is to build Prometheus over ClickHouse in the future, because that would lower our cost of storing metrics and help us avoid cloud lock-in.

AWS cross-account network setup

From an infrastructure point of view, our network setup is quite complicated. We use a Transit Gateway to send data from the many networks in the Data Plane to the Control Plane. To push data from the Control Plane network to a dedicated network in the Data Plane, we use our VPC peering feature: it gives our end users the ability to peer our VPC, where we provision their clusters and other resources, with the user’s own VPC. For cross-account traffic between different Control Planes we use endpoint services and interface endpoints. So we use pretty much every networking tool available in AWS at once :)
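
As an illustration of the VPC peering piece (not our actual provisioning code), a rough boto3 sketch of peering two VPCs across accounts looks like this; all IDs, account numbers, and CIDRs are placeholders.

```python
# Sketch: cross-account VPC peering with boto3; every identifier below is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# 1. Request peering from our VPC to the user's VPC in another account.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0123456789abcdef0",       # our VPC (placeholder)
    PeerVpcId="vpc-0fedcba9876543210",   # the user's VPC (placeholder)
    PeerOwnerId="111122223333",          # the user's AWS account (placeholder)
    PeerRegion="eu-central-1",
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# 2. The peer VPC's owner accepts the request (this runs with their credentials).
peer_ec2 = boto3.client("ec2", region_name="eu-central-1")
peer_ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# 3. Both sides add routes so traffic flows over the peering instead of the internet.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",   # placeholder route table
    DestinationCidrBlock="10.20.0.0/16",    # the user's VPC CIDR (placeholder)
    VpcPeeringConnectionId=pcx_id,
)
```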

The traffic does not go through the public internet, which matters from a security point of view; and from a cost point of view, traffic through VPC peering doesn’t cost you anything.

This is what our observability stack looks like. Most of the components are open source, so it’s cloud agnostic and you can take it and build it on your own, even on premises. Another advantage of the solution is that it is scalable: you can start with gigabytes of logs and then scale to petabytes across several hundred shards. But the main advantage is the cost. Forty dollars per terabyte of logs is more than ten times cheaper than other ready-to-go solutions, and when you query your logs or traces, you get sub-second latencies.
