KubeCon Europe 2024: Cloud Native In La Ville-Lumière | In The Spotlight

Zhu Jiekun
9 min read · Mar 24, 2024


It’s early Saturday morning in Paris, France. KubeCon + CloudNativeCon Europe 2024 — the largest event ever hosted by CNCF — has just concluded. At Observability Day, I shared one of the many solutions for correlating the three pillars of observability: creating Metrics and Logs from Spans.

Session Review

Intro | Video | Slides | Original Blog (In Simplified Chinese)

In TTChat, microservice observability faces two challenges: how to instrument microservices without code modification across diverse programming languages and frameworks, and how to connect logs, metrics, and traces. We will share our story of building an observability platform. We initially relied on client SDKs, collecting 200 TB of data per day, and then transitioned to using eBPF for auto-instrumentation. We explored various approaches to connect the three pillars of observability. By leveraging the OpenTelemetry Collector and self-built exporters, we preserved meaningful observability data, particularly in connecting error traces and logs with metrics. We will share those practical experiences and discuss why connecting eBPF data without context is difficult. We will not only showcase the key points of the different approaches but also delve into the challenges we encountered. Attendees can learn how to connect observability data and take away valuable lessons to avoid repeating our mistakes.

Observability Day | Room N01-N02

Exemplar: The Pain Point

The essence of correlating observability signals lies in finding their commonalities, and it’s clear that the Trace Context is the link that connects them. But think back to the metrics alerts you usually see:

  • Do they come with the corresponding logs?
  • Do they provide the suspicious Traces?

If the answer to both is no, then congratulations: you are like us (and possibly many other developers) who did not attach Exemplars to Metrics. And this blog post might help you.

For 80% of my career so far, I wasn’t even aware that Exemplars could help, perhaps because they are a relatively new concept, or perhaps because I never had a good opportunity to learn about them. For Go programs, maintaining Context propagation is often the bigger problem: not everyone adheres to Go’s conventions, and this ‘non-standard’ code still runs perfectly well.

Therefore, even with Exemplars available as a solution, they are more often used by the developers of frameworks, scaffolding, and SDKs than by the developers of business logic. However, monitoring metrics are often closely tied to business logic, which means that business logic is precisely where Exemplars are needed most to correlate signals.
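
To make this concrete, here is a minimal Go sketch of attaching an Exemplar when recording a metric, using the Prometheus client library and the OpenTelemetry trace API. The metric name, endpoint, and port are illustrative assumptions, not part of our platform.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/trace"
)

// Hypothetical latency histogram; the name and buckets are illustrative.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Request latency in seconds.",
	Buckets: prometheus.DefBuckets,
})

// observe records a latency sample and, when the request context carries a
// valid span, attaches its trace ID as an Exemplar so an alert on this
// metric can link straight back to a suspicious Trace.
func observe(r *http.Request, seconds float64) {
	sc := trace.SpanFromContext(r.Context()).SpanContext()
	if eo, ok := requestDuration.(prometheus.ExemplarObserver); ok && sc.HasTraceID() {
		eo.ObserveWithExemplar(seconds, prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	requestDuration.Observe(seconds)
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(requestDuration)

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// ... business logic ...
		observe(r, time.Since(start).Seconds())
	})

	// Exemplars are only exposed via the OpenMetrics exposition format.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{EnableOpenMetrics: true}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

Note that the Exemplar is only useful if the handler’s context actually carries the Trace Context, which is exactly the Go propagation problem mentioned above.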

Span Metrics Connector

If correlating signals distributed in different places doesn’t work, then it might be worth trying to give up on correlation and instead create the other two types of signals from one signal. In theory, if a certain signal carries enough information to encompass the other two signals, then this idea is feasible.

During last year’s Humans of OTel interview, most of the interviewees indicated that Traces are their favorite signal because they carry the most information. This led to the creation of the Span Metrics Connector:

  • Converts the number of Spans into a Counter, corresponding to the number of requests.
  • Converts the Span Status into a Counter whose ratio to the total number of requests gives the error rate.
  • Converts the start and end times of Spans into a Histogram, corresponding to the distribution of request durations.

So we can easily obtain the R.E.D. (Rate, Errors, Duration) metrics.

Following this simple idea, as long as we keep enriching Spans with information such as HTTP parameters or SQL statements, we can essentially derive basic logs from them and convert these into logs that supplement the existing ones.

As a bonus, the Metrics and Logs obtained through the Span Metrics Connector all carry the Trace Context, because they are inherently derived from the Traces.
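
To make the mapping above concrete, here is a minimal Go sketch that derives R.E.D.-style metrics from finished Spans inside the process, using an OpenTelemetry SDK SpanProcessor. It mirrors the idea of the Span Metrics Connector rather than reproducing it; the metric names and label set are assumptions for illustration.

```go
package redmetrics

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/codes"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// redProcessor turns finished spans into Rate, Errors, and Duration metrics:
// a counter of spans (rate), a status label on that counter (errors), and a
// histogram of span durations (duration).
type redProcessor struct {
	calls    *prometheus.CounterVec
	duration *prometheus.HistogramVec
}

func NewREDProcessor(reg prometheus.Registerer) sdktrace.SpanProcessor {
	p := &redProcessor{
		calls: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "calls_total", Help: "Number of finished spans.",
		}, []string{"span_name", "status"}),
		duration: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Name: "span_duration_seconds", Help: "Span duration.", Buckets: prometheus.DefBuckets,
		}, []string{"span_name"}),
	}
	reg.MustRegister(p.calls, p.duration)
	return p
}

func (p *redProcessor) OnStart(context.Context, sdktrace.ReadWriteSpan) {}

func (p *redProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	status := "ok"
	if s.Status().Code == codes.Error {
		status = "error" // error rate = error calls / total calls
	}
	p.calls.WithLabelValues(s.Name(), status).Inc()
	p.duration.WithLabelValues(s.Name()).Observe(s.EndTime().Sub(s.StartTime()).Seconds())
}

func (p *redProcessor) Shutdown(context.Context) error   { return nil }
func (p *redProcessor) ForceFlush(context.Context) error { return nil }
```

Wiring it up would be a one-liner such as sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(NewREDProcessor(prometheus.DefaultRegisterer))). In practice the Collector’s Span Metrics Connector does this work centrally instead of inside every process; the sketch only shows why the conversion is straightforward once the Span carries enough information.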

eBPF

eBPF actually doesn’t have such a strong connection with correlating signals; it simply offers another method for generating and collecting Spans — if you haven’t embedded Trace SDK instrumentation points, then it can do this job for you. To be precise, it should be categorized as one of the Auto Instrumentation implementations, rather than as an implementation for correlating signals.

In observability, eBPF programs generally offer two types of support. The first type does not modify your code or network requests; it simply collects specific behaviors (such as network calls) and parses their content. The second type involves combining with the code and framework for instrumentation, producing the same effect as if you had embedded points with a Client SDK.

Both of these approaches can collect the corresponding Spans, but:

  • If there is no actual instrumentation, the Spans collected by eBPF lack the Trace Context. Correlating them relies on some underlying information, making the correlation quite difficult, both in terms of completeness and performance.
  • If actual instrumentation is to be performed, it must be closely integrated with the respective framework’s SDK. This requires tailored development efforts for each framework and also calls for attention to the potential performance overhead.

The eBPF Agent we are using employs the first approach. It effectively provides Metrics and Logs, but its performance with Traces does not meet expectations. Therefore, I have greater expectations for the second approach, which we may also consider exploring in the future.

The Missing Chapter

Is the Span Metrics Connector really so simple and practical? The answer is no. During Observability Day, many details were left out due to time constraints and topic selection, so I intend to cover some of them in this blog post.

Trace Sampling

One well-known difficulty with Traces is the handling of massive amounts of data. For instance, if globally there are 1 million Spans generated every second, and each Span is 100 Bytes in size, then the daily storage space required for these Traces would be:

1,000,000 Spans/s × 100 Bytes/Span × 86,400 s/day ÷ (1024 × 1024 × 1024 × 1024) Bytes/TiB ≈ 7.8 TiB/day

Annually, that adds up to roughly 2,800 TiB to store all the Trace Spans.

The cost of disk space in this situation is significant, so users often set a sampling rate, such as 0.1%, when generating Traces to reduce the amount of data.

Typically, the sampling rate is not included as an attribute within the Span. As a result, one challenge with the Span Metrics Connector is that the number of Spans (before sampling) is unknown and can fluctuate over different times of the day. Therefore, to ensure the accuracy of the Counter metric, additional information is necessary for correction.

The issue mentioned above arises with the use of head-based sampling alone. However, as business complexity increases and multiple sampling methods are adopted, the introduction of retroactive sampling — which ensures that erroneous Spans are reported, while normal Spans are reported at a fixed or dynamic rate — makes the task of correcting the span-generated error rate metrics more challenging.

In practice, there may be more complexities than we have described. One way to address these issues is to centralize control of the sampling rate in the platform. OpenTelemetry’s SDK provides an interface for sampling policies, allowing users to adopt custom sampling strategies instead of the default head-based sampling. A custom strategy needs to be coordinated with the observability platform: the platform disseminates the sampling rate through a notification mechanism, and the rate is cached locally in both the SDK and the Span Metrics Connector. This allows the data to be adjusted during the conversion of signals; a simplified sampler along these lines is sketched below.
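
As an illustration of the custom-sampler idea, here is a minimal Go sketch that wraps the SDK’s ratio-based sampler and stamps the effective rate onto each sampled Span. The attribute key sampling.rate and the fixed rate are assumptions for illustration; in the setup described above, the rate would come from the platform’s notification mechanism rather than a constant.

```go
package sampling

import (
	"fmt"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// rateTaggingSampler delegates the sampling decision to a ratio-based
// sampler and records the rate as a span attribute, so a downstream
// span-to-metrics step can scale counts back up (increment by 1/rate).
type rateTaggingSampler struct {
	delegate sdktrace.Sampler
	rate     float64
}

func NewRateTaggingSampler(rate float64) sdktrace.Sampler {
	return rateTaggingSampler{delegate: sdktrace.TraceIDRatioBased(rate), rate: rate}
}

func (s rateTaggingSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	res := s.delegate.ShouldSample(p)
	if res.Decision == sdktrace.RecordAndSample {
		// "sampling.rate" is our own convention, not an OpenTelemetry semantic convention.
		res.Attributes = append(res.Attributes, attribute.Float64("sampling.rate", s.rate))
	}
	return res
}

func (s rateTaggingSampler) Description() string {
	return fmt.Sprintf("RateTaggingSampler{%g}", s.rate)
}
```

It could be installed with sdktrace.WithSampler(sdktrace.ParentBased(NewRateTaggingSampler(0.001))); the metric-generation side then divides observed counts by the recorded rate to approximate pre-sampling volumes.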

High Cardinality

Using the Span Metrics Connector almost inevitably results in high-cardinality metrics. I already discussed some measures for handling high-cardinality metrics in my presentation, but in truth they do not solve the problem; they only postpone when it arises.

We are evaluating the migration from Prometheus & Thanos to the entirely new VictoriaMetrics architecture. As a drop-in replacement for Prometheus, VictoriaMetrics offers better performance with lower resource overhead. As I mentioned in our internal research document, it can save us over 50% in resources.

So am I putting all my bets on the performance of VictoriaMetrics? Definitely not. No matter what storage solution we adopt, the capacity to support queries for high-cardinality metrics is always subject to limitations. The key reasons for selecting VictoriaMetrics are:

  • Its horizontal scaling is smoother;
  • If the VMCluster can scale horizontally without bounds (or at least to a very large extent), we can put a single transparent proxy in front of it to analyze metric usage and control the ingestion of high-cardinality metrics (a rough sketch of such a proxy follows below).

The second reason is very important for the new architecture, because in the Prometheus + Thanos ecosystem, many Prometheus instances and Thanos Sidecars are scattered across various clusters, making it challenging to implement a transparent proxy and quickly deploy it to cover all Prometheus instances.
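
For that second point, here is a very rough Go sketch of what such a transparent proxy could look like: it decodes Prometheus remote-write requests, tracks how many distinct series each metric name has produced, and rejects writes once a metric exceeds a budget. The backend address, budget, and naive in-memory tracking are all illustrative assumptions, not our production design.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

const maxSeriesPerMetric = 100000 // hypothetical per-metric cardinality budget

var (
	mu     sync.Mutex
	series = map[string]map[string]struct{}{} // metric name -> set of label fingerprints
)

// overBudget assumes labels arrive sorted by name, as the remote-write
// spec requires, so the joined label string is a stable fingerprint.
func overBudget(wr *prompb.WriteRequest) bool {
	mu.Lock()
	defer mu.Unlock()
	for _, ts := range wr.Timeseries {
		var name string
		var fp bytes.Buffer
		for _, l := range ts.Labels {
			if l.Name == "__name__" {
				name = l.Value
			}
			fp.WriteString(l.Name + "=" + l.Value + ";")
		}
		set, ok := series[name]
		if !ok {
			set = map[string]struct{}{}
			series[name] = set
		}
		set[fp.String()] = struct{}{}
		if len(set) > maxSeriesPerMetric {
			return true
		}
	}
	return false
}

func main() {
	backend, _ := url.Parse("http://vminsert:8480") // hypothetical vminsert address
	proxy := httputil.NewSingleHostReverseProxy(backend)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Remote-write bodies are snappy-compressed protobuf WriteRequests.
		if raw, err := snappy.Decode(nil, body); err == nil {
			var wr prompb.WriteRequest
			if wr.Unmarshal(raw) == nil && overBudget(&wr) {
				http.Error(w, "metric exceeds cardinality budget", http.StatusTooManyRequests)
				return
			}
		}
		r.Body = io.NopCloser(bytes.NewReader(body)) // restore the body before proxying
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A real implementation would also need eviction for the series map, per-tenant budgets, and metric-usage analytics, but the shape of the idea is the same.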

We are aware that a centralized architecture also has corresponding downsides:

  • The system’s overall availability will decrease: availability = Min(VMCluster, the transparent proxy, …);
  • If metric data is stored centrally, cross-AZ data transmission will increase traffic costs and introduce availability risks around the dedicated network lines used for cross-AZ transmission.

These issues can be (partially) resolved by adopting a replication architecture that keeps multiple copies of the data for higher availability, but that multiplies costs and leads to a more complex architecture.

Conclusion

The ideas I shared at this KubeCon came together after only a few months of working on metrics management, so they are actually not complicated.

As end users, our engagement with and development of these components often falls short of what vendors can achieve, so I believe the content I shared is still aimed at entry-level users, in the hope of inspiring those who are new to the field.

At the same time, the idea behind the Span Metrics Connector is not as simple in practice as I made it out to be in my session presentation, especially when the data volume grows to a certain size. At that point, compromises between data accuracy and performance become necessary. For example, the capacity for Trace reporting and processing is limited, and one cannot ignore the impact of a wide variety of sampling policies on the data processing pipeline. Even if one manages to store vast amounts of data, the use of high-cardinality metrics poses risks at every turn, requiring a significant investment of resources for safeguarding.

Behind The Scenes

This blog is written on my 243rd day working at Quwan, which is also the seventh month since I started working in the infrastructure field. I had never done anything related to infrastructure before, but in the past seven months, I’ve completed three external presentations (two at KubeCon and one at KCD). I think it’s not easy for newcomers to step onto the stage and face the audience. I feel very fortunate to be able to work in an area that interests me, and I believe this is a perfect example of interest-driven learning.

However, I believe that learning requires deeper reflection and continuous accumulation of knowledge, so this will probably be my last external talk for the next few years. What makes me happy is that I have already seen the small impact of technical sharing: it has influenced the people around me, making them willing to share and to try submitting their own CFPs.

Thank you to KubeCon. Good night, Paris.
