How Cdiscount uses distributed tracing to monitor its microservices

AymericD
Peaksys Engineering
6 min read · Jun 6, 2022

The Cdiscount Tech Unit’s information system is a distributed system made up of nearly 900 microservices running on a Kubernetes cluster with nearly 30,000 CPUs. A single query, whether it comes from a client or from another system, therefore crosses several containers before the requested response can be computed.

We target 99.99% uptime on our high-level services (the order process…), so a precise monitoring and alerting system is mandatory to identify incidents as quickly as possible and meet our global availability goal.

The microservices already individually expose metrics and logs that give us their health status, raise alerts, and allow us to debug when an incident occurs. However, the cascade of calls between the various containers makes it difficult to know which component is really behind a problem or incident: the entire chain of services leading up to the problem raises alerts at the same time. This needlessly increases the time needed to correct the issue, and the analysis required to reconstruct the query as a whole is always difficult to carry out.

This observation led us to set up distributed tracing in order to highlight all the microservices that a user query goes through and to see the time the query spends in the application code of each component.

Gathering traces in a central system to find the source of each problem or incident

A trace shows the path that data or execution follows through a system. It is identified by a single ID and contains at least one span; each span includes the name of the operation, its start time and its duration.
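As a purely illustrative example (the field names below are not a specific wire format), a trace for a product-page request could conceptually look like this:

```python
# Illustrative only: one trace ID, several spans, each with an operation name,
# a start time and a duration, as described above.
trace = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "spans": [
        {"operation": "GET /product/123", "start": "12:00:00.000", "duration_ms": 240},
        {"operation": "pricing-service",  "start": "12:00:00.020", "duration_ms": 90},
        {"operation": "stock-service",    "start": "12:00:00.115", "duration_ms": 60},
    ],
}
```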

There are a variety of tracing tools available: Zipkin, Jaeger, Tempo, etc. At Cdiscount, we chose the Jaeger and OpenTelemetry standards for the instrumentation of our applications: Jaeger is used as the tracing backend, and OpenTelemetry propagates and enriches the traces. For the storage layer behind Jaeger, we chose Elasticsearch to capitalise on our existing knowledge in the domain. The stack is deployed on Kubernetes through a Helm chart, and Azure DevOps handles the CI/CD part.

We chose an open-source architecture: that is our culture

OpenTelemetry as a shared language

Our applications use the OpenTelemetry library to generate the traces and spans. OpenTelemetry is a standard that is supported by most observability solution providers. Our containers are set up to push traces to Jaeger collector instances. It is still up to the developers to fine-tune the instrumentation within the applications.
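As a rough sketch of what that instrumentation looks like with the OpenTelemetry Python SDK (the service name, collector endpoint and span attribute below are placeholders, not our actual configuration):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service and export spans in batches to a collector endpoint
# ("jaeger-collector:4317" is a placeholder address).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="jaeger-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Developers then refine the instrumentation with their own spans and attributes.
with tracer.start_as_current_span("compute-cart-total") as span:
    span.set_attribute("cart.items", 3)
```

In most languages the usual HTTP and database clients can also be auto-instrumented, so hand-written spans are only needed for business-specific steps.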

Jaeger collector to receive traces

Its role is to receive the traces coming from the applications, validate them, index them, sample them, and then push them to Kafka. Sampling means we only keep a portion of the traces; it can be handled on either the server or the client side. A balance has to be found between using up disk space and detecting weak signals: a large volume of stored traces costs money, but it helps to find problems that occur infrequently.

The sampling strategy

We chose to set up our sampling on the server side in order to have a consistent strategy throughout the platform. There are two ways to handle sampling:

Rate limiting

This strategy lets us set a number of traces per second to keep. For example, if we set a value of 20, Jaeger keeps the first 20 traces per second for a given service; the other traces within that second are ignored, and once the second is over, the counter resets to zero. Since this method can introduce a statistical bias by only keeping the first traces of each one-second window, we chose not to use it.

Probabilistic

This is the strategy we chose. Here, we set a percentage of traces to keep. For example, if we set a value of 20%, Jaeger only keeps around two out of every ten traces it receives. It decides which traces to keep completely at random, which allows us to sample all types of traces evenly.
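To make that concrete, probabilistic sampling boils down to an independent random draw per trace; a quick simulation (illustrative only) shows that a 20% rate keeps roughly two traces out of every ten on average:

```python
import random

SAMPLE_RATE = 0.2  # keep ~20% of traces

def keep_trace() -> bool:
    # Each trace is kept or dropped independently, at random.
    return random.random() < SAMPLE_RATE

kept = sum(keep_trace() for _ in range(10_000))
print(f"kept {kept} of 10,000 simulated traces (~{kept / 100:.1f}%)")
```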

We should add that the sampling can be fine-tuned: we have set a global “probabilistic” strategy that applies by default, but we also have exceptions for certain services and endpoints where we want to keep 100% of traces (a sketch of such a configuration is shown below). The benefit of this kind of configuration is being able to adjust the percentage of traces kept according to how many traces each service emits: if a service emits few traces, we can increase the percentage we keep for it; if it emits a very large number, we can reduce that percentage, all without impacting the other services.
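In Jaeger, this kind of tuning is typically centralised in a sampling strategies file handed to the collector, which then serves the strategy to each service. The sketch below (service names and rates are made up for the example) builds such a file from Python:

```python
import json

# Hypothetical strategies: 20% by default, 100% for a low-traffic service,
# and a reduced rate plus a per-endpoint override for a very chatty one.
strategies = {
    "default_strategy": {"type": "probabilistic", "param": 0.2},
    "service_strategies": [
        {"service": "payment-api", "type": "probabilistic", "param": 1.0},
        {
            "service": "catalog-api",
            "type": "probabilistic",
            "param": 0.05,
            "operation_strategies": [
                {"operation": "GET /health", "type": "probabilistic", "param": 0.0}
            ],
        },
    ],
}

with open("sampling-strategies.json", "w") as f:
    json.dump(strategies, f, indent=2)
```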

Kafka for asynchronous processing

With more than 2,000 applications in our system, a very high number of traces is generated and collected. This is why we chose Kafka to absorb this throughput and decouple trace collection from storage.

Jaeger Ingester to consume traces

The ingester’s role is simple: it consumes the traces from Kafka and pushes them to Elasticsearch.
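Concretely, this hand-off is mostly configuration on the two Jaeger components. A hedged sketch of the relevant settings, expressed here as Python structures for readability (broker addresses, topic name and Elasticsearch URL are placeholders):

```python
# Collector side: hand spans off to Kafka instead of writing to storage directly.
collector_env = {"SPAN_STORAGE_TYPE": "kafka"}
collector_flags = [
    "--kafka.producer.brokers=kafka-1:9092,kafka-2:9092",
    "--kafka.producer.topic=jaeger-spans",
]

# Ingester side: consume the same topic and write the spans to Elasticsearch.
ingester_env = {"SPAN_STORAGE_TYPE": "elasticsearch"}
ingester_flags = [
    "--kafka.consumer.brokers=kafka-1:9092,kafka-2:9092",
    "--kafka.consumer.topic=jaeger-spans",
    "--es.server-urls=http://elasticsearch:9200",
]
```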

Elasticsearch to store information

Elasticsearch’s role is to handle trace storage. Retention is set at seven days: the volume of traces can be quite large, and storing them for longer is not very useful since traces are primarily used for debugging, while metrics already cover monitoring the latency of our microservices.
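Since Jaeger writes daily indices in Elasticsearch (jaeger-span-YYYY-MM-DD and jaeger-service-YYYY-MM-DD by default), a seven-day retention comes down to deleting older indices. Jaeger ships a dedicated index-cleaner job for this; the sketch below (with a placeholder Elasticsearch URL) only illustrates the idea:

```python
from datetime import date, timedelta

import requests

ES_URL = "http://elasticsearch:9200"  # placeholder address
RETENTION_DAYS = 7

# Delete the daily span and service indices that have aged out of the window.
for days_back in range(RETENTION_DAYS, RETENTION_DAYS + 30):
    day = date.today() - timedelta(days=days_back)
    for prefix in ("jaeger-span", "jaeger-service"):
        # A 404 simply means that day's index is already gone.
        requests.delete(f"{ES_URL}/{prefix}-{day:%Y-%m-%d}", timeout=30)
```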

Jaeger Query for human analysis

Jaeger Query provides a UI for querying and viewing the traces.

Jaeger Query homepage: an overview of all the traces, with a bar graph showing latency and a list of each trace
Jaeger Query: displaying a single trace

Grafana as a hypervision (single-pane-of-glass) tool

In Grafana, we can set up a Jaeger data source to view the traces from the Jaeger backend.

Grafana: listing the traces
Grafana: displaying a single trace

The advantages of using Grafana to view the traces are authentication and the correlation between logs, traces and metrics, with a single interface for all three signals.
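Setting up the Jaeger data source itself is lightweight. The fragment below mirrors what a Grafana data source provisioning entry could look like (the name and URL are placeholders for our internal Jaeger Query address), written out here from Python:

```python
import json

# Hypothetical equivalent of a Grafana data source provisioning entry
# (in practice this lives in a YAML file or is pushed via Grafana's API).
jaeger_datasource = {
    "apiVersion": 1,
    "datasources": [
        {
            "name": "Jaeger",
            "type": "jaeger",
            "access": "proxy",
            "url": "http://jaeger-query:16686",
        }
    ],
}
print(json.dumps(jaeger_datasource, indent=2))
```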

Teams love taking ownership of their traces

Several dozen teams have already migrated to our new trace management platform. They are self-sufficient in analysing incidents and handling problems as part of a continuous improvement approach.

Feedback has been great: teams love having a self-service tool available whenever they need it, and they only interact with our observability team for specific technical questions.

For our dedicated component team, observability is a constantly evolving subject. One of its purposes is to make incidents easier to detect in order to reduce the MTTD (mean time to detect) and the MTTR (mean time to repair). Correlating the signals is essential to that.

What’s next?

Observability comprises three types of signals:

  • Metrics: Numerical representations of an aggregated status for a period of time.
  • Logs: A structured, human-readable representation of an event.
  • Traces: A set of metadata that can be linked to the lifecycle of a single entity in the system (for example, a request or a transaction).

Our next project consists of improving the correlation and navigation between the three pillars so we can make the most of this complementary data. The goal after that is to identify weak-signal problems hidden in all the data we gather, thanks to an AI we are currently developing.
