Microservice Observability is a must

Microservice and Distributed Apps — Observability

Bikash Sundaray
TechMintMedia
10 min read · Aug 15, 2022

Today we are going to talk about monitoring and debugging modern microservices and distributed applications: the standard practices that are accepted globally, and the open source tools and technologies that many companies across the globe have adopted. We will touch on the programming languages, frameworks, and frontend and backend technologies that need monitoring and debugging. We will also talk about observability, which applies to distributed systems such as microservice application backends, big data and machine learning.

We will also try to understand all of this in simpler terms.

Let’s understand Metrics and Trace

By metrics, we specifically mean the class of information that lets you talk about the performance of a system, whether that is the different components of a single app or a whole cluster.

Metrics are always numeric: counters, gauges and similar measurements. When we say metrics, the data is already aggregated, so we only have summary information. For example, a metric is something like “website response time is 100 ms” or “the system received 10k new requests in the last 5 minutes”. This numeric data about your application and system is what we call metrics.

Tracing, on the other hand, contains descriptive information rather than counters or figures. Inside trace data we will find things like the TraceId, client IP address, service name, date and time, and payload size. Distributed tracing gives you the full story of a single request. Depending on how much of the system is instrumented, it may or may not contain details of every microservice the request went through to serve the final response.
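To make the difference concrete, here is a minimal Python sketch. The Span class, its fields and the sample values are my own simplified assumptions for illustration, not the data model of any real tracing tool. The same span records answer two different questions: an aggregated metric, and the path of one request.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified shape)."""
    trace_id: str
    service_name: str
    client_ip: str
    start_ms: int
    duration_ms: int
    payload_bytes: int


# Three spans; the first two belong to the same request (same trace_id).
spans = [
    Span("trace-a", "gateway", "10.0.0.7", 0, 120, 512),
    Span("trace-a", "orders", "10.0.0.7", 10, 90, 2048),
    Span("trace-b", "gateway", "10.0.0.9", 500, 80, 256),
]


def avg_response_time_ms(all_spans, service):
    """Metric: one aggregated number, the per-request detail is thrown away."""
    return mean(s.duration_ms for s in all_spans if s.service_name == service)


def services_in_trace(all_spans, trace_id):
    """Trace view: the services a single request passed through, in order."""
    hops = sorted(
        (s for s in all_spans if s.trace_id == trace_id),
        key=lambda s: s.start_ms,
    )
    return [s.service_name for s in hops]


print(avg_response_time_ms(spans, "gateway"))
print(services_in_trace(spans, "trace-a"))
```

The metric tells you how the gateway is doing overall, while the trace view tells you what happened to one specific request.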

In modern applications and systems we need metrics as well as distributed tracing (aka traces). Metrics give us a current performance indicator for our application, whereas distributed tracing helps us analyze the problem when performance goes down, or debug something on a live system. Both can be real-time or near-real-time data, depending on what you have implemented.

Tracing, Metrics and Monitoring

Let me tell you, neither tracing nor metrics are new. They are used in microservices today, but long before that they were already used by larger applications (single-service or monolithic applications, and system monitoring).

Monitoring is used to watch your system’s performance and discover the particular areas that have an issue or need your attention. It is basically a reactive mode: you monitor your system and application so that you can take action when something goes wrong.
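The reactive nature of monitoring can be sketched in a few lines of Python. The metric names and threshold values below are assumptions I picked for illustration. A real setup would read them from a monitoring tool rather than from hard-coded dictionaries.

```python
def check_thresholds(metrics, thresholds):
    """Return an alert message for every metric that exceeds its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"ALERT {name}={value} exceeds {limit}")
    return alerts


# Hypothetical current readings and alerting limits.
current = {"response_time_ms": 450, "error_rate_pct": 0.4}
limits = {"response_time_ms": 300, "error_rate_pct": 1.0}

alerts = check_thresholds(current, limits)
for line in alerts:
    print(line)
```

Notice that the alert only tells you that something is wrong, not where or why.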

Observability gives you the insight to see how each piece of your system (each microservice) fits together, so you can determine what to look at when it matters most. Let’s say you want to see which component or microservice is making the site slow. Observability mostly gives you the answer to that question, whereas monitoring only has summary information. It is closer to a proactive mode.

In the observability space, four cloud native standards have been involved in recent years:

  1. OpenTracing
  2. OpenCensus
  3. OpenMetrics
  4. OpenTelemetry

OpenTelemetry is trying to solve a bigger problem. It merges OpenTracing and OpenCensus, aims to stay compatible with OpenMetrics, and is becoming the single cloud native industry standard data format for the observability area.

OpenTracing is already implemented in OpenTelemetry and is currently in a stable state. OpenCensus is almost at the edge of a stable implementation. Implementing OpenMetrics has just begun. Of the three, OpenMetrics is a bit different, as it is trying to establish the Prometheus format as the de facto standard for metrics data.

Let’s now consider Observability

Observability

The 3 pillars of observability are logs, metrics and traces. If you can say a system is implemented with observability, that means you are already using techniques, tools and technology to generate logs, metrics and traces. You could be using one platform, or multiple tools and technologies, to generate all 3.

Observability is the ability to measure the internal states of a system by examining its outputs. A system is considered “observable” if its current state can be estimated using only information from its outputs. That means there should be a way to tell whether the system is working as expected from an indicator, a metric or something similar.

Don’t confuse observability with monitoring. Observability is a property of the system, like functionality or testability, whereas monitoring is an action you perform to increase the observability of your system.

I know, the last paragraph has confused you a bit, so let’s simplify. There should be a way in your system to look at metrics at any given point in time (like a dashboard). From there I can suspect that something is going wrong (low speed, a drop in traffic, a higher error rate, slow response times). Then I can pick one instance and search its traces (like a list of requests), and by clicking one of them I can see the detailed logs (the exact log of that request). This is basically what observability is about. With monitoring, on the other hand, you watch your system for performance, downtime, service restarts and so on, and get alerts when they happen.

Observability combines traces, metrics and logs for microservice-to-microservice communication, DB communication, message queue communication and so on. For anything happening inside the system, you should have clear metrics, traces and logs. That fulfills observability.
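The drill-down described above (metrics, then traces, then logs) can be sketched in Python as follows. The records and field names are hypothetical. In a real system each step would be a query against your metrics, tracing and logging backends, and the trace ID would be the key that links them.

```python
from collections import Counter

# Hypothetical records standing in for a tracing backend and a logging backend.
traces = [
    {"trace_id": "t1", "service": "checkout", "status": 200, "duration_ms": 80},
    {"trace_id": "t2", "service": "checkout", "status": 500, "duration_ms": 950},
    {"trace_id": "t3", "service": "search", "status": 200, "duration_ms": 40},
]
logs = [
    {"trace_id": "t1", "message": "order placed"},
    {"trace_id": "t2", "message": "calling payment service"},
    {"trace_id": "t2", "message": "payment service timed out"},
]


def error_count_by_service(trace_records):
    """Step 1, metrics: aggregate errors per service (the dashboard view)."""
    return Counter(t["service"] for t in trace_records if t["status"] >= 500)


def failing_traces(trace_records, service):
    """Step 2, traces: list the individual failing requests of one service."""
    return [
        t["trace_id"]
        for t in trace_records
        if t["service"] == service and t["status"] >= 500
    ]


def logs_for_trace(log_records, trace_id):
    """Step 3, logs: fetch the exact log lines of one request."""
    return [entry["message"] for entry in log_records if entry["trace_id"] == trace_id]


errors = error_count_by_service(traces)
suspect = errors.most_common(1)[0][0]        # service with the most errors
bad_trace = failing_traces(traces, suspect)[0]
details = logs_for_trace(logs, bad_trace)
print(suspect, bad_trace, details)
```

The important design point is the shared trace ID: without it, the three data sets cannot be joined.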

Is there any open source tool I can use to create Observability?

Among the best open source observability back-ends are Jaeger, Zipkin and Prometheus. Jaeger and Zipkin are used for distributed tracing, Prometheus is for metrics, and you can use the ELK stack for logging. There is a newcomer to the list under Apache called “Apache SkyWalking”. I personally like the tool but have not tested it yet. Apache SkyWalking gives you a single platform for metrics, traces and logging. If you don’t want to use Apache SkyWalking, then you have to use 3~4 tools to implement observability in your platform.

What is OpenTelemetry and how is it related to Observability?

Recently (in 2021) the observability data format got standardized under the name “OpenTelemetry”. That means there is a specific project, with maintainers, that maintains the standard required for observability: the OpenTelemetry data format and specification. OpenTelemetry combines the OpenTracing and OpenCensus projects and settles on a single telemetry data format to generate metrics, traces, logs and events. Before OpenTelemetry, every observability tool tried to enforce its own format, which promoted vendor lock-in. If you were using tool X, you had to keep using tool X until you reimplemented your code in the format of some newer tool Y.

If you are using the OpenTelemetry format, your application and system generate OpenTelemetry data that can be ingested into any compatible backend system for further analysis. OpenTelemetry’s native data format is called OTLP, but its exporters can also generate vendor-specific data formats. So OpenTelemetry allows you to switch to another observability platform or backend without changing your instrumentation code, so technically there is no vendor lock-in.
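Here is a minimal Python sketch of why this avoids lock-in. It does not use the real OpenTelemetry SDK: the exporter classes, the output formats and the configuration keys are all hypothetical stand-ins. The point is only the shape of the design, where application code creates spans and a separately configured exporter decides the output format.

```python
import json
from typing import Protocol


class SpanExporter(Protocol):
    """Anything that can turn a span into a backend-specific payload."""
    def export(self, span: dict) -> str: ...


class OtlpLikeExporter:
    """Hypothetical stand-in for an OTLP exporter (not the real wire format)."""
    def export(self, span: dict) -> str:
        return json.dumps({"format": "otlp-like", "span": span}, sort_keys=True)


class VendorXExporter:
    """Hypothetical stand-in for a vendor-specific exporter."""
    def export(self, span: dict) -> str:
        return f"vendorx|{span['name']}|{span['duration_ms']}"


def handle_request(exporter: SpanExporter) -> str:
    """Application code: it creates the span and never names a backend."""
    span = {"name": "GET /orders", "duration_ms": 42}
    return exporter.export(span)


# Switching backends is a configuration change, not a code change.
EXPORTERS = {"otlp": OtlpLikeExporter, "vendorx": VendorXExporter}


def exporter_from_config(name: str) -> SpanExporter:
    return EXPORTERS[name]()


print(handle_request(exporter_from_config("otlp")))
print(handle_request(exporter_from_config("vendorx")))
```

Because handle_request only depends on the exporter interface, moving from one backend to another touches the configuration and nothing else.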

Note: even though I am saying it is standardized, it is not an IEEE standard but rather a CNCF-supported project. That’s how the cloud native ecosystem works.

So if you are planning to start Observability for your system, please use OpenTelemetry at your client side and choose any platform or backend to store data.

As of 2022, you will find tons of applications still using OpenTracing as their observability tool. But now OpenTracing itself is asking users to move to OpenTelemetry. OpenTelemetry is natively supported by a number of vendors.

Traces and metrics are stable in the current OpenTelemetry implementation, but logs are not yet part of the current SDK. To produce logs in the OpenTelemetry OTLP format, you need to rely on a logging tool or library that can feed OTLP log data to a collector, for example Log4j. OpenMetrics is still not standardized and is under active development in OpenTelemetry.

How can I start Observability?

To start using OpenTelemetry, there are 2 ways of implementing it:

  1. Manual instrumentation (you use the OpenTelemetry package inside your code and write code specifically to generate metrics, traces and logs)
  2. Auto instrumentation (OpenTelemetry generates metrics, traces and logs for you automatically)
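The two approaches can be contrasted with a small Python sketch. This is not the OpenTelemetry API: start_span, auto_instrument and the fake HTTP client are hypothetical helpers I wrote only to show the idea. Manual instrumentation puts span code inside your business logic, while auto instrumentation wraps existing library functions from the outside.

```python
import functools
import time
import types
from contextlib import contextmanager

recorded_spans = []  # stand-in for an exporter


@contextmanager
def start_span(name):
    """Record how long the wrapped block of code took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        recorded_spans.append(
            {"name": name, "duration_s": time.perf_counter() - start}
        )


def auto_instrument(module_like, func_name):
    """Auto instrumentation: wrap an existing function from the outside."""
    original = getattr(module_like, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        with start_span(f"auto:{func_name}"):
            return original(*args, **kwargs)

    setattr(module_like, func_name, wrapper)


def _get(url):
    return f"response from {url}"


# Hypothetical third-party library whose source we do not edit.
fake_http_client = types.SimpleNamespace(get=_get)


# Manual: the span code lives inside the business logic.
def load_order(order_id):
    with start_span("load_order"):
        return {"id": order_id}


# Auto: one patch call at start-up, the library code stays untouched.
auto_instrument(fake_http_client, "get")

load_order(7)
fake_http_client.get("http://inventory/items")
print([s["name"] for s in recorded_spans])
```

Manual instrumentation gives you full control over what is recorded, while auto instrumentation gets you coverage of common libraries with almost no code.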

You can check your framework’s documentation to see how it allows you to use OpenTelemetry.

List of references for Java and Python based applications.

For more information, you can check the following supported programming languages, which have auto instrumentation support.

More information about Jaeger, Zipkin and Prometheus

#1 Where they store their data — Storage solution

  • Jaeger stores data in Cassandra 3.4+ and Elasticsearch 5.x/6.x/7.x
  • Jaeger is currently experimenting with other databases, such as ScyllaDB, InfluxDB, Amazon DynamoDB and Logz.io
  • Cassandra and Elasticsearch are the primary storage backends supported by Jaeger.
  • Zipkin was originally built to store data in Cassandra, but it later started supporting Elasticsearch and MySQL too
  • Prometheus has its own built-in time series database. It persists data to disk.

#2 Data Format supported by each tool

  • Zipkin supports Thrift, JSON v1/v2 and Protobuf formats
  • Zipkin API https://zipkin.io/zipkin-api/#/default/post_spans
  • Common ways to send data to Zipkin are HTTP and Kafka
  • Jaeger supports Protobuf over gRPC
  • Jaeger also supports Thrift/JSON format over HTTP
  • The Zipkin format is also supported by Jaeger. That means if you are using Zipkin and want to migrate to Jaeger, it is easy.
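To give a feel for the Zipkin JSON v2 format listed above, here is a minimal Python sketch that builds one span payload. The service name, span name, tag and duration are sample values I made up, and the sketch only builds the request body without sending it anywhere.

```python
import json
import secrets
import time


def make_zipkin_v2_span(service_name, span_name, duration_us, trace_id=None):
    """Build one span as a dict using the Zipkin v2 JSON field names."""
    return {
        "traceId": trace_id or secrets.token_hex(16),  # 32 lower-hex characters
        "id": secrets.token_hex(8),                    # 16 lower-hex characters
        "name": span_name,
        "kind": "SERVER",
        "timestamp": int(time.time() * 1_000_000),     # epoch microseconds
        "duration": duration_us,                       # microseconds
        "localEndpoint": {"serviceName": service_name},
        "tags": {"http.method": "GET"},                # string keys and values
    }


span = make_zipkin_v2_span("orders", "get /orders", 1500)

# The v2 endpoint (POST /api/v2/spans) takes a JSON array of spans.
body = json.dumps([span])
print(body)
```

To actually send it, you would POST the body with the Content-Type application/json to the spans endpoint of your Zipkin server. Since Jaeger can also accept the Zipkin format, the same payload shape is useful if you migrate later.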

#3 Companies behind each tool

Jaeger is written in Go and was originally built by teams at Uber. It tracks the work done in each service as units called spans.

Zipkin is written in Java. It was originally inspired by Google’s Dapper and was developed by Twitter. Zipkin is a much older project than Jaeger.

Prometheus was originally developed at SoundCloud.

#4 What else beyond Zipkin, Jaeger and Prometheus?

Well, there are a lot of SaaS platforms that make our life easy for observability. But they are all paid services: you have to pay to store and analyze your data. Also, the data will not be stored in house or in your own cloud cluster; it will be stored in their systems.

The following are a few of the best SaaS platforms for observability. They have all started supporting the OpenTelemetry OTLP format, and they also have their own systems of data collection, i.e. platform-specific instrumentation. I have personally used Splunk and Datadog.

  • Splunk
  • Dynatrace
  • Datadog
  • New Relic
  • Sentry
  • Logz.io

The following diagram shows how the Datadog platform can use OpenTelemetry instrumentation or Datadog’s own instrumentation agent. It depends on how you use it.

Challenges and Solution

How current systems use a complex setup for Observability

Challenges

  • There is no easy way to correlate data across multiple observability systems like Jaeger, Prometheus and Elasticsearch.
  • Multiple tools are still needed to complete the observability setup. There are platforms that combine all 3, but they are paid, like Datadog and Splunk. Of course we have Apache SkyWalking, which is open source and gives the same kind of capability for observability.

The following diagram illustrates the current challenges and how big products have done their setup for observability.

This is how applications these days implement logging, tracing and metrics collection: a different agent for logging, a different agent for tracing and different agents for metrics. For example, Logstash is used as the log collector with the logs stored in Elasticsearch, Jaeger agents are used to collect distributed traces, and Prometheus agents are used to collect metrics and store them in Prometheus or InfluxDB.

What is the way forward to solve the above problem?

The conclusion is that OpenTelemetry has solved the problem of enforcing a standard for tracing and metrics. The logs part is still under development and should be fixed soon.

Here are the links and quick code snippets for the tools we discussed

Jaeger

Jaeger started supporting OpenTelemetry with the Jaeger v1.35 release. In this version Jaeger introduced the ability to receive OpenTelemetry trace data via the OpenTelemetry Protocol (OTLP). Along with that, Jaeger announced the retirement of its client SDKs, as the OpenTelemetry clients will replace them.

Quickest way to start Jaeger in your system

Now send some OTLP data to Jaeger backend and keep exploring Observability.

Apache SkyWalking

Apache SkyWalking is a modern OpenTelemetry backend that can store metrics and traces sent by the OpenTelemetry collector and exporters.

Apache SkyWalking accepts both trace and metrics data.

Zipkin

Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in service architectures. Features include both the collection and lookup of this data.

Prometheus

Prometheus deals with metrics, whereas Zipkin and Jaeger deal with distributed tracing. So you can use Jaeger for distributed tracing, but as those tools have no support for metrics data, you have to choose a tool that can handle metrics, and for that we have Prometheus. Prometheus itself is a time series database plus software that can handle metrics data, and it can be integrated with a visualization tool like Grafana for data visualization.

You also need to set up Grafana and Alertmanager with Prometheus.

I have not gone into much detail about any language-specific implementation, but I can write about that in another article. By now, observability from the OpenTelemetry perspective should be very clear.
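To show what the metrics data Prometheus works with looks like, here is a minimal Python sketch that renders a counter in the Prometheus text exposition format, which is what Prometheus reads when it scrapes an application. The metric name, labels and values are sample data I made up, and in a real application you would normally let a Prometheus client library produce this output for you.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, keyed by label set.
samples = {
    (("method", "GET"), ("code", "200")): 1027,
    (("method", "POST"), ("code", "500")): 3,
}

text = render_counter("http_requests_total", "Total HTTP requests.", samples)
print(text)
```

Each line is one time series: the metric name, its labels and the current value. Prometheus stores these samples over time, and Grafana queries them to draw dashboards.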

Thanks for Reading. Don’t forget to share it if you find it informative.
