Unlocking Observability with OpenTelemetry

What is Observability?

The landscape of software development is shifting from monolithic systems towards distributed microservices. These services are built on top of containers, service meshes, and serverless components, almost always in the cloud. This allows maximum flexibility for development and deployment. It does, however, present a new challenge: the components of a distributed system are created by diverse teams with different standards for reliability (different test coverage, monitoring, etc.). How can an organisation achieve resilience and fault tolerance?

To quote Cindy Sridharan in her book Distributed Systems Observability: A Guide to Building Robust Systems:

In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

No complex system is ever fully healthy.

Distributed systems are pathologically unpredictable.

It’s impossible to predict the myriad states of partial failure various parts of the system might end up in. Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.

The ease of debugging is a cornerstone for the maintenance and evolution of robust systems.

Most organisations ask their operations team to use monitoring tools and to alert engineers when something behaves abnormally. Observability, however, is not just the responsibility of operations teams. It starts with the developers, then the testers, and finally ends with the operations engineers and site reliability engineers (SREs). This is detailed in the aforementioned book as follows:

Observability is a feature that needs to be enshrined into a system at the time of system design such that:

A system can be built in a way that lends itself well to being tested in a realistic manner (which involves a certain degree of testing in production).

A system can be tested in a manner such that any of the hard, actionable failure modes (the sort that often results in alerts once the system has been deployed) can be surfaced during the time of testing.

A system can be deployed incrementally and in a manner such that a roll-back (or roll forward) can be triggered if a key set of metrics deviate from the baseline.

And finally, post-release, a system should be able to report enough data points about its health and behavior when serving real traffic, so that the system can be understood, debugged, and evolved.

Observability pillars

Logs, metrics, and traces are frequently cited as the pillars of Observability.

Logs

An event log is a record of an event that happened at a certain point in time, for example a particular method call. It is always immutable and timestamped, and comes in one of the following three formats:

  • Plaintext, like Apache access logs
  • Structured, for instance, JSON event log
  • Binary, e.g. MySQL binlogs or systemd journal logs
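
To make the difference concrete, the same HTTP request could be logged as an Apache-style plaintext line or as a structured JSON event (the field names below are illustrative):

127.0.0.1 - frank [10/Oct/2020:13:55:36 +0000] "GET /orders/42 HTTP/1.1" 200 2326

{"timestamp": "2020-10-10T13:55:36Z", "client": "127.0.0.1", "method": "GET", "path": "/orders/42", "status": 200, "bytes": 2326}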

An event log on its own is not sufficient; this is why a log stream (a stream of events) is used to follow the path of a request through the system.

Metrics

Metrics are numeric representations of system state measured over fixed time intervals. They give an idea of how the system behaves and enable us to build dashboards that show historical trends.
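
For example, a request counter exposed in the Prometheus text format is just a name, a set of labels, and a value sampled at scrape time (the metric name and labels here are illustrative):

http_server_requests_total{service="my-service", status="200"} 1027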

Tracing

If metrics are geared to describe the system as a whole, tracing is meant to focus on individual requests and their journey through the system:

  • How long did the request spend on a particular service?
  • Did that service fail or succeed? And so on.
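
Conceptually, a trace groups the spans of one request into a tree, each with its own duration and status (the services and timings below are purely illustrative):

GET /checkout                             total 480 ms
└─ my-service: handleCheckout                   460 ms
   ├─ inventory-service: reserveItems           120 ms
   └─ payment-service: charge                   290 ms (error)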

OpenTelemetry in action

OpenTelemetry is an open-source project that aids in the creation and management of telemetry data such as traces, metrics, and logs. It facilitates telemetry ingestion from various sources, leading to a deeper understanding of how issues are interconnected. For more information, please refer to this documentation and this Java project.

Setting up OpenTelemetry in an application

We are using a Java application in this post. In this step, we will add the following dependencies to pom.xml to help us propagate information from one service to another. They also help developers create custom metrics, spans, etc.; a short usage sketch follows the snippet.

<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-api</artifactId>
  <version>0.7.1</version>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-extension-auto-annotations</artifactId>
  <version>0.7.1</version>
</dependency>
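
With these dependencies in place, the application can create its own spans in addition to the ones the agent records automatically. The snippet below is a minimal sketch against the 0.7.x API (class, method, and package names changed in later OpenTelemetry releases, so adjust it to the version you use); the class and span names are hypothetical. The auto-annotations dependency also offers a @WithSpan annotation that achieves the same effect declaratively.

import io.opentelemetry.OpenTelemetry;
import io.opentelemetry.context.Scope;
import io.opentelemetry.trace.Span;
import io.opentelemetry.trace.Tracer;

public class CheckoutService {

    // The instrumentation name identifies who produced the telemetry data
    private static final Tracer tracer = OpenTelemetry.getTracer("my-service");

    public void processOrder(String orderId) {
        // Start a custom span for this unit of work and attach a searchable attribute
        Span span = tracer.spanBuilder("processOrder").startSpan();
        span.setAttribute("order.id", orderId);
        try (Scope scope = tracer.withSpan(span)) {
            // business logic goes here; spans created inside this scope
            // become children of processOrder
        } finally {
            // always end the span so it gets exported to the collector
            span.end();
        }
    }
}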

Running the Java agent

OpenTelemetry provides an agent and a collector that work in a way similar to New Relic, AppDynamics, or Fluentd. Download the agent here. The agent will run within the Docker container of the service using docker-compose.

my-service:
  image: ************.dkr.ecr.our-region.amazonaws.com/myorg/openjdk-mybaseimage:14-alpine-aws-cli-maven
  container_name: my-service
  ports:
    - 8080:8080
    - 5005:5005
  volumes:
    - .:/app
    - ${M2_PATH}:/root/.m2
  working_dir: /app
  environment:
    AWS_ACCESS_KEY_ID: local
    AWS_SECRET_ACCESS_KEY: local
    AWS_DEFAULT_REGION: our-region
    OTEL_TRACE_EXPORTER: otlp
    OTEL_METRICS_EXPORTER: otlp
    OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
    OTEL_RESOURCE_ATTRIBUTES: service.name=my-service,service.namespace=my-namespace,service.instance.id=somehost,service.version=BOGUS
    MAVEN_OPTS: |
      ...........
      -Dspring-boot.run.jvmArguments="-javaagent:/app/opentelemetry-javaagent-all.jar"
  entrypoint:
    - bash
    - -c
  command: >
    "
    ...........
    mvn --file my-service/pom.xml spring-boot:run
    "

Notice the Maven option

-Dspring-boot.run.jvmArguments="-javaagent:/app/opentelemetry-javaagent-all.jar"

and the environment variables

  • OTEL_TRACE_EXPORTER specifies the trace exporter to be used
  • OTEL_METRICS_EXPORTER specifies the metrics exporter to be used
  • OTEL_EXPORTER_OTLP_ENDPOINT defines the endpoint where the collector can be reached
  • OTEL_RESOURCE_ATTRIBUTES defines this service's attributes (name, namespace, instance id, and version)

Running the OpenTelemetry collector

OpenTelemetry provides a Docker image for the collector; here is how to use it (again, adding to docker-compose.yml):

otel-collector:
  image: otel/opentelemetry-collector-contrib
  container_name: otel-collector
  ports:
    - "4317:4317"   # OTLP gRPC receiver
    - "8889:8889"   # Prometheus exporter metrics
    - "8888:8888"   # Prometheus metrics exposed by the collector
    - "14250"       # Jaeger gRPC receiver
    - "14268"       # Jaeger HTTP Thrift receiver
    - "55678"       # OpenCensus receiver
    - "9411"        # Zipkin receiver
    - "1777:1777"   # pprof extension
    - "55679:55679" # zpages extension
    - "13133:13133" # health_check extension
  volumes:
    - ./collector:/etc/otel
  environment:
    AWS_ACCESS_KEY_ID: local
    AWS_SECRET_ACCESS_KEY: local
    AWS_REGION: our-region
  command: ["--config=/etc/otel/config.yaml", "--log-level=DEBUG"]

The above image requires a configuration file (we saved it to `collector/config.yaml`); a copy of the configuration used for our example is below.

# https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver/otlpreceiver
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: promexample
    const_labels:
      label1: value1
    send_timestamps: true
  logging:
    loglevel: debug
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"
    format: proto
  jaeger:
    endpoint: jaeger:14250
    insecure: true
processors:
  batch:
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, zipkin, jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]
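
The service section ties the pieces together: traces received over OTLP are batched and fanned out to the logging, Zipkin, and Jaeger exporters, while metrics go through the same batch processor and are exposed for Prometheus to scrape on port 8889.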

Tracing and monitoring tools

Now all that is left is to configure the tools we use for tracing and monitoring. For this example, we chose two tracing tools (Zipkin and Jaeger) and one tool for monitoring metrics (Prometheus).

Tracing

Jaeger

Add the following snippet to docker-compose.yml

jaeger:
  image: jaegertracing/all-in-one:latest
  container_name: Jaeger
  ports:
    - "16686:16686"
    - "14268"
    - "14250"

Zipkin

Add the following snippet to docker-compose.yml

zipkin:
  image: openzipkin/zipkin:latest
  container_name: zipkin
  ports:
    - "9411:9411"

Monitoring

Prometheus

Add the following snippet to docker-compose.yml

prometheus:
  container_name: prometheus
  image: prom/prometheus:latest
  volumes:
    - ./collector:/etc/prometheus
  ports:
    - "9090:9090"

Prometheus needs its own configuration file, which is read when the container starts.

Add the following snippet to collector/prometheus.yml

scrape_configs:
  - job_name: "my-service"
    scrape_interval: 10s
    static_configs:
      - targets: ["otel-collector:8889"]
      - targets: ["otel-collector:8888"]
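
With this configuration Prometheus scrapes the collector every 10 seconds: port 8889 serves the application metrics published by the collector's Prometheus exporter, and port 8888 serves the collector's own internal metrics.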

Running the application

docker-compose down;docker-compose up;

Capture traces and metrics

Jaeger
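
The Jaeger UI is exposed on port 16686 (http://localhost:16686 when running locally); search by the service name my-service to inspect individual traces and their spans.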

Zipkin
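
The Zipkin UI is available on port 9411 (http://localhost:9411) and shows the same traces, since the collector exports spans to both backends.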

Prometheus
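
Prometheus is reachable on port 9090 (http://localhost:9090); the metrics forwarded by the collector appear there prefixed with the promexample namespace configured earlier.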


Information has been prepared for information purposes only and does not constitute advice.
