Unlocking Observability with OpenTelemetry

What is Observability?

The landscape of software development is shifting from monolithic systems towards distributed microservices. These services are built on top of containers, service meshes, and serverless components, almost always in the cloud. This allows maximum flexibility for development and deployment. It does, however, present a new challenge: the components of a distributed system are created by diverse teams with different standards for reliability (different test coverage, monitoring, etc.). How can an organisation achieve resilience and fault tolerance?

To quote Cindy Sridharan in her book Distributed Systems Observability: A Guide to Building Robust Systems:

In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

No complex system is ever fully healthy.

Distributed systems are pathologically unpredictable.

It’s impossible to predict the myriad states of partial failure various parts of the system might end up in. Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.

The ease of debugging is a cornerstone for the maintenance and evolution of robust systems.

Most organisations ask their operations team to use monitoring tools and to alert engineers when something behaves abnormally. Observability, however, is not just the responsibility of operations teams. It starts with the developers, then the testers, and finally ends with the operations engineers and site reliability engineers (SREs). This is detailed in the aforementioned book as follows:

Observability is a feature that needs to be enshrined into a system at the time of system design such that:

A system can be built in a way that lends itself well to being tested in a realistic manner (which involves a certain degree of testing in production).

A system can be tested in a manner such that any of the hard, actionable failure modes (the sort that often results in alerts once the system has been deployed) can be surfaced during the time of testing.

A system can be deployed incrementally and in a manner such that a roll-back (or roll forward) can be triggered if a key set of metrics deviate from the baseline.

And finally, post-release, a system should be able to report enough data points about its health and behavior when serving real traffic, so that the system can be understood, debugged, and evolved.

Observability pillars

Logs, metrics, and traces are frequently cited as the pillars of Observability.

Logs

An event log is a record of an event that happened at a certain point in time, for example a particular method call. It is always immutable and timestamped, and comes in one of the following three formats:

  • Plaintext, like Apache access logs
  • Structured, for instance, JSON event log
  • Binary, e.g. MySQL binlogs or systemd journal logs
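
To make the difference concrete, the same HTTP request could be logged as an Apache-style plaintext line or as a structured JSON event (the field names below are illustrative):

127.0.0.1 - frank [10/Oct/2020:13:55:36 +0000] "GET /orders/42 HTTP/1.1" 200 2326

{"timestamp": "2020-10-10T13:55:36Z", "client": "127.0.0.1", "method": "GET", "path": "/orders/42", "status": 200, "bytes": 2326}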

An event log on its own is not sufficient; this is why a log stream (a stream of events) is used to follow the path of a request through the system.

Metrics

Metrics are numeric representations of system state measured over fixed time intervals. They give an idea of how the system behaves and enable us to build dashboards that show historical trends.
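
For example, a request counter exposed in the Prometheus text format is just a name, a set of labels, and a value sampled at scrape time (the metric name and labels here are illustrative):

http_server_requests_total{service="my-service", status="200"} 1027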

Tracing

If metrics are geared to describe the system as a whole, tracing is meant to focus on individual requests and their journey through the system:

  • How long did the request spend on a particular service?
  • Did that service fail or succeed? And so on.
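
Conceptually, a trace groups the spans of one request into a tree, each with its own duration and status (the services and timings below are purely illustrative):

GET /checkout                             total 480 ms
└─ my-service: handleCheckout                   460 ms
   ├─ inventory-service: reserveItems           120 ms
   └─ payment-service: charge                   290 ms (error)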

OpenTelemetry in action

OpenTelemetry is an open-source project that aids in the creation and management of telemetry data such as traces, metrics, and logs. It facilitates telemetry ingestion from various sources, leading to a deeper understanding of how issues are interconnected. For more information, please refer to this documentation and this Java project.

Setting up OpenTelemetry in an application

We are using a Java application in this post. In this step, we will add the following dependencies to pom.xml to help us propagate information from one service to another. They also help developers create custom metrics, spans, etc.; a short usage sketch follows the snippet.

<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-api</artifactId>
  <version>0.7.1</version>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-extension-auto-annotations</artifactId>
  <version>0.7.1</version>
</dependency>
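
With these dependencies in place, the application can create its own spans in addition to the ones the agent records automatically. The snippet below is a minimal sketch against the 0.7.x API (class, method, and package names changed in later OpenTelemetry releases, so adjust it to the version you use); the class and span names are hypothetical. The auto-annotations dependency also offers a @WithSpan annotation that achieves the same effect declaratively.

import io.opentelemetry.OpenTelemetry;
import io.opentelemetry.context.Scope;
import io.opentelemetry.trace.Span;
import io.opentelemetry.trace.Tracer;

public class CheckoutService {

    // The instrumentation name identifies who produced the telemetry data
    private static final Tracer tracer = OpenTelemetry.getTracer("my-service");

    public void processOrder(String orderId) {
        // Start a custom span for this unit of work and attach a searchable attribute
        Span span = tracer.spanBuilder("processOrder").startSpan();
        span.setAttribute("order.id", orderId);
        try (Scope scope = tracer.withSpan(span)) {
            // business logic goes here; spans created inside this scope
            // become children of processOrder
        } finally {
            // always end the span so it gets exported to the collector
            span.end();
        }
    }
}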

Running the Java agent

OpenTelemetry provides an agent and a collector that work in a way similar to New Relic, AppDynamics, or Fluentd. Download the agent here. The agent will run within the Docker container of the service using docker-compose.

my-service:
  image: ************.dkr.ecr.our-region.amazonaws.com/myorg/openjdk-mybaseimage:14-alpine-aws-cli-maven
  container_name: my-service
  ports:
    - 8080:8080
    - 5005:5005
  volumes:
    - .:/app
    - ${M2_PATH}:/root/.m2
  working_dir: /app
  environment:
    AWS_ACCESS_KEY_ID: local
    AWS_SECRET_ACCESS_KEY: local
    AWS_DEFAULT_REGION: our-region
    OTEL_TRACE_EXPORTER: otlp
    OTEL_METRICS_EXPORTER: otlp
    OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
    OTEL_RESOURCE_ATTRIBUTES: service.name=my-service,service.namespace=my-namespace,service.instance.id=somehost,service.version=BOGUS
    MAVEN_OPTS: |
      ...........
      -Dspring-boot.run.jvmArguments="-javaagent:/app/opentelemetry-javaagent-all.jar"
  entrypoint:
    - bash
    - -c
  command: >
    "
    ...........
    mvn --file my-service/pom.xml spring-boot:run
    "

Notice the Maven option

-Dspring-boot.run.jvmArguments="-javaagent:/app/opentelemetry-javaagent-all.jar"

and the environment variables

  • OTEL_TRACE_EXPORTER specifies the trace exporter to be used
  • OTEL_METRICS_EXPORTER specifies the metrics exporter to be used
  • OTEL_EXPORTER_OTLP_ENDPOINT defines the endpoint where the collector can be reached
  • OTEL_RESOURCE_ATTRIBUTES defines this service's attributes (name, namespace, instance id, and version)

Running the OpenTelemetry collector

OpenTelemetry provides a Docker image for the collector; here is how to use it (again, adding to docker-compose.yml):

otel-collector:
  image: otel/opentelemetry-collector-contrib
  container_name: otel-collector
  ports:
    - "4317:4317"   # OTLP gRPC receiver
    - "8889:8889"   # Prometheus exporter metrics
    - "8888:8888"   # Prometheus metrics exposed by the collector
    - "14250"       # Jaeger gRPC receiver
    - "14268"       # Jaeger HTTP Thrift receiver
    - "55678"       # OpenCensus receiver
    - "9411"        # Zipkin receiver
    - "1777:1777"   # pprof extension
    - "55679:55679" # zpages extension
    - "13133:13133" # health_check extension
  volumes:
    - ./collector:/etc/otel
  environment:
    AWS_ACCESS_KEY_ID: local
    AWS_SECRET_ACCESS_KEY: local
    AWS_REGION: our-region
  command: ["--config=/etc/otel/config.yaml", "--log-level=DEBUG"]

The above image requires a configuration file (we saved it to `collector/config.yaml`); a copy of the configuration used for our example is below.

# https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver/otlpreceiver
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: promexample
    const_labels:
      label1: value1
    send_timestamps: true
  logging:
    loglevel: debug
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"
    format: proto
  jaeger:
    endpoint: jaeger:14250
    insecure: true
processors:
  batch:
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, zipkin, jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]
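
The service section ties the pieces together: traces received over OTLP are batched and fanned out to the logging, Zipkin, and Jaeger exporters, while metrics go through the same batch processor and are exposed for Prometheus to scrape on port 8889.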

Tracing and monitoring tools

Now all that is left is to configure the tools we use for tracing and monitoring. For this example, we chose two tracing tools (Zipkin and Jaeger) and one tool for monitoring metrics (Prometheus).

Tracing

Jaeger

Add the following snippet to docker-compose.yml

jaeger:
  image: jaegertracing/all-in-one:latest
  container_name: Jaeger
  ports:
    - "16686:16686"
    - "14268"
    - "14250"

Zipkin

Add the following snippet to docker-compose.yml

zipkin:
  image: openzipkin/zipkin:latest
  container_name: zipkin
  ports:
    - "9411:9411"

Monitoring

Prometheus

Add the following snippet to docker-compose.yml

prometheus:
  container_name: prometheus
  image: prom/prometheus:latest
  volumes:
    - ./collector:/etc/prometheus
  ports:
    - "9090:9090"

Prometheus needs its own configuration file, which is read when the container starts.

Add the following snippet to collector/prometheus.yml

scrape_configs:
  - job_name: "my-service"
    scrape_interval: 10s
    static_configs:
      - targets: ["otel-collector:8889"]
      - targets: ["otel-collector:8888"]
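
With this configuration Prometheus scrapes the collector every 10 seconds: port 8889 serves the application metrics published by the collector's Prometheus exporter, and port 8888 serves the collector's own internal metrics.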

Running the application

docker-compose down;docker-compose up;

Capture traces and metrics

Jaeger
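
The Jaeger UI is exposed on port 16686 (http://localhost:16686 when running locally); search by the service name my-service to inspect individual traces and their spans.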

Zipkin
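
The Zipkin UI is available on port 9411 (http://localhost:9411) and shows the same traces, since the collector exports spans to both backends.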

Prometheus
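
Prometheus is reachable on port 9090 (http://localhost:9090); the metrics forwarded by the collector appear there prefixed with the promexample namespace configured earlier.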


Information has been prepared for information purposes only and does not constitute advice.
