A song of Decentralization and Observability: Dance with OpenTelemetry

Aritra Nag
Nordcloud Engineering
13 min read · Aug 3, 2022

Introduction

Following the announcement of general availability for tracing with AWS Distro for OpenTelemetry, there has been quite a buzz in the AWS developer community about building a single pane of glass for tracing in the world of decentralized architectures. In this article, we will describe how observability and OpenTelemetry fit into one of the most critical pillars of the AWS Well-Architected Framework: Operational Excellence.

Observability

In modern technology environments, observability is a process that uses software tools to detect issues by observing both the inputs and outputs of the technology stack. Inputs include application and infrastructure stacks, while outputs include business transactions, user experiences, and application performance.

Observability tools collect and analyze a broad spectrum of data, including application health and performance, business metrics like conversion rates, user experience mapping, and infrastructure and network telemetry to resolve issues before they impact business KPIs.

Over the last several years, enterprises have rapidly adopted cloud-native infrastructure services, such as AWS, in the form of microservices, serverless, and container technologies. In the IT ecosystem, observability refers to the ability to understand an application’s performance based on output data, or telemetry. In decentralized or microservices-based architectures, telemetry can be divided into three major categories:

  • Traces: contextual data about a request through a system
  • Metrics: quantitative information about processes
  • Logs: specific messages emitted by a process or service

Observability in software is about the integration of multiple forms of telemetry data, which together help us better understand how our software is operating. Advantages of integrating them into the software:

  • Effective and efficient monitoring of workloads in the cloud.
  • Better control over, and tracing of, distributed systems.
  • An enhanced bird’s-eye view for IT administrators and system architects over the entire system landscape.

Observability: OpenTelemetry (OTel)

OpenTelemetry (OTel) is a collection of vendor-agnostic open-source tools, APIs, and SDKs that aims to offer an all-in-one framework for collecting telemetry data and distributing it to observability platforms. It is an incubating project managed by the Cloud Native Computing Foundation (CNCF), which also manages Kubernetes alongside a few other open-source container technology projects.

There are multiple facets to OpenTelemetry:

  1. APIs are used to instrument our code to generate traces; most libraries are expected to come with OpenTelemetry capabilities out of the box in the not-too-distant future. We will showcase the AWS Managed Prometheus remote write endpoint in our demo.
  2. SDKs collect that data and pass it to the processing and export stages. The AWS X-Ray SDK has been incorporated into our demo.
  3. In-process exporters, running with the application, translate the telemetry data into specific formats and send it to back-ends, either directly or through a collector.
  4. The out-of-process collector handles data filtering, aggregation, batching, and communication with the various telemetry backends. (A minimal Java sketch of how these pieces fit together follows this list.)
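As a rough illustration of the API/SDK/exporter split, here is a minimal, hypothetical Java sketch using the upstream OpenTelemetry SDK with an OTLP exporter. It is not taken from the demo repository (the demo relies on the AWS X-Ray SDK and the ADOT collector instead), and the endpoint, tracer, and span names are placeholders:

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtelSketch {
    public static void main(String[] args) {
        // Exporter: ships spans to a collector over OTLP/gRPC (endpoint is a placeholder).
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        // SDK: wires the exporter into a tracer provider via a batching span processor.
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        OpenTelemetrySdk otel = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        // API: application code only touches the Tracer/Span interfaces.
        Tracer tracer = otel.getTracer("demo-app");
        Span span = tracer.spanBuilder("do-some-work").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // ... business logic we want traced ...
        } finally {
            span.end();
        }

        tracerProvider.shutdown();
    }
}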

The purpose of OpenTelemetry is to simplify the collection and management of telemetry data to enable developers to adopt observability best practices. OpenTelemetry has support from some of the biggest companies in the tech industry, with active contributions from Microsoft, Google, Amazon, Red Hat, Cisco, and many others.

The ultimate goal for OpenTelemetry is to ensure that this telemetry data is a built-in feature of cloud-native software. This means that libraries, frameworks, and SDKs should emit this telemetry data without requiring end-users to proactively instrument their code.

Benefits:

  1. It provides a standard, vendor-agnostic interface.
  2. OpenTelemetry (OTel) supports the three main types of telemetry data: metrics, traces, and logs.
  3. Flexibility and ease of use are further advantages of OpenTelemetry.

In the upcoming sections, we will showcase an end-to-end example of a microservice written in Java and Spring Boot, deployed in AWS ECS as a service and load-balanced by an AWS Elastic Load Balancer. The AWS X-Ray SDK is used to capture the telemetry data, which is parsed and sent to the Prometheus endpoint and becomes visible in both AWS CloudWatch and AWS Managed Grafana.

Solution Design

Several patterns can be used for deploying telemetry-enabled solutions for observability. The common ones include:

  • Sidecar pattern: A common practice in the observability world is to use sidecars to provide container instrumentation. The advantage of the sidecar pattern is that configuration and troubleshooting are quite easy. However, in a multi-level microservices architecture with thousands of small services, the per-container sidecars add up and make this setup expensive.
  • AWS ECS service pattern: The Amazon ECS service deployment pattern is similar to the DaemonSet pattern in Kubernetes. An Amazon ECS service allows us to run and maintain a specified number of instances of an OTel collector task in ECS. The advantage of this pattern is cost saving: compute costs are reduced because the number of instrumentation containers no longer has a 1:1 relationship with the application containers.

We will be showcasing the second pattern in this blog. However, automating the deployments and CI/CD is out of scope here. We have previously showcased how to enable continuous delivery and continuous deployment for AWS cloud infrastructure-as-code development using AWS CDK. Here is the link to the blog.

We will explain each component of the project and relate it to the corresponding element in the diagram above. Here is the link to the repository in case anybody wants to experience the full flavor of the solution design.

Technology and AWS Services Used

  1. Java Spring Boot application to create a microservice exposing REST APIs
  2. Docker for packaging the service to run on the AWS ECS Fargate container service
  3. AWS services used to showcase tracing: AWS DynamoDB, S3, ELB, Route 53, Managed Prometheus, and AWS ECS/ECR
  4. Infrastructure deployed using AWS CloudFormation scripts

Application High-level Design Overview

The ECS service is written in Spring Boot and mainly consists of two APIs:

  1. /api/healthcheck → This API is primarily for the AWS load balancer and the AWS ECS service to ping and ensure the tasks are running correctly before sending traffic through the AWS ELB DNS endpoint. Once we start the application, traces for this endpoint are also displayed in the AWS X-Ray dashboard console.
  2. /api/index → This API counts the number of AWS S3 buckets by invoking the AWS S3 SDK (list buckets) and fetches a sample count from AWS DynamoDB using the DynamoDB SDK (query). There is also an invocation of an external API (for example, Nordcloud) using a Spring HTTP client (the Feign client shown later). A minimal, hypothetical sketch of such a controller follows this list.
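Below is a minimal, hypothetical Spring Boot controller sketch for these two endpoints. The class and method names are illustrative only, and the DynamoDB query and the outbound Feign call are omitted for brevity:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api")
public class DemoController {

    // AWS SDK v1 client; when the X-Ray SDK instrumentor is on the classpath,
    // its calls show up as subsegments of the request trace.
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @GetMapping("/healthcheck")
    public ResponseEntity<String> healthcheck() {
        // Pinged by the ALB target group and ECS to confirm the task is healthy.
        return ResponseEntity.ok("OK");
    }

    @GetMapping("/index")
    public ResponseEntity<String> index() {
        // Traced downstream call: count the S3 buckets visible to the task role.
        int bucketCount = s3.listBuckets().size();
        return ResponseEntity.ok("Number of S3 buckets: " + bucketCount);
    }
}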

AWS Deployment Landscape: CloudFormation Scripts

We will start the design by creating the AWS cloud landscape to deploy the microservice as a Docker container within AWS Elastic Container Service. Here is an overview of the AWS components that will be created by the AWS CloudFormation scripts provided within the repository.

We will discuss each component and the corresponding AWS CloudFormation script.

  1. AWS VPC with CIDR range 10.215.0.0/16, consisting of 2 public and 2 private subnets along with an Internet Gateway (IGW) and a NAT Gateway

2. AWS ECS cluster for creating ECS Fargate containers

3. AWS ECS service, built from the Java Spring Boot microservice that uses the AWS S3, X-Ray, and DynamoDB SDKs.

Note: Please ensure the task execution role has the following managed policies attached, to enable pushing telemetry data to the AWS Managed Prometheus endpoint: AWSXrayWriteOnlyAccess, AWSXRayDaemonWriteAccess, AmazonPrometheusRemoteWriteAccess.

Also, we are using an AWS ECR repository to hold the dockerized microservice image, and we reference it in the above CloudFormation template under the TaskDefinition component.

4. AWS Elastic Load Balancer and listeners/target groups for configuring and sending traffic towards the AWS ECS service

OpenTelemetry Collector Discovery Configuration

In this section, we will showcase how the OpenTelemetry architecture works in our demo. Using AWS CloudFormation, we will create the AWS ECS service pattern that consumes the AWS Managed Prometheus endpoint created by the CloudFormation script above, and use the AWS service discovery feature to let other services discover this collector service and send their metrics to it. This configuration enables the logs and metrics to be showcased in the AWS CloudWatch dashboard. The AWS CloudFormation scripts below help us create the AWS services required to complete the full architecture.

AWS Deployment Landscape: CloudFormation Scripts

  1. AWS Managed Prometheus endpoint to cater to the ingestion requirements of the traces and metrics data of the services, for consolidating and showcasing them in the AWS CloudWatch or AWS Managed Grafana dashboards.

2. AWS Route 53 private hosted zone for enabling the service discovery feature. An AWS ECS service can register with a friendly and predictable DNS name in Route 53. The hosted zones are updated automatically as our Amazon ECS services scale up or down.

3. AWS ECS service for the OpenTelemetry collector, which takes its configuration from AWS SSM parameters and is registered via the AWS service discovery feature shown above. Once this ECS service is created and service discovery is added, the private internal endpoint is injected via environment variables into the other application ECS services. This endpoint is used to send telemetry data from the multiple services.

Note: We are using the AWS-managed public ECR image public.ecr.aws/aws-observability/aws-otel-collector:v0.12.0 inside the task definition of the above AWS ECS service.

We will be using the AWS SAM CLI for deploying the CloudFormation scripts; it is one of the most common frameworks for deploying AWS CloudFormation templates as serverless application resources. To use the AWS SAM CLI, we need a few standard tools installed: the SAM CLI itself, AWS credentials configured (for example via the AWS CLI), and Docker for local builds.

To build and deploy the application for the first time, we need to run the following:

sam build
sam deploy --guided --capabilities CAPABILITY_NAMED_IAM

Finally, we conclude this section by having set up the AWS infrastructure required to run the AWS ECS service with OpenTelemetry and service discovery.

Note: Implementing the infrastructure using AWS CloudFormation is only one of many options. Personally, AWS CDK is one of my favorites when it comes to infrastructure development. However, since many organizations and development teams still rely on AWS CloudFormation for their cloud infrastructure landscape, we created the setup using AWS CloudFormation as well.

Microservice: Java, Spring Boot, and AWS Dependencies

This section showcases how to instrument the Java Spring Boot application code with the AWS X-Ray SDK and enable the tracing feature.

  1. Maven dependencies: We need to add the AWS X-Ray SDK dependencies to the pom.xml in the project folder.
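The authoritative list is in the repository’s pom.xml; as a rough guide (and only as an assumption about the typical setup), the AWS X-Ray SDK for Java is usually pulled in with entries along these lines, with versions managed through the aws-xray-recorder-sdk-bom:

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-recorder-sdk-core</artifactId>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-recorder-sdk-aws-sdk</artifactId>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-recorder-sdk-aws-sdk-instrumentor</artifactId>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-recorder-sdk-spring</artifactId>
</dependency>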

2. Sampling rules: Add the sampling rules file to the x-ray folder inside the resources section of the source code.
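As an illustration (not the repository’s actual file), a local X-Ray sampling rules document follows the format below; this example traces one request per second plus 5% of additional requests, and skips the health check:

{
  "version": 2,
  "rules": [
    {
      "description": "Ignore health checks",
      "host": "*",
      "http_method": "GET",
      "url_path": "/api/healthcheck",
      "fixed_target": 0,
      "rate": 0.0
    }
  ],
  "default": {
    "fixed_target": 1,
    "rate": 0.05
  }
}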

3. X-Ray tracing filter bean: This filter definition enables tracing of incoming HTTP requests. When we add this X-Ray servlet filter to the application, the X-Ray SDK for Java creates a segment for each sampled request; the segment includes the timing, method, and disposition of the HTTP request.
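A typical way to register the filter in Spring Boot is sketched below; the segment name "otel-ecs-demo" is a placeholder rather than the name used in the repository:

import javax.servlet.Filter;

import com.amazonaws.xray.javax.servlet.AWSXRayServletFilter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class XRayConfig {

    @Bean
    public Filter tracingFilter() {
        // Creates an X-Ray segment named "otel-ecs-demo" for every sampled HTTP request.
        return new AWSXRayServletFilter("otel-ecs-demo");
    }
}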

4. Annotation: @XRayEnabled is used as an annotation on the Spring Boot components to enable the X-Ray features.
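Together with the X-Ray Spring AOP interceptor (the pointcut on our beans mentioned later), an annotated service bean might look like the hypothetical sketch below; the class and method names are illustrative:

import com.amazonaws.xray.spring.aop.XRayEnabled;
import org.springframework.stereotype.Service;

@Service
@XRayEnabled
public class BucketService {

    // With the X-Ray AOP interceptor configured, calls to this method
    // appear as subsegments of the request's trace.
    public int countBuckets() {
        // ... call the S3 SDK and return the bucket count ...
        return 0;
    }
}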

5. Tracing third-party API calls: We have used Spring Cloud FeignClient to make the HTTP calls from the application and enable tracing on them.
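A hypothetical Feign client for the outbound call might look like the following; the interface name and path are placeholders, and it assumes @EnableFeignClients is set on the application class:

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;

// Declarative HTTP client for the external site; the outbound call shows up
// as a remote subsegment in the X-Ray trace once HTTP client instrumentation is enabled.
@FeignClient(name = "nordcloud", url = "https://nordcloud.com")
public interface NordcloudClient {

    @GetMapping("/")
    String fetchHomePage();
}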

The AWS S3 and DynamoDB SDKs have been incorporated as well to showcase how tracing works for AWS service calls. More details and advanced techniques can be found at this link.

We are now done with the instrumentation of the AWS X-Ray code inside the Spring Boot application, which will be deployed as an AWS ECS service and exposed via an HTTP endpoint using an AWS Elastic Load Balancer. Once we deploy the application and start using the endpoint, the Spring Boot application will start emitting traces and metrics, which will be reflected in the AWS X-Ray and AWS CloudWatch dashboards. AWS X-Ray always encrypts traces at rest, and can also be configured to use AWS Key Management Service (AWS KMS) for compliance requirements such as PCI and HIPAA.

Running AWS X-Ray locally: We can always run the X-Ray daemon on the local machine to test tracing and metrics. In this case, the application sends trace data to the local X-Ray daemon, which in turn forwards it to the AWS X-Ray service.
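For reference, and assuming the daemon binary has been downloaded for your platform, running it in local mode against a chosen region looks roughly like this (the binary name and region are placeholders):

./xray -o -n eu-west-1

The -o flag runs the daemon in local mode (skipping the EC2 instance metadata check) and -n sets the AWS region to which traces are forwarded.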

CloudWatch AWS X-Ray Dashboard

Once we start making API calls towards the AWS Elastic Load Balancer endpoint, we can see the tracing details, shaped by the application architecture and design, flowing into the AWS CloudWatch X-Ray dashboard.

Also, because AWS X-Ray is aware of the outbound HTTP calls from the application (e.g. towards https://nordcloud.com), it can identify service boundaries and therefore generate a service map showing the flow of traffic within our application. We can easily see how one client request invokes three services and outbound calls to AWS S3 and AWS DynamoDB.

There will be multiple calls towards the health check endpoint defined above, coming from the AWS ELB target group configuration, and also towards the aws.amazon.com endpoint to retrieve the metadata and temporary credentials needed to invoke the various AWS service SDKs incorporated inside the application.

The individual trace details can be seen below. Since we have added a Spring pointcut on all our beans, the trace shows the class and method handling each part of the user request. Here we can see how parameters related to the request, such as latency and faults, are displayed for each of the traces ingested via the endpoint.

We can also trace exceptions that occur while making changes inside the application. Using the AWS X-Ray dashboard, and having instrumented the AWS SDK code inside the application with the annotation shown above, we can drill down further to the exact line where the exception occurs. For example, the image below shows the exception we were getting while trying to call the AWS S3 SDK with a different credentials profile.

Grafana Managed Dashboard for metrics and tracing

We can also visualize the telemetry data in an AWS Managed Grafana dashboard. There are multiple ways to authenticate and set up Grafana (SAML-based and AWS SSO). Amazon Managed Grafana doesn’t just allow us to analyze data from AMP (Amazon Managed Service for Prometheus); we can also pull in metrics from self-hosted Prometheus or InfluxDB servers, from a range of AWS services including Amazon CloudWatch, AWS X-Ray, AWS IoT SiteWise, and Amazon Elasticsearch Service, as well as from third-party services such as DataDog and Splunk. Based on the solution architecture showcased above, we can see in the examples below how the metrics for the AWS ECS cluster and AWS ECS services are presented in the Grafana dashboard.

CloudWatch AWS ECS Container Metrics

AWS X-Ray Tracing Metrics in Grafana

As we’ve seen, we now have some really powerful tools available in the form of Amazon Managed Grafana, which allow us to pull in metrics from our applications, the runtime platforms they run on, and the infrastructure that sits below. This makes it much easier to spot performance issues, correlate errors, and trace transactions through increasingly complex heterogeneous systems, without having to spend time worrying about the scaling or resilience of our observability platform.

Grafana or AWS CloudWatch

One of the main reasons organizations use Grafana is the flexibility it offers, beyond AWS CloudWatch, in how metrics are stored and manipulated. We can spin up Grafana in a few minutes with very limited knowledge and low ongoing maintenance, and still import our CloudWatch metrics among other data sources. It also has a huge array of plugins, so one dashboard can cover AWS services plus non-AWS ones like Cassandra or MongoDB. It is a tool that can be classified in the “Monitoring” category, and it supports multi-cloud monitoring with the help of plugins.

CloudWatch, on the other hand, is mainly used when organizations have strict data retention policies to enforce. It also works well as a starting point for diving into the realm of observability, since it covers the basics well. AWS CloudWatch is classified under the “Cloud Monitoring” category.

Having noted these differences, no tool can be said to be perfect. We must choose the tool based on our application architecture and organizational needs. For example, if we are planning to launch infrastructure in AWS, it makes sense to use AWS CloudWatch for metrics, logs, and dashboards, as much of it comes preconfigured. If we are planning low-cost monitoring and customized multi-cloud dashboards, Grafana is the better choice.

Conclusion

Finally, we have come to the end of the demo and blog, where we tried to showcase the power of AWS X-Ray to enable tracing within an application and how it helps us triage issues in application development and deployment. In particular, metrics like latency, the various container insights, and log visualization help developers drill down to the exact components to improve.

Reference

  1. https://aws.amazon.com/cloudwatch/features/
  2. https://aws.amazon.com/blogs/opensource/simplifying-amazon-ecs-monitoring-set-up-with-aws-distro-for-opentelemetry/
  3. https://aws.amazon.com/blogs/opensource/deployment-patterns-for-the-aws-distro-for-opentelemetry-collector-with-amazon-elastic-container-service/
  4. https://grafana.com/oss/grafana/


Aritra Nag
Nordcloud Engineering

AWS Ambassador and Community Builder | Senior Cloud Architect