A Deep Dive into OpenTelemetry: Running the OpenTelemetry Demo

Eromosele Akhigbe
AWS in Plain English
13 min read · Apr 23, 2024


In Part 1, we discussed the increasing complexity of modern applications and the challenges of monitoring them effectively. Traditional monitoring methods struggled to keep pace with the growth of these applications, leading to hidden problems and confusion in digital operations. Observability tooling emerged as a solution, offering a more comprehensive approach to monitoring and understanding application behavior.

We also defined observability as the ability to understand the internal state of a system by examining its outputs; it goes beyond monitoring by enabling you to act on what you observe. The article introduced the TEMPLE framework, which challenges the traditional "three pillars of observability" and expands the concept to traces, events, metrics, profiles, logs, and exceptions.

We then introduced OpenTelemetry, an observability framework designed to provide insight into the behavior and performance of software applications. It merged two projects, OpenTracing and OpenCensus, into a unified platform offering APIs, libraries, agents, instrumentation, and standards for collecting telemetry data such as traces, metrics, and logs. We also explored key OpenTelemetry concepts, including observability, data collection across multiple signals, spans, context propagation, signals, instrumentation, the specification, the Collector, sampling, semantic conventions, instrumentation scope, and distributions. Together, these concepts give developers and operators the tools they need to monitor and understand complex distributed systems effectively.

To read more on Part 1, click here.

Part 2 is more hands-on: by the end of this article, you will know how to produce your first traces and metrics using the OpenTelemetry Demo, and how to diagnose errors using the traces and metrics the system generates.

ARCHITECTURE OF THE OPENTELEMETRY DEMO

The OpenTelemetry Demo is composed of microservices written in different programming languages that communicate with each other over gRPC and HTTP, plus a load generator that uses Locust to simulate user traffic.

OpenTelemetry Demo Architecture

Prerequisites

For this demonstration, we'll use an AWS EC2 instance. The demo follows a microservice architecture that runs 18 distinct services, so to follow along smoothly you'll want a machine with more than 8 GB of RAM. Using EC2 means everyone can participate without being held back by inadequate local hardware.

If you do not have an AWS account, you can set it up here

  • AWS account
  • Basic knowledge of Git and Docker

SETTING UP OUR AWS ACCOUNT

  • Log in to your AWS account using the root user
  • Navigate to the EC2 console
  • Let's create a security group for our instance
  • We will create the key pair that we will use to access the server from our terminal
  • Next, we launch an Ubuntu instance using the security group and key pair we created earlier. It's possible to use a smaller instance type, but it must have at least 8 GB of RAM. I am using a t2.xlarge so I can avoid any lag.
  • If you have done everything right, your instance should be up and running after a few minutes.
  • Connect to the instance following the instructions; ensure you run chmod 400 otel-demo.pem before you try connecting to the instance (see the sketch after this list)
  • Copy and paste the commands into your terminal/PowerShell
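Before moving on, make sure the security group you created allows inbound SSH (port 22) from your IP, plus TCP port 8080, which the demo app listens on, so you can reach it from your browser later. For the connection itself, here is a minimal sketch, assuming the key file is named otel-demo.pem and the default Ubuntu AMI user, ubuntu; substitute your instance's public IP or DNS name:

# Tighten the key file's permissions, then SSH into the instance
chmod 400 otel-demo.pem
ssh -i otel-demo.pem ubuntu@<your-instance-public-ip>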

We will be following the OpenTelemetry documentation to set up the project, and we will be making use of Docker. You can follow the process here, or you can stick with me.

To install Docker, follow these steps:

(Note: I am installing Docker on an Ubuntu server. If you are using a different operating system, make sure to select your OS when referring to the Docker documentation.)

  • Run the following commands in your terminal:
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
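# Install the Docker packages: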
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
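# Verify the installation by running the hello-world image: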
sudo docker run hello-world
  • If you installed everything properly, you should see the "Hello from Docker!" message:
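  • Optionally, if you'd rather not prefix every Docker command with sudo, the Docker post-install docs describe adding your user to the docker group; a minimal sketch is below (log out and back in for it to take effect; the rest of this article keeps the sudo prefix):
# Optional: allow running Docker without sudo
sudo usermod -aG docker $USER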

Running the Demo

  • Clone the demo repository by running this command in your terminal:
git clone https://github.com/open-telemetry/opentelemetry-demo.git
  • Change to the demo folder:
cd opentelemetry-demo/
  • Start the demo
sudo docker compose up --force-recreate --remove-orphans --detach
  • This is what you should see:
  • After a while, all your containers should be created and started. NOTE: If your Kafka container is unhealthy, the RAM allocation wasn't adequate to run the containers properly. With 8 GB of RAM you might have to restart the demo a few times before Kafka becomes healthy; with less than 8 GB, you will have to allocate more memory to your machine. You can check container health as shown below.
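  • A quick way to check container status is Docker Compose's built-in listing (run it from the opentelemetry-demo folder):
# List the demo's containers along with their state and health
sudo docker compose ps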
  • Our next step is to access the application from the browser using the public IP address AWS assigned to our instance. Remember the demo app runs on port 8080, so the URL will be http://*********:8080/. If the page doesn't load, try the sanity check sketched below.
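  • A quick sanity check from the instance itself (if this works but the browser doesn't, revisit the security group rule for port 8080):
# Print the HTTP status code returned by the app on port 8080 (a 200 means it's serving)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/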
  • If everything is set up properly, you should see this:
  • To view our traces, we will be using Jaeger. In your browser, go to http://*********:8080/jaeger/ui/
  • Let's play with the application a bit so we can generate some data. Pick a random product, change the currency, change the quantity you want to buy, and add it to the cart.
  • Scroll to the bottom of the page and place the order
  • You should see this at the end
  • We have generated some data, so let's head over to Jaeger to see our traces. We have a total of 19 different services in this application, and each of these services has different operations.
  • For this demo, we will check the traces in our frontend service, but feel free to play with the other services.
  • As you can see in Jaeger, you can add tags to make specific traces easier to find, and you can set minimum and maximum durations for the traces. Once your traces are generated, they are displayed on the right-hand side of the screen.
  • You can select any trace to view its details. If an error occurred, it is flagged as such, so you can easily identify which part of your application is failing and start debugging.

Using Grafana to explore metrics

  • To access Grafana, go to http://*******:8080/grafana/
  • Go to Dashboards and select one of the demo dashboards that come preconfigured
  • You will see the following options; select the demo dashboard
  • You can toggle between the different services and view their span metrics. You can also see each service's error rate (the percentage of requests that result in errors within a specific timeframe) and request rate (the number of requests made to a service within a specific timeframe), as well as the application logs (stored in and queried from OpenSearch) and application metrics.
  • There is also a dashboard for the OpenTelemetry Collector, which lets you monitor the Collector's individual components so failing components can be spotted and debugged quickly. Separate boards cover the receiver, processor, and exporter components, giving insight into the health and performance of each so you can pinpoint issues and keep the pipeline stable and reliable.
  • The Collector dashboard also features general metrics that give an overview of the Collector's performance and health, such as CPU usage, memory utilization, network traffic, and overall activity. Monitoring these lets you assess the Collector's resource utilization and manage it proactively.
  • The third dashboard we will look at is the OpenTelemetry Collector Data Flow dashboard; to understand how it works, you can read about it here.
  • The fourth dashboard focuses on the span metrics generated by the application. It provides detailed insight into the spans produced by the various components of the system; by analyzing span metrics, you can better understand the behavior and performance of individual components, spot bottlenecks, optimize resource allocation, and troubleshoot issues. This more granular view of the telemetry data makes targeted analysis and optimization much easier.

It is also possible to design your own dashboard from scratch, and I will show you how:

  • Go back to the general dashboard page and click on “New dashboard”
  • For this training, we will create our dashboard from scratch, but it is also possible to import an existing dashboard. Click on "Add visualization"
  • There are different platforms you can use as your data source, but for this tutorial we will be using Prometheus
  • After you create a new dashboard, you will see that no data is displayed yet, because you have not specified a query.
  • Select the metric you want to track on that panel
  • After selecting your desired settings, click "Run queries." You can now monitor that metric on the dashboard, and on the right-hand side you can customize the panel to your preferences.

How To Detect Errors In Your Application

The demo provides several feature flags that you can use to simulate different scenarios. Flag values are stored in the src/flagd/demo.flagd.json file; to enable a flag, change its defaultVariant value in that file to "on". There are various feature flags, but we will use two of them to demonstrate how to spot errors in your components (a quick way to inspect a flag entry before editing is sketched after the list below). Feel free to try out the rest here.

  • recommendationServiceCacheFailure: Creates a memory leak through an exponentially growing cache (1.4x growth, triggered by 50% of requests).
  • adServiceHighCPU: Triggers high CPU load in the ad service. If you want to demo CPU throttling, set CPU resource limits.
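Before editing anything, you can peek at how a flag is currently defined; a quick sketch using grep (the context size of 5 lines is arbitrary, and the command assumes you are in the opentelemetry-demo folder):

# Show the recommendationServiceCacheFailure entry with a few lines of context
grep -A 5 '"recommendationServiceCacheFailure"' src/flagd/demo.flagd.json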

recommendationServiceCacheFailure:

  • Go back to your terminal and ensure you are in the opentelemetry-demo folder
  • In the opentelemetry-demo folder, run the command below:
vim src/flagd/demo.flagd.json
  • The following file should open
  • Change the recommendationServiceCacheFailure defaultVariant to “on”
  • Navigate to your Jaeger UI, and you'll observe an error span within the traces associated with your recommendation service. This error span indicates a failure detected while the recommendation service was executing its operations; inspecting it gives you valuable context about the nature of the error, so you can diagnose and address whatever is affecting the service's performance or functionality.
  • You can zoom in to determine the specific details of the error and even access the logs associated with it. This allows for a detailed investigation into the root cause of the error. By analyzing the error logs, you can gain deeper insights into the circumstances surrounding the error occurrence, helping to formulate an effective solution to resolve the issue.
  • If you check your Grafana demo dashboard for the recommendation service, you'll immediately notice an uptick in both error rate and latency. The rise in error rate reflects the increased number of errors the recommendation service is encountering, while the spike in latency points to delays in processing requests. These metrics are early indicators of issues affecting the service's performance and reliability, prompting further investigation and remedial action.

adServiceHighCPU:

  • Go back to your terminal and ensure you are in the opentelemetry-demo folder
  • In the opentelemetry-demo folder, run the command below:
vim src/flagd/demo.flagd.json
  • The following file should open
  • Change the adServiceHighCPU defaultVariant to “on”
  • Upon making this change, head back to your Grafana dashboard. You should observe a noticeable spike in the SpanMetrics dashboard. This spike indicates an increase in the processing time for spans related to the ad service, likely due to the higher CPU usage. By monitoring this spike, you can assess the impact of the change and its effect on the overall performance of the system.
  • You can also check your Jaeger instance to investigate any corresponding changes in the traces associated with the ad service. This allows for a comprehensive analysis of the impact of the change on the system’s behavior and performance.
  • You can also analyze the trace producing the error in Jaeger. This analysis enables you to pinpoint the root cause of the error and understand its impact on the overall system behavior. Additionally, by correlating the error trace with other monitoring metrics in Grafana, you can further contextualize the error and make informed decisions regarding troubleshooting and remediation strategies.
  • You can also experiment with the other feature flags, but remember to stop your instance once you're finished so you don't incur unnecessary costs. Before stopping it, you can shut the demo down as sketched below.
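  • A minimal teardown sketch (run from the opentelemetry-demo folder):
# Stop and remove the demo's containers
sudo docker compose down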

Scenarios like these highlight how OpenTelemetry plays a vital role in quickly finding mistakes or bugs in our code. When we combine tools like Jaeger and Grafana, we get a full picture of how our app works and how well it’s doing. This helps us find problems fast, figure out why they happened, and then fix them. It’s like having a clear roadmap to solving issues and making our app run smoothly. With OpenTelemetry, we have the power to keep our app in top shape and make sure our users have a great experience.

Hello readers! I hope you found this demo to be both informative and enjoyable, exceeding your expectations. If you have any questions, suggestions, or feedback, please feel free to leave them in the comments section below. Don’t hesitate to share this demo with your friends and colleagues who are interested in the world of observability.

You can also reach out to me directly via email at akhigbeeromo@gmail.com. Your engagement and support mean a lot to me! Don’t forget to show some love by clapping for the author if you found the content valuable. Thank you for joining me on this journey of exploration and learning!
