A Beginner's Guide to Distributed Systems and the Role of Telemetry, Observability, and Monitoring

May 5, 2023 (7 min read)

This article gives a simplified introduction to distributed systems and how they work. It covers instrumentation, telemetry, OpenTelemetry, distributed tracing, observability, and monitoring systems. By the end, you should understand what these terms mean.

In short: instrumentation is adding code to your app so that it produces data about itself; telemetry is collecting that data and transmitting it to a monitoring system; tracing is following a request through the system so that the data tells a story you can easily interpret and use to troubleshoot.

DISTRIBUTED SYSTEMS

A distributed system is a type of computer system that consists of multiple interconnected computers working together to achieve a common goal. Let's say you are building a large-scale e-commerce website. You could use multiple web servers to handle incoming requests from users, with each web server responsible for a specific function (product search, checkout, or product details). You could also have multiple database servers to store and retrieve product information, customer data, etc. This is a distributed system. It lets you distribute the workload across multiple servers, which helps maintain high performance and reliability.

Building and managing such a system can be complex because you are dealing with many interconnected components. You can imagine how tedious troubleshooting it would be. At this stage, simply checking the log data won't suffice. Logs are files that record events, warnings, and errors as they occur within a software environment; they can also record information about system performance, usage, and other data. In a distributed system, every application generates its own logs, so it is hard to pinpoint where an error came from. How do you start looking through all those logs to find what you need? You are just seeing data. How do you turn that data into information? That is where OpenTelemetry, distributed tracing, and monitoring come in.
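
To make the e-commerce example a bit more concrete, here is a minimal sketch of how requests might be spread across servers. Everything in it is an assumption made for illustration: the pool names, the server names, and the round-robin strategy are not taken from any particular system.

```python
from itertools import cycle

# Hypothetical server pools: each function of the shop has its own replicas.
SERVER_POOLS = {
    "search": ["search-1", "search-2"],
    "checkout": ["checkout-1", "checkout-2"],
    "product": ["product-1"],
}


def make_router(pools):
    """Build a routing function that round-robins within each pool."""
    iterators = {name: cycle(servers) for name, servers in pools.items()}

    def route(request_kind):
        # Pick the next replica for this kind of request.
        if request_kind not in iterators:
            raise ValueError(f"no server pool handles {request_kind!r}")
        return next(iterators[request_kind])

    return route


if __name__ == "__main__":
    route = make_router(SERVER_POOLS)
    for kind in ["search", "search", "checkout", "search", "product"]:
        print(f"{kind:>8} -> {route(kind)}")
```

Each of those servers would write its own logs, which is exactly why one user request can leave clues in several places at once.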

INSTRUMENTATION

The first step is instrumentation. This is the process of adding code to your software application to collect data about how it is behaving at runtime. You start with your application code and then instrument it. In practice, that means adding code that makes the application's data (such as its logs) readable by third-party tools, which can then send it somewhere.
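
Here is a small sketch of what instrumentation can look like, using only the Python standard library. The `instrumented` decorator, the `CALL_STATS` store, and the `checkout` function are all hypothetical names invented for this example; a real project would more likely use a telemetry library than a plain dictionary.

```python
import functools
import time
from collections import defaultdict

# In-memory store for measurements; a stand-in for a real telemetry library.
CALL_STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(func):
    """Wrap func so every call records its count, failures, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = CALL_STATS[func.__name__]
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise
        finally:
            # Runs on success and on failure, so every call is counted.
            stats["calls"] += 1
            stats["total_seconds"] += time.perf_counter() - started
    return wrapper


@instrumented
def checkout(cart_total):
    # Hypothetical business logic standing in for a real checkout handler.
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return f"charged {cart_total:.2f}"


if __name__ == "__main__":
    checkout(19.99)
    try:
        checkout(0)
    except ValueError:
        pass
    print(dict(CALL_STATS))
```

Notice that `checkout` itself did not change. The decorator adds the data collection around it, which is the whole idea of instrumentation.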

TELEMETRY

Once your code is instrumented, telemetry takes over. Telemetry is a general term for the automated collection and transmission of data from any type of device or system (servers, databases, applications). It does not just collect the data; it also sends it to monitoring systems, which give you a better view of what is going well and what is going wrong in your application once it is in production. This data can include a wide range of information, such as system performance metrics (e.g. CPU usage, memory usage), application-level metrics (e.g. requests per second, response time), error and log data, and user behavior data.
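
The two halves of telemetry, collecting and transmitting, can be sketched in a few lines. This is only an illustration: the field names in the sample are made up, and `send` is a placeholder for whatever network call a real monitoring system would need.

```python
import json
import os
import time


def collect_sample(service):
    """Gather one telemetry sample from the current process."""
    return {
        "service": service,                  # hypothetical service name
        "timestamp": time.time(),            # when the sample was taken
        "pid": os.getpid(),                  # which process produced it
        "cpu_seconds": time.process_time(),  # CPU time used by this process
    }


def transmit(sample, send):
    """Serialize a sample to JSON and hand it to a transport function.

    `send` stands in for the network call to a monitoring system; here it
    is any callable that accepts bytes.
    """
    payload = json.dumps(sample).encode("utf-8")
    send(payload)
    return len(payload)


if __name__ == "__main__":
    outbox = []  # pretend this list is the monitoring system's inbox
    for _ in range(3):
        transmit(collect_sample("web-server"), outbox.append)
        time.sleep(0.1)
    print(f"sent {len(outbox)} samples; last one: {outbox[-1].decode()}")
```

The loop is what makes the collection automated: nobody has to ask the application for its numbers, it keeps sending them on its own.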

OPEN TELEMETRY

OpenTelemetry is a specific open-source project that provides a framework for collecting, processing, and exporting telemetry data in a standard format. Access to organized telemetry data goes a long way in troubleshooting applications or servers and in identifying patterns and trends. Because telemetry can be collected remotely, it is especially useful for large-scale enterprises: you can monitor your applications and see what is working and what is not. Once your code is instrumented, there are many monitoring systems you can export the data to, including Honeycomb, AWS X-Ray, Rollbar, Prometheus, and Grafana.
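
The sketch below illustrates the idea of "collect, process, export in a standard format" with swappable destinations. To be clear, this is not the OpenTelemetry API. The `Record`, `Pipeline`, `JsonLinesExporter`, and `SummaryExporter` classes are hypothetical and exist only to show why a shared format makes it easy to change where the data goes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Record:
    """One telemetry record in a single shared shape (hypothetical schema)."""
    name: str
    value: float
    attributes: dict = field(default_factory=dict)


class JsonLinesExporter:
    """Exporter that renders each record as one JSON line."""
    def __init__(self):
        self.lines = []

    def export(self, records):
        self.lines.extend(json.dumps(asdict(r), sort_keys=True) for r in records)


class SummaryExporter:
    """Exporter that keeps only a running total per record name."""
    def __init__(self):
        self.totals = {}

    def export(self, records):
        for r in records:
            self.totals[r.name] = self.totals.get(r.name, 0.0) + r.value


class Pipeline:
    """Collects records, buffers them, and flushes batches to every exporter."""
    def __init__(self, exporters, batch_size=2):
        self.exporters = exporters
        self.batch_size = batch_size
        self._buffer = []

    def collect(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Swap the buffer out first so a batch is never exported twice.
        batch, self._buffer = self._buffer, []
        if batch:
            for exporter in self.exporters:
                exporter.export(batch)


if __name__ == "__main__":
    json_out, summary = JsonLinesExporter(), SummaryExporter()
    pipeline = Pipeline([json_out, summary])
    pipeline.collect(Record("http.requests", 1, {"route": "/search"}))
    pipeline.collect(Record("http.requests", 1, {"route": "/checkout"}))
    pipeline.flush()
    print("\n".join(json_out.lines))
    print(summary.totals)
```

Because both exporters receive the same `Record` objects, adding a third destination would not require touching the application code that produces the data.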

DISTRIBUTED TRACING

Distributed tracing is a technique for collecting and analyzing telemetry data from distributed systems. It traces the flow of a request across multiple components and services, so you can follow the path the request took. It also provides information about the performance of each component and the communication between them. It is usually integrated with other monitoring and observability tools to provide a comprehensive view of system performance and health. Some monitoring systems have built-in support for distributed tracing, so developers can collect trace data from their systems directly in those tools.
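
A simplified way to picture distributed tracing: one trace ID is created when a request enters the system, and every component records a "span" that carries that ID plus the ID of its parent span. The sketch below simulates three components inside a single process. The function names, the span fields, and the `SPANS` list are assumptions made for illustration; in a real system the IDs would travel between services in request headers.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # finished spans; a real tracer would export these instead


@contextmanager
def span(name, trace_id, parent_id=None):
    """Record one unit of work as a span that belongs to trace_id."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent_id,
        "name": name,
    }
    started = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - started) * 1000
        SPANS.append(record)


def database(trace_id, parent_id):
    with span("database.query", trace_id, parent_id):
        time.sleep(0.02)  # simulated query time


def product_service(trace_id, parent_id):
    with span("product-service.get", trace_id, parent_id) as current:
        database(trace_id, current["span_id"])


def web_server():
    trace_id = uuid.uuid4().hex  # created once, where the request enters
    with span("web-server.request", trace_id) as root:
        product_service(trace_id, root["span_id"])
    return trace_id


if __name__ == "__main__":
    web_server()
    for s in SPANS:
        print(f"{s['name']:<22} parent={s['parent_id']} {s['duration_ms']:.1f} ms")
```

The parent IDs are what let a tool rebuild the path of the request, and the durations show how much of the total time each component took.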

OBSERVABILITY VS MONITORING

Observability and monitoring are both used to keep track of and manage the health and performance of complex systems, but they differ in important ways.

Monitoring shows you that something went wrong; observability helps you understand WHY it went wrong by giving you deeper insight into the system.

Monitoring involves setting up alerts and notifications to let you know when certain thresholds are exceeded or when particular events occur.
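
A threshold alert can be as simple as comparing the latest numbers against a list of rules. In this sketch the metric names, the thresholds, and the alert wording are all invented for illustration; real monitoring tools let you configure these rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # which metric to watch
    threshold: float  # alert when the value goes above this
    message: str      # human-readable explanation


# Hypothetical rules; the numbers are examples, not recommendations.
RULES = [
    Rule("cpu_percent", 90.0, "CPU usage is too high"),
    Rule("error_rate", 0.05, "Too many requests are failing"),
    Rule("response_ms", 500.0, "Responses are slow"),
]


def check(sample, rules=RULES):
    """Return an alert string for every rule whose threshold is exceeded."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(
                f"ALERT {rule.metric}={value} > {rule.threshold}: {rule.message}"
            )
    return alerts


if __name__ == "__main__":
    latest = {"cpu_percent": 97.5, "error_rate": 0.01, "response_ms": 750.0}
    for alert in check(latest):
        print(alert)
```

Note what the alert does not tell you: it says the CPU is high, but not which request or component caused it.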

Observability is more about understanding what is happening inside the system than simply tracking metrics or indicators. It involves collecting data from a wide range of sources and using that data to gain a holistic understanding of the system’s behavior.

Monitoring is focused on tracking specific metrics and indicators to detect issues as they arise, while observability is focused on gaining a deep understanding of the system’s behavior and using that understanding to detect and diagnose issues.

Monitoring is watching for problems as they appear; observability is understanding the inside of a system well enough to identify and fix those problems. With the data you get from observability, you can find the root cause of issues by analyzing rich data in real time, not just metrics with no added context.

Monitoring tools focus on collecting data from specific sources and alerting when those sources indicate an issue. Examples include Rollbar, Prometheus, and Grafana.

Observability tools take a broader approach, collecting and analyzing data from multiple sources (e.g. logs, metrics, traces) to provide a more complete picture of system behavior and performance. By collecting and analyzing this information, you gain a detailed understanding of how different parts of the system are interacting and where issues may be occurring. Examples include Honeycomb and AWS X-Ray.
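
One way to picture the "broader approach" is to line up logs, metrics, and traces for the same request and read them together. The sketch below does that with made-up data. The request IDs, field names, and the `correlate` and `explain` helpers are hypothetical, and real observability tools do far more than this.

```python
from collections import defaultdict

# Hypothetical telemetry from three signal types, all tagged with a request ID.
LOGS = [
    {"request_id": "r1", "service": "web", "level": "INFO", "message": "GET /checkout"},
    {"request_id": "r2", "service": "web", "level": "INFO", "message": "GET /search"},
    {"request_id": "r1", "service": "db", "level": "ERROR", "message": "connection timed out"},
]
METRICS = [
    {"request_id": "r1", "name": "response_ms", "value": 5021},
    {"request_id": "r2", "name": "response_ms", "value": 48},
]
SPANS = [
    {"request_id": "r1", "name": "web.request", "parent": None, "duration_ms": 5021},
    {"request_id": "r1", "name": "db.query", "parent": "web.request", "duration_ms": 5000},
    {"request_id": "r2", "name": "web.request", "parent": None, "duration_ms": 48},
]


def correlate(logs, metrics, spans):
    """Group every signal by request ID so one request can be read end to end."""
    view = defaultdict(lambda: {"logs": [], "metrics": [], "spans": []})
    for kind, items in (("logs", logs), ("metrics", metrics), ("spans", spans)):
        for item in items:
            view[item["request_id"]][kind].append(item)
    return dict(view)


def explain(request_view):
    """Summarize a request: who logged errors, and the slowest downstream step."""
    failing = [log["service"] for log in request_view["logs"] if log["level"] == "ERROR"]
    children = [s for s in request_view["spans"] if s["parent"] is not None]
    slowest = max(children, key=lambda s: s["duration_ms"], default=None)
    return {
        "error_services": failing,
        "slowest_downstream_span": slowest["name"] if slowest else None,
    }


if __name__ == "__main__":
    for request_id, request_view in correlate(LOGS, METRICS, SPANS).items():
        print(request_id, explain(request_view))
```

The metric alone only says that request `r1` was slow. Put next to the log and the spans, it becomes clear that the database is where the time went and where the error came from.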

EXAMPLES OF OBSERVABILITY TOOLS

HONEYCOMB

Honeycomb is best known for distributed tracing, and it also provides features for log and metrics analysis. Suppose a user sends a request and gets an error. How do you know where the error is coming from? Is it the database? Is it the web server? Honeycomb shows you the trace: how the request moved from component to component (web server, database, and so on), from the beginning of the request to the end. Once you can see that full story, you can see where the error is coming from. This lets developers track the path of requests through complex systems, identify performance bottlenecks, and find the root cause of issues quickly. Honeycomb analyzes log data, metrics, and other telemetry data to provide a complete picture of system behavior.

AWS X-RAY

AWS X-Ray enables developers to trace requests as they flow through different components of a system, visualize the relationships between services, and identify performance bottlenecks and errors. It provides a comprehensive view of requests as they flow through a distributed system, making it easier to identify and diagnose issues that may be affecting application performance or reliability.

EXAMPLES OF MONITORING TOOLS

ROLLBAR

Rollbar specializes in error tracking. Once your application is running, Rollbar gives you a visual of each error as soon as it occurs, so you can start troubleshooting right away, and from anywhere. It provides real-time alerts when new errors are detected and offers features for debugging errors directly in the platform. It collects and aggregates error data from different parts of an application, such as front-end and back-end components, and provides a centralized platform for developers to investigate and resolve errors. Rollbar can also monitor and alert on other types of events, such as log data and performance metrics.
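
The general idea behind error aggregation can be sketched without any SDK. The code below is not Rollbar's API; `ErrorTracker` and `fingerprint` are hypothetical names, and grouping by exception type and message is just one simple assumption about how identical errors could be recognized.

```python
from collections import Counter


def fingerprint(exc):
    """Identify an error by its type and message so repeats group together."""
    return f"{type(exc).__name__}: {exc}"


class ErrorTracker:
    """Collects exceptions from any component and counts occurrences."""
    def __init__(self):
        self.counts = Counter()
        self.sources = {}

    def report(self, exc, component):
        """Record an error; return True if it has never been seen before."""
        key = fingerprint(exc)
        self.counts[key] += 1
        self.sources.setdefault(key, set()).add(component)
        return self.counts[key] == 1  # a new error is worth an alert

    def top(self, n=5):
        """Return the n most frequent errors with their counts."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    tracker = ErrorTracker()
    incidents = [
        (ValueError("bad cart"), "back-end"),
        (ValueError("bad cart"), "front-end"),
        (KeyError("sku"), "back-end"),
    ]
    for exc, component in incidents:
        if tracker.report(exc, component):
            print(f"new error from {component}: {fingerprint(exc)}")
    print(tracker.top())
```

Grouping repeats together is what keeps a centralized error view readable: you see one entry with a count instead of the same error listed many times.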

AMAZON CLOUDWATCH LOGS

Amazon CloudWatch Logs is used to collect and monitor log data from multiple sources in a centralized location, making it easier to troubleshoot issues and perform root cause analysis. It offers real-time monitoring of logs, log analysis and visualization. It is used to generate graphs and charts that provide a visual representation of log data over time, allowing users to gain insights into system performance and behavior.

PROMETHEUS

Prometheus is used to collect metrics data from various sources, such as applications, services, and infrastructure components, and it provides a powerful query language for analyzing this data. Prometheus is often used together with other tools, such as Grafana for visualization and Alertmanager for alerting, to provide a comprehensive monitoring and observability solution for distributed systems.
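
Prometheus collects metrics by reading a plain-text page that each application exposes. The sketch below hand-rolls a tiny counter and renders it in that text format. It is not the official Prometheus client library; the `SketchCounter` class and the metric name are made up for illustration, and details such as label escaping are left out.

```python
from collections import defaultdict


class SketchCounter:
    """A value that only goes up, optionally split by labels."""
    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        # Sort the labels so the same label set always maps to the same key.
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render the counter as HELP, TYPE, and one sample line per label set."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{self.name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests_total = SketchCounter("http_requests_total", "Total HTTP requests.")
    requests_total.inc(route="/search")
    requests_total.inc(route="/search")
    requests_total.inc(route="/checkout")
    print(requests_total.render(), end="")
```

Once Prometheus has collected a counter like this, its query language can turn the raw totals into something more useful, for example a per-second request rate with a query such as `rate(http_requests_total[5m])`.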

GRAFANA

Grafana is a visualization tool used to display time-series data. It supports a variety of data sources, including Prometheus and Elasticsearch, and provides a flexible, customizable dashboarding system. Users can create and share custom dashboards that display metrics data in a variety of formats, including charts, tables, and graphs.

Thank you for reading. I hope you found it helpful. Leave a comment if you have any questions, ciao :)