Observability at tb.lx: the key to our product’s success

Every month we will give you a shot of knowledge about various topics, including technology, electric vehicles, engineering, business, and even design. We’ve got all types of expertise in our team, and we want to share it with the world. And don’t worry, the doctors approved it, and it’s clinically proven to open your brain’s doors to amazing knowledge.

Observability is one of the biggest cross-team focuses at tb.lx. Our products are deployed across multiple continents and several time zones, and we need to constantly monitor them to ensure that they are performing as expected.

This article will present the core concepts related to Observability, best practices, and common pitfalls.

Monitor to conquer a smooth product launch

If I asked you about a typical development cycle following an Agile methodology, you'd probably describe something like this (simplified to ignore deployments and tests in development and testing environments):
- Gather requirements for a product or new feature;
- Code features;
- Test;
- Deploy to Production;
- Repeat the cycle again for new features.

This cycle is perfectly fine and complete if you are not responsible for operating the product, or if you're developing a Minimum Viable Product (MVP) that customers are not yet using at full scale.

However, if you are operating your product and have customers using it daily, you may notice a missing step here: how can you ensure that your product is running as expected and that customers are actually using it?

A more complete cycle would include a new step that helps to ensure a new product is running as planned:

- Gather requirements for the product or new feature;
- Code features;
- Test;
- Deploy to Production;
- Monitor application and observe metrics;
- Repeat cycle again for new features.

This new step, monitor application and observe metrics, should be a continuous effort running alongside your development cycle.

At tb.lx, we allocate a team member to this job on a weekly rotation. This way, we ensure that we deliver a high-quality product and that customer support requests are answered as quickly as possible.

What is Observability?

What does Observability mean, and what does the added step of monitoring an application and observing metrics look like? Why is it important?

Quoting Dynatrace and New Relic, two of the biggest companies focused on developing Observability tools, Observability is:

- “The ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces.”

As well as,

- “Proactively collecting, visualizing, and applying intelligence to all of your metrics, events, logs, and traces — so you can understand the behavior of your complex digital system.”

We can see from both quotes that multiple types of telemetry data (logs, metrics, and traces) help us understand how our system is behaving.

These types of telemetry data can be described as follows:

- Metrics — Measurements of how a service or component performs over time. Ex: memory usage, HTTP requests per second.
- Logs — Records of events that occur in a specific system or application, recorded in plain text, structured data, or binary format.
- Traces — Linked events within a single request or transaction, providing a complete picture of how it flows from one point to the next.

Why are Observability and Monitoring important?

Investing time in developing good Observability and Monitoring processes within your team has many benefits such as:

- Detecting slowness or downtime in your system, by detecting faulty infrastructure and dependencies;
- Extracting performance metrics for your system;
- Detecting bugs and collecting important information to understand what caused the issue;
- Collecting business metrics to understand whether your system is achieving your goals and how new features are being adopted;
- Investigating customer support requests and understanding the root cause of their problems.

Keep in mind that finding and fixing bugs early will save you a lot of money, since the cost of finding and fixing bugs increases exponentially in each stage of the product development process.

[Figure: Exponential cost of fixing a bug, showing that the later a bug is detected, the higher the cost of fixing it]

Bugs that your customers report in production can also negatively affect your relationship with them, leading to the end of contracts in extreme cases.

Service Level Objectives

Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) are key concepts to follow and monitor against when talking about Observability.

- SLAs are contractual agreements that you define with your customers. They define the expected performance levels of your system and what happens if you fail to meet them;
- SLOs are objectives that you set within your team taking into account the expected performance level defined in SLAs;
- SLIs are measurable metrics that will help you understand if you’re meeting the goal defined in your SLOs.

To understand the difference between these metrics, consider the examples below:
- SLA — My product will not have a downtime larger than 15 hours per year;
- SLO — My product should have an availability of 99.9% (at most 8.77 hours of downtime per year);
- SLI — Measurement of uptime for specific components of your product.

To help set better, more realistic SLIs, we recommend reading this Atlassian page, Google Cloud Article, and Google SRE book.

Error Budget

Another important concept is the Error Budget. It represents the amount of errors your service can accumulate (measured in a sliding window) before your customers start being unhappy.

An error budget is typically calculated using one dimension of your service, such as availability. Every time your system performs as expected, you accumulate positive error budget; every time your service degrades below a certain target, you accumulate negative error budget.

Suppose your total error budget over a certain period of time is negative. In that case, you must stop new developments, and focus on fixing bugs and improving the reliability of your service. If your total error budget over a certain period is positive, then you can proceed with new developments and push new features to production.
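
To make the math concrete, here is a minimal sketch of how an error budget can be derived from an availability SLO over a 30-day window. The function, the 99.9% target, and the observed downtime value are assumptions for illustration, not part of any specific tool.

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.days
import kotlin.time.Duration.Companion.minutes

// Total error budget: how much downtime the SLO tolerates over the window.
fun errorBudget(slo: Double, window: Duration): Duration = window * (1 - slo)

fun main() {
    val window = 30.days
    val totalBudget = errorBudget(slo = 0.999, window = window) // roughly 43 minutes
    val observedDowntime = 50.minutes                           // e.g. taken from your uptime SLI

    val remaining = totalBudget - observedDowntime
    println("Total budget: $totalBudget, remaining: $remaining")
    // A negative remaining budget is the signal to stop feature work and focus on reliability.
}
```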

Observability Tools

Now that we have defined the core concepts of Observability, let’s dive into the tools that help you understand and visualize what is happening with your system.

There are multiple types of Observability tools, each serving a purpose. A good Observability and Monitoring process should include the following types of tools: Dashboards, Alerts, APM, and Log Visualization. I will describe each type of tool in more detail in the sections below.

Dashboards

Dashboards are a tool that helps visualize metrics that are relevant to your service. You can have a variety of data sources such as Prometheus, metrics extracted via an App Insights agent, databases, and logs (among others).

You should create dashboards that provide a quick look at how the system is behaving, so that you can figure out the root cause of a problem just by looking at them.

Metric collection tools help you not only collect engineering performance metrics, such as the number of requests made to the system, response times, and memory and CPU usage, but also add custom metrics that are relevant to understanding the behavior of your system, such as business metrics, through the tools' Software Development Kits (SDKs).
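
As an illustration of adding a custom metric through an SDK, here is a minimal sketch using Micrometer (the metric collection library we also mention later with Spring Boot). The metric name, the tag, and the OrderMetrics class are hypothetical, purely to show the mechanics:

```kotlin
import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry

// Hypothetical service that records a business metric every time an order is placed.
class OrderMetrics(registry: MeterRegistry) {

    // Custom counter with a tag, so dashboards can break orders down by sales channel.
    private val ordersCreated: Counter = Counter.builder("orders.created")
        .description("Number of orders successfully created")
        .tag("channel", "web")
        .register(registry)

    fun onOrderCreated() = ordersCreated.increment()
}
```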

Keep in mind that while you can see the latest logs for your application using dashboard tools, these tools are not suited for querying logs.

At tb.lx, we are using Grafana.

[Figure: Example of a Grafana dashboard with several types of graphs]

Alerts

Alerts help ensure that your product complies with the defined SLOs and that you quickly address possible incidents. You should define alerts based on your SLIs, so that you are notified when your product is not performing as expected.

Alerting tools typically let you define alerts using metrics, logs, and other data sources such as database information. They also provide a wide range of alert notification actions, such as:
- Sending an email/SMS;
- Calling you;
- Using webhooks to notify people in common workplace chat tools like Slack and Microsoft Teams.

At tb.lx, we are using the Grafana Alert Manager.

[Figure: Grafana Alert Manager dashboard]

APM (Application Performance Monitoring)

APM tools automatically collect traces and metrics for the calls made to your system and for the communication between the different components of your system.

At tb.lx, we are using Microsoft Azure Application Insights (see our post on how to integrate AppInsights with your service).

[Figure: Azure App Insights application map]

Log Visualization Tools

Logs provide crucial information that helps you understand what's happening in your system. You can use them to debug exceptions and errors by understanding, step by step, what caused them. You can also use them to store information relevant to your business that helps you understand what happened in a specific situation, among many other uses.

An application running in production that is used by a decent number of customers can easily generate more than 100,000 log entries per day. If you have ever had the experience of opening a large text or JSON file, you'll know how painfully slow and hard it can be to find specific information inside it. Therefore, the best way to query log entries and extract relevant information is to use Log Visualization tools.

At tb.lx, we use Azure Log Analytics and Grafana Loki, varying by team. Both serve as a central place where all logs are stored and allow querying, which helps us find relevant log entries given keywords or specific fields.

[Figure: Azure Log Analytics dashboard]
[Figure: Grafana Loki dashboard]

Best Practices for Supporting Customer Requests

In our experience operating products in production, we try to extract the following key information from customer support requests:
- The time of an incident;
- The origin of an incident — whether it is happening only in specific cases, or if it’s a generic issue;
- The expected behavior (when not explicit);
- The steps that led to the incident;
- The agent who performed the action that led to the incident.

This key information helps us figure out which component of the system is causing the incident, leading to quicker debugging and faster resolution or mitigation.

A good Observability setup can help us figure out what’s happening, even if the information provided is not complete or precise. Without an observability process in place, it will take longer to debug and find the root cause.

To set up your Observability tools in the best way possible to solve issues and provide quick feedback to your customers, ensure that you add relevant logs to your code and customize telemetry events to contain the following information:
- A timestamp;
- The customer ID;
- Other IDs relevant to your product (transaction ID, product ID, etc);
- Trace ID — aggregate all actions with a common ID;
- Identification of the user performing the requests (when relevant).

It is key to be careful not to add sensitive information to your logs and telemetry events.

In the following example of a dummy e-commerce application, we are using structured logging in the JSON format, and we added three custom fields, trace-id, user-id and order-id. When using a log visualization tool, we can query these fields directly, since they are not part of the message object. As a result, we can filter all our logs by a specific user or order.

Moreover, if the trace-id is exposed to the customer and they can provide it to us, debugging a support request becomes much quicker, since we can look at the exact trace that caused the incident.

[Figure: Example of logs with custom fields]
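
Here is a minimal sketch of how such custom fields can be attached in code, assuming SLF4J with a JSON log encoder (for example logstash-logback-encoder) that writes MDC entries as top-level fields; the field names and the OrderService class are illustrative:

```kotlin
import org.slf4j.LoggerFactory
import org.slf4j.MDC

class OrderService {
    private val log = LoggerFactory.getLogger(OrderService::class.java)

    fun placeOrder(traceId: String, userId: String, orderId: String) {
        // MDC entries end up as separate JSON fields, so the log visualization
        // tool can filter directly on trace-id, user-id, or order-id.
        MDC.put("trace-id", traceId)
        MDC.put("user-id", userId)
        MDC.put("order-id", orderId)
        try {
            log.info("Order placed")
        } finally {
            MDC.clear() // avoid leaking the fields into unrelated log lines
        }
    }
}
```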

Common Pitfalls

In my experience, I have come across several common pitfalls when setting up Observability tools that lead to incorrect setups and inefficient usage of the tools.

The first pitfall is to trust that the framework will collect all data automatically.

While this might be true and most frameworks work out of the box without major customizations, you should always validate that your framework is configured correctly and that your information is collected and stored as expected.

Precision is key for a good Observability setup, so ensure that you are working with the full data set, and that your tool is not throttling requests or sampling data.

A second common pitfall is trusting that your framework logs all relevant data.

Ensure compliance with local data protection and storage laws. As a general rule of thumb, ensure that no private or sensitive information is logged.

In the following dummy example, we see an order being logged that contains the user's email and the credit card number. Logging this information is a major privacy and even security breach. If you must log certain fields to identify the user and payment method, ensure that you use non-personal, non-sensitive information, such as IDs generated by your system or masked values. In this example, we should replace the email with the user ID and the credit card number with just the last four digits, or with an ID as well.

[Figure: Example of logs containing personal information]
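
A minimal sketch of this kind of masking could look like the following; the Order class and field names are hypothetical, and the point is simply to log stable, non-personal identifiers and only the last four digits of the card:

```kotlin
// Hypothetical order representation; only non-sensitive projections of it are logged.
data class Order(val userEmail: String, val creditCardNumber: String, val userId: String)

// Keep only the last four digits of the card number for logging purposes.
fun maskCard(cardNumber: String): String =
    "****" + cardNumber.takeLast(4)

fun logSafeFields(order: Order): Map<String, String> = mapOf(
    "user-id" to order.userId,                   // instead of the email address
    "card" to maskCard(order.creditCardNumber)   // instead of the full card number
)
```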

Another important thing to consider about logs is to ensure that they are meaningful and that objects are correctly displayed. This can be an issue in Object-Oriented Programming languages such as Java and Kotlin if we don't override the toString() method (Kotlin data classes implement this method automatically).

[Figure: Logs with an incorrect class display]
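
In Kotlin, for example, switching to a data class (or overriding toString() yourself) is enough to make the log line readable; the classes below are just an illustration:

```kotlin
// A plain class logs as something like "PlainOrder@1b6d3586" (class name plus hash code).
class PlainOrder(val id: String, val total: Double)

// A data class generates a readable toString() automatically: "Order(id=42, total=19.9)".
data class Order(val id: String, val total: Double)

fun main() {
    println(PlainOrder("42", 19.9)) // unreadable default toString()
    println(Order("42", 19.9))      // meaningful output in your logs
}
```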

One last tip about managing logs: check your logging levels and make an effort not to log unnecessary information. You might be using dependencies that log a lot in DEBUG mode, and if you set the minimum logging level of your service to DEBUG, you'll be flooded with logs from those dependencies alongside all your other logs. As a rule of thumb, we run services in production with a minimum level of INFO, adjusting the logging level of specific dependencies when relevant.

Configuring correct logging levels will help you find relevant information much faster since you remove all the noise, and will also save you money since most logging tools in the cloud will charge you per GB of usage.
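
How you set these levels depends on your logging framework. As an illustration, assuming Logback behind SLF4J, the root level and the level of a noisy dependency could be adjusted programmatically as shown below; in practice you would usually do this in logback.xml or your framework's configuration properties, and the package name here is hypothetical:

```kotlin
import ch.qos.logback.classic.Level
import ch.qos.logback.classic.Logger
import org.slf4j.LoggerFactory

fun configureLogging() {
    // Run the service at INFO by default.
    val root = LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME) as Logger
    root.level = Level.INFO

    // Silence a hypothetical chatty dependency down to WARN.
    val noisyDependency = LoggerFactory.getLogger("com.example.noisy.library") as Logger
    noisyDependency.level = Level.WARN
}
```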

Finally, a third common pitfall is to trust that out of the box metrics are enough to get all the information you need.

This is especially relevant if you're using a stack like the Prometheus and Micrometer dependencies in Spring Boot, which will automatically collect engineering metrics such as memory usage, CPU usage, process uptime, the number of requests received and performed as a client, the duration of requests, etc.

While these metrics are great and can help you have insights into how your product is behaving, keep in mind that frameworks don’t know what’s relevant to your business. Relevant information might not be automatically collected, and you’ll need to add information to telemetry events through tags or even by creating your own telemetry events to display missing information.

A good example is the metrics collected out of the box by Prometheus and Micrometer. These tools capture requests but won't show the payload of a request, only its HTTP method and URL.

Picking up on the dummy e-commerce application example once more, if you want to filter orders by the user performing them, and the user ID comes in the body of the request or in a token in a header, you will not be able to see it with out of the box metrics.

[Figure: Prometheus metrics for a POST orders request]

If you want to have this information, you’ll need to add filters to your code to fetch the user id from the payload or token and add it as a tag to the events.
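
A minimal sketch of that idea, assuming the Jakarta servlet API, Micrometer, and a hypothetical X-User-Id header (in a real system you would more likely decode the user ID from the authentication token): the filter records its own counter tagged with the user ID, since the built-in request metrics won't carry it. Keep an eye on tag cardinality if you have many distinct users.

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import jakarta.servlet.Filter
import jakarta.servlet.FilterChain
import jakarta.servlet.ServletRequest
import jakarta.servlet.ServletResponse
import jakarta.servlet.http.HttpServletRequest

// Hypothetical filter that enriches request metrics with the calling user.
class UserIdMetricsFilter(private val registry: MeterRegistry) : Filter {

    override fun doFilter(request: ServletRequest, response: ServletResponse, chain: FilterChain) {
        val userId = (request as? HttpServletRequest)
            ?.getHeader("X-User-Id")   // assumption: the caller sends its ID in a header
            ?: "unknown"

        // Custom counter tagged with the user ID; beware of high tag cardinality.
        registry.counter("orders.requests", "user-id", userId).increment()

        chain.doFilter(request, response)
    }
}
```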

Conclusion

In this article we presented the core concepts of Observability, and why we are focusing so much on this topic at tb.lx. A good Observability setup will help you deliver better products and provide better support to your customers.

We also introduced some best practices to consider when logging information and observing telemetry events, along with a set of common pitfalls to be aware of, so that you can validate that you are using your Observability tools efficiently.

This article was written by André Faustino, Senior Software Engineer at tb.lx, based in Lisbon, Portugal. 🚛🌿.
