Observability: If It Cannot Be Measured, It Cannot Be Improved

Published in Flux IT Thoughts · Aug 19, 2024

When we think about observability, we tend to focus on logs, metrics, and errors, but in reality, it encompasses much more than that. Understanding how a product works, beyond its visible features, helps us identify what needs improvement and how to prioritize such improvements.

Can the economic impact of code be measured beyond the feature that has been developed?

Well, this is no easy task, especially because we need to separate what we, as developers, mean by impact from what the business perceives as impact. We can put this into practice by asking ourselves some basic questions:

Does what I am developing involve direct interaction with the business flow?
Am I automating a work process that previously took a lot of time? If so, how much has the execution time of that process improved? Can the improvement be measured in terms of money generated or saved?

Let’s consider some concrete examples:

While browsing Twitter (X), I found a hilarious interaction between the CEO and founder of Shopify (a SaaS e-commerce platform that operates even in Argentina) and a Nike customer who was trying to purchase items from Nike’s online store. The customer mentioned that it took a total of 7 minutes to complete the purchase, and decided to express their frustration on the social network, suggesting that Nike should migrate its e-commerce to Shopify.

According to Digital Commerce 360’s website, in fiscal year 2024 (which ran from June 2023 to May 2024), Nike achieved a total revenue of about $51 billion. This is undoubtedly an impressive figure, but one that is to be expected for a brand that operates on every continent and has offices in 45 countries.

Most probably, Nike has highly refined observability processes within its own e-commerce platform that help the company make decisions to improve its products and the overall user experience. But let’s imagine a chaotic scenario where the problem is extremely challenging and takes days to fix because it involves multiple teams, including back-end, front-end, infrastructure, DevOps, DevSecOps, and QA teams. Let’s also imagine that this issue occurs in Nike’s main e-commerce platform, the one in the United States, where demand is highest.

In this scenario, it would probably take two days to solve the issue with all the teams involved and to deliver the solution to the production environment so that its customers could shop normally again. At that point, observability would allow us to identify the company’s losses by reviewing metrics.

One approach is to instrument a specific process, such as the purchase flow, so that we can see how the app behaves along it on both the front end and the back end.

The following chart illustrates how we could implement metrics for this particular issue.
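As a rough illustration of the kind of metrics such a chart might track, the sketch below counts purchase attempts and their outcomes in memory and derives a failure rate. The class `CheckoutMetrics`, the outcome names, and the sample events are assumptions made for this example; a real system would more likely export these counters to a metrics backend.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CheckoutMetrics:
    """Hypothetical in-memory counters for a purchase flow."""
    counts: Counter = field(default_factory=Counter)

    def record(self, outcome: str) -> None:
        # outcome is one of "started", "completed", "failed" (assumed names)
        self.counts[outcome] += 1

    def failure_rate(self) -> float:
        started = self.counts["started"]
        # Avoid dividing by zero when nothing has been recorded yet.
        return self.counts["failed"] / started if started else 0.0


metrics = CheckoutMetrics()
sample_events = ["started", "completed", "started", "failed",
                 "started", "failed", "started", "completed"]
for outcome in sample_events:
    metrics.record(outcome)

print(f"purchases started: {metrics.counts['started']}")
print(f"failure rate: {metrics.failure_rate():.0%}")
```

Tracking the number of failed purchases, rather than only raw error logs, is what later lets us translate an incident into lost sales.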

With this data, we can ask ourselves some important questions.

Suppose we were certain that over these two days, 50,000 users were unable to complete a purchase on Nike’s e-commerce platform in the U.S. As a point of reference, if an incident put 3% of annual net sales at risk, the calculation would be:

Total annual sales: $51,000,000,000.

If we calculate 3% of that amount (0.03 multiplied by $51,000,000,000), we obtain:

$1,530,000,000.

How could we calculate the total loss in sales for our scenario? Well, if the average purchase value per user were $1,000, we would multiply 50,000 by 1,000, which results in a loss of $50,000,000 (fifty million dollars), or roughly 0.1% of annual sales.
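The arithmetic above can be written out as a short script. The figures are the hypothetical ones used in this scenario (50,000 affected users, an average purchase of $1,000, a two-day incident), and the variable names are illustrative. The script also prints what share of annual sales the estimated loss represents.

```python
ANNUAL_SALES = 51_000_000_000   # total annual revenue cited above, in USD
AFFECTED_USERS = 50_000         # hypothetical users who could not buy
AVG_PURCHASE = 1_000            # hypothetical average purchase value, in USD
OUTAGE_DAYS = 2                 # hypothetical duration of the incident

# Integer arithmetic keeps the dollar amounts exact.
three_percent = ANNUAL_SALES * 3 // 100
total_loss = AFFECTED_USERS * AVG_PURCHASE
loss_per_day = total_loss // OUTAGE_DAYS
share_of_sales = total_loss / ANNUAL_SALES

print(f"3% of annual sales:    ${three_percent:,}")
print(f"estimated total loss:  ${total_loss:,}")
print(f"estimated loss per day: ${loss_per_day:,}")
print(f"share of annual sales: {share_of_sales:.2%}")
```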

Now, let’s ask ourselves:

Does what I am developing involve direct interaction with the business flow?

Answer: The solution allowed users to return to shopping normally, so it directly interacts with the business flow.

Am I automating a work process that would previously take a lot of time? If so, how has the execution time of that process been improved? Can it be measured in terms of the amount of money that has been generated or saved?

Answer: We are not automating a process. However, the time needed to complete a purchase has improved, from the 7 minutes the customer reported to a normal e-commerce purchase flow. In terms of money, we can say that we have avoided a loss of $25,000,000 per day ($50,000,000 over the two days), which makes the fix highly valuable for the company.

What role does traceability play in this solution?

With a proper tracing implementation, we can get a fairly complete picture of why the bug occurred, where it originated, and which factors contributed to it. It would be wrong to assume that the error appears in every part of the app. Most likely, it only happens in the mobile version of the e-commerce platform, in a specific section, or in a microservice that was overwhelmed by the number of requests it received. There are many possibilities, and tracing helps narrow them down.
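To make this concrete, the sketch below scans a handful of recorded spans and reports where the errors cluster. The span fields (`service`, `platform`, `status`), the service names, and the sample data are all hypothetical; real tracing backends expose richer attributes and their own query tools.

```python
from collections import Counter

# Hypothetical span records, roughly as a tracing backend might return them.
spans = [
    {"service": "checkout-api", "platform": "mobile", "status": "error"},
    {"service": "checkout-api", "platform": "web", "status": "ok"},
    {"service": "payments", "platform": "mobile", "status": "error"},
    {"service": "checkout-api", "platform": "mobile", "status": "error"},
    {"service": "catalog", "platform": "web", "status": "ok"},
]


def error_hotspots(records):
    """Count failed spans per (service, platform) pair, most frequent first."""
    failed = Counter(
        (s["service"], s["platform"]) for s in records if s["status"] == "error"
    )
    return failed.most_common()


for (service, platform), count in error_hotspots(spans):
    print(f"{service} on {platform}: {count} failed span(s)")
```

Grouping failures this way is one simple method of checking whether a bug is confined to one platform or one service before pulling in every team.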

In general terms, how do traceability and trace propagation work?

This is done through something called a “span,” which represents a single unit of work within a process. Related spans are linked into a trace, which can start in one place and end in another. For example, a trace could start with a user request from the front end and reach a URL in an API written in Go, which calls another microservice built in Node.js to process data from a database and hand back a report. Yes, it sounds a bit confusing, so let’s illustrate it.

Passing the trace context from one service to the next is known as propagation, and it is essential if we want to understand all of our app’s states and processes while providing meaningful context that helps us identify where a problem lies without the need to guess.
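A minimal sketch of propagation, simulated inside a single Python process, might look like the following. It uses the `traceparent` header layout from the W3C Trace Context recommendation (`version-traceid-parentid-flags`); the helper names (`start_trace`, `inject`, `extract_child`) and the dictionary-based span are hypothetical, and a real project would normally rely on a tracing library such as OpenTelemetry rather than hand-rolled code.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def start_trace() -> dict:
    """Create a root span, e.g. for a request leaving the front end."""
    return {"trace_id": secrets.token_hex(16), "span_id": new_span_id(), "parent_id": None}


def inject(span: dict) -> dict:
    """Build the outgoing HTTP headers that carry the trace context."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}


def extract_child(headers: dict) -> dict:
    """Read incoming headers and start a child span in the same trace."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return start_trace()  # no valid context: begin a new trace
    trace_id, parent_id, _flags = match.groups()
    return {"trace_id": trace_id, "span_id": new_span_id(), "parent_id": parent_id}


# front end -> Go API -> Node.js service, with each hop simulated locally
frontend = start_trace()
api = extract_child(inject(frontend))
report_service = extract_child(inject(api))

for name, span in [("frontend", frontend), ("api", api), ("report-service", report_service)]:
    print(name, span)
```

Every span shares the same `trace_id`, while each `parent_id` points at the span that made the call, which is what lets a tracing tool rebuild the full path of a request.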

Nike’s team would most likely rely on observability to respond quickly to an issue like the one we imagined earlier, especially if it affected many users.

This example demonstrates that observability has an impact far beyond the technical aspects. When properly implemented, it can speed up improvement processes or the development of new features that are crucial for the company.



Written by Nahuel Segovia