Observability offers promising benefits. Don’t dismiss it as a buzzword.

Brian Gastwirth
DayBlink Consulting
10 min read · Jul 5, 2022

Introduction

Modern application systems are significantly more complex than they were a decade ago. Organizations aiming to scale effectively have adopted distributed microservice ecosystems in lieu of monolithic architectures. Microservices allow organizations to deliver efficiently and decrease time to market. This is great for the customer, but innovation often comes with a tradeoff. So where is the cost?

Distributed systems present new challenges to companies that use traditional monitoring tools and techniques. Google defines monitoring as a solution or tool that allows teams to watch and understand a system’s performance over time. Monitoring requires knowledge of what is important to track in advance. To monitor successfully, teams must collect and predefine sets of metrics that reveal how their system is performing.

Still, monitoring is limited because it deals with known unknowns. When one knows which questions to ask but does not yet have the answers, monitoring will succeed. If a system fails in similar ways over time, it is easy to develop a playbook to troubleshoot and resolve the issue. However, modern distributed systems fail in unpredictable and unfamiliar ways. What happens when alerts represent an issue no one has seen before? Failures in distributed systems represent unknown unknowns: cases where one does not even know which questions to ask. This is where observability can help.

Observability (or o11y) is the ability to gauge the internal states of a system purely by examining its outputs. In an IT context, observability tools filter out noise from different data sources, enabling teams to investigate what caused specific events and pinpoint the source of the problem. For example, imagine a ferry transporting passengers from Athens to Santorini. How much can you glean from the outside about whether the ferry’s engine is working? If the ferry leaves the dock and speeds away into the Mediterranean, you can safely assume the engine is functioning. But you still know little about the engine’s actual health — you would need insight into things like fuel efficiency, top speed, engine temperature, and emissions. Tools like engine sensors would provide higher observability. Ideally, if there were a problem with the engine, observability would allow the crew to identify the issue’s root cause before passengers even realized something was wrong.

The ferry example can be taken a step further to address the customer experience. How are passengers doing? To gauge this, you would need to collect data on seating comfort and availability, the boarding and deboarding processes, bathroom cleanliness, quality of food and drink options, and more. Maybe this is accomplished via surveys, tracking and examining the questions and complaints heard by staff members, or analyzing on-boat sales data. These activities all increase observability and directly affect passenger satisfaction.

Think of monitoring as an activity, and observability as a larger characteristic of an ecosystem. You cannot achieve observability without doing monitoring. In a world of unknown unknowns, observability tools can provide teams with context, data correlation, and the ability to drill down until a root cause is isolated.

Observability is intended to work with monitoring, not replace it

As a result, more organizations are implementing observability and experiencing positive results. A 2021 Honeycomb survey found that 69% of teams using observability practices can immediately identify when a problem arises and understand its impact on other systems. Observability is a relatively new space with evolving applications, and for every evangelist there is a skeptic. This paper describes observability’s potential benefits and outlines the conditions that would make it feasible for an organization. IT and cybersecurity practitioners should avoid brushing observability aside — as offerings continue to improve, observability will alter the landscapes of both industries.

Key considerations when assessing observability tools

IT and cybersecurity decision-makers should focus on the three areas below when evaluating different observability vendors and their tools:

  • How adding observability affects the organization’s incident response plans
  • How well observability tools integrate with the organization’s current tech stack
  • Whether system data is high enough in cardinality and dimensionality to make observability suitable

Impacts on incident response plans

Incident response (IR) plans are critical in minimizing damage from security threats. They help teams prepare for, detect, identify, and recover from breaches. Organizations generally have formal Standard Operating Procedures in place to limit panic and wasted time amongst IT and security teams responding to pressing incidents.

For an incident response plan to be effective, the right people need to be alerted about the right types of incidents. The alerts themselves must be actionable, understandable, and acknowledged within agreed-upon timeframes to minimize toil. Once alerted, responders must also know how to effectively use the tools available so that they can acknowledge and resolve incidents rapidly.

During incidents, observability helps by eliminating silos between data sources. When an organization is facing an unfamiliar problem, the presence of aggregated and correlated data can speed up troubleshooting. After incidents, teams hold postmortems to discuss what happened, what the lessons were, and how to adapt for future incidents. Observability adds value here by shining a light on the techniques of attackers. Monitoring explains when a system broke and provides log data about the event. Observability expands on this by providing insight into what happened before, during, and after the incident through real-time, clickable dashboards and health trends. It is the access to these underlying details that teaches a more powerful lesson. Figure 2 illustrates how observability can shorten the time it takes to identify issues compared to monitoring. A reduction in time to identify means a reduction in time to resolve, all else equal.

Observability can decrease the MTTI, decreasing the MTTR as a result

If moving to observability will not disrupt existing incident response processes, then the increased automation of monitoring, discovery, and alerting will help an organization respond to threats more efficiently. Observability platforms continuously monitor threats and dependencies. This tells organizations if they are affected, how widespread the problem is, and what to prioritize in their response.

Even with executive buy-in, implementing observability requires documentation and training plans to educate employees who are accustomed to current tools. Add to that the effort needed to solidify communication channels, assign responsibilities, and update alerting. Such exercises must be institutionalized so teams are prepared when stakes are highest. The level of effort needed to update incident response processes should be closely considered when determining if observability is prudent.

Of course, implementing and achieving true observability is easier when an organization’s current tech stack integrates with a new observability solution — another important consideration.

Ease of integration with current tools and data

Some observability platforms will integrate with your organization’s existing tools more readily than others. Platforms need to support the languages and structure of your ecosystem in order to provide valuable insights. Without this, observability efforts may fail. If implementing observability requires a redesign of the tech stack, it may be difficult to justify.

Monitoring, Business, DevOps, and Security tools should all integrate with your observability solution. These integrations eliminate context-switching and reduce human error; flipping between different platforms distracts developers and hinders their productivity. Organizations reach greater observability when they feed quality information into their observability tooling. The flip side of this is the concept of garbage in, garbage out (GIGO). Flawed input data leads to flawed output — the results are only as good as what you put in.

To avoid GIGO, organizations can use the framework below for assessing data health in the context of observability:

  • Freshness: How recent is the data? How frequently are tables updated? Freshness allows organizations to make informed decisions. Decisions based on old data lead to waste.
  • Distribution: Does the data fall within accepted ranges? Distribution will tell you if your tables can be trusted based on what can be expected from the data.
  • Volume: Do you have all the data you expected? Volume refers to the completeness of tables and checks if any data is missing.
  • Schema: What is the schema and how has it changed recently? Changes in data architecture often indicate data issues. It is important to know when things change, who made the changes, and why they did so.
  • Lineage: What are the upstream and downstream sources affected by a data flow? Lineage clarifies who is generating the data and who is using it to make decisions.

Organizations can utilize this framework to assess data health and prevent GIGO
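As a rough sketch, the freshness, volume, and schema checks in this framework could be expressed as simple assertions over table metadata. The `TableStats` record, the 90% volume threshold, and the staleness window below are all hypothetical illustrative choices, not any particular vendor's implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical snapshot of table metadata; real platforms pull this
# from a warehouse's information schema or an observability agent.
@dataclass
class TableStats:
    last_updated: datetime        # freshness
    row_count: int                # volume
    expected_row_count: int
    columns: dict                 # schema: column name -> type
    expected_columns: dict

def health_issues(stats: TableStats, max_age: timedelta) -> list[str]:
    """Return a list of data-health problems, empty if the table looks healthy."""
    issues = []
    # Freshness: stale tables lead to decisions based on old data.
    if datetime.now() - stats.last_updated > max_age:
        issues.append("stale: table not updated recently")
    # Volume: missing rows suggest an upstream pipeline dropped data.
    if stats.row_count < 0.9 * stats.expected_row_count:
        issues.append("incomplete: row count below expected volume")
    # Schema: unannounced column changes often signal data issues.
    if stats.columns != stats.expected_columns:
        issues.append("schema drift: columns differ from expectation")
    return issues
```

Distribution and lineage checks would follow the same pattern, comparing observed values and upstream/downstream dependencies against expectations.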

A good observability solution will work with your organization’s existing tools and input data. A great observability platform connects to existing tools and ingests data quickly and seamlessly. Quality data makes AIOps, or AI-assisted IT operations, possible. AIOps involves the use of advanced machine learning models that analyze data in order to automate tedious, labor-intensive tasks. AIOps technology helps teams detect anomalies and perform root cause analyses — but most valuably, it can automatically suggest and apply fixes. AIOps and observability are closely related and provide maximum visibility when used together. In fact, AIOps tools are included as part of many observability solutions on the market. AIOps capabilities are essential in helping teams find the right questions to ask — bringing them from unknown unknowns to known unknowns.
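The anomaly-detection piece of AIOps can be as simple as flagging metric values that deviate sharply from recent history. A minimal sketch of that idea using a rolling z-score — the window size and threshold are arbitrary illustrative choices, not any vendor's algorithm:

```python
from collections import deque
from statistics import mean, stdev

def zscore_anomalies(values, window=20, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # A point far outside recent variation is a candidate anomaly.
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(v)
    return anomalies
```

Production AIOps models are far more sophisticated — correlating across services and learning seasonality — but the underlying question is the same: does this output look different from what the system normally produces?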

Resources should not be spent on reconfiguration and maintenance within your ecosystem to fit a new observability solution. Finding the right observability fit requires organizations to be honest and self-aware. Asking questions about current practices and needs is the best route to finding the right observability platform and incorporating it successfully.

Data cardinality and dimensionality

Observability is most effective when data in a system is high in cardinality and dimensionality. Cardinality refers to the number of distinct values in a set. Low-cardinality fields or dimensions are those with few possible values. For example, take a dataset with information on all the books in a public library. A dimension like “print_format” will only have possible values of hardcover, paperback, or digital, making it low-cardinality. Dimensions like “number_of_pages” or “year_of_publication” have much higher cardinality, while a dimension like “ISBN” has the highest cardinality because it is the unique identifier for any given book.
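Cardinality is just the count of distinct values a field takes across a dataset. A quick sketch using a few hypothetical rows from the library example above:

```python
# Hypothetical rows from the library dataset described above.
books = [
    {"print_format": "hardcover", "year_of_publication": 1999, "ISBN": "978-0-13-468599-1"},
    {"print_format": "paperback", "year_of_publication": 2008, "ISBN": "978-0-59-651774-8"},
    {"print_format": "paperback", "year_of_publication": 1999, "ISBN": "978-1-49-197268-2"},
]

def cardinality(rows, field):
    """Number of distinct values a field takes across the dataset."""
    return len({row[field] for row in rows})

# ISBN is unique per book, so its cardinality equals the row count;
# print_format collapses to only a handful of values.
for f in ("print_format", "year_of_publication", "ISBN"):
    print(f, cardinality(books, f))
```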

The concept of high-dimensionality is slightly less intuitive. The New Stack explains it well through the lens of observability:

“High-dimensionality” is the sibling of high-cardinality. Think of it this way: the wide structured events that observability is built on are made up of lots of key-value pairs, and cardinality refers to the values (how many of them are you allowed to have per key), while dimensionality refers to the keys (how many of them are you allowed to have per event).
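In code terms, a wide structured event is one map of many key-value pairs: dimensionality is the number of keys per event, while cardinality is how many distinct values each key takes across all events. A hypothetical per-request event (field names invented for illustration):

```python
# One hypothetical wide event emitted per request; real events
# often carry hundreds of such fields.
event = {
    "timestamp": "2022-07-05T12:00:00Z",
    "service": "checkout",
    "endpoint": "/cart/submit",
    "status_code": 200,
    "duration_ms": 87,
    "user_id": "u-48213",        # high-cardinality value
    "build_id": "2022.07.04-3",
    "region": "eu-west-1",       # low-cardinality value
}

# Dimensionality: how many keys this event carries.
print(len(event))
```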

The combination of high cardinality and high dimensionality can overwhelm monitoring systems and make data much more costly to store. Organizations whose data is not high in cardinality and dimensionality can meet their needs through monitoring and do not require observability tools.

Final Thoughts

The observability space is evolving and growing as it matures. There is plenty of debate about the line between monitoring and observability and what observability’s core tenets actually are. Within the market, different vendors provide different definitions of observability based on their current capabilities. As it stands, much of the information online about observability is marketing in disguise. Objectivity is scarce, so remember to perform your research with healthy skepticism. That being said, major observability players include Splunk, Elastic, New Relic, and Dynatrace, along with a host of younger challengers.

Whatever your stance, the discussion around observability and monitoring should not be treated as an “either/or.” While they are distinct concepts, observability does not replace monitoring. Observability complements and augments monitoring. Ultimately, organizations thinking about observability should determine if monitoring is fully meeting their needs and go from there.

If understanding the state of a distributed system is a challenge and monitoring is not helping solve problems quickly, observability might be necessary. If not, adding new observability tools is not the best use of time or resources.

For any questions or comments on the analysis above — please contact:

Brian Gastwirth, Cybersecurity Consultant: Brian.Gastwirth@DayBlink.com

Jacob Armijo, Cybersecurity Manager: Jacob.Armijo@DayBlink.com

Justin Whitaker, Partner: Justin.Whitaker@DayBlink.com

References

(1) Amy-Vogt, Betsy. “Observability trends evolve as market must tackle cybersecurity with automation.” SiliconANGLE, 6 April 2022, https://siliconangle.com/2022/04/06/observability-trends-evolve-market-must-tackle-cybersecurity-automation-stormforgeseries/

(2) Birdsall, Randy. “Log4j vulnerability highlights the value of a combined security and observability approach.” AppDynamics, 15 February 2022, https://www.appdynamics.com/blog/security/log4j-vulnerability-highlights-the-value-of-a-combined-security-and-observability-approach/

(3) “DevOps measurement: Monitoring and observability.” Google Cloud, https://cloud.google.com/architecture/devops/devops-measurement-monitoring-and-observability

(4) Garg, Saurabh, and Kumar Jagannathan. “Improve your application availability with AWS observability solutions | Amazon Web Services.” Amazon AWS, 24 September 2021, https://aws.amazon.com/blogs/mt/improve-your-application-availability-with-aws-observability-solutions/

(5) Miranda, George. “The State of Observability in 2021.” Honeycomb.io, 16 June 2021, https://www.honeycomb.io/blog/state-of-observability-2021/

(6) Morgan, Savannah. “How Observability Helps Troubleshoot Incidents Faster — The New Stack.” The New Stack, 30 March 2022, https://thenewstack.io/how-observability-helps-troubleshoot-incidents-faster/

(7) Moses, Barr. “What Is Data Observability? 5 Pillars You Need To Know.” Monte Carlo Data, 31 March 2022, https://www.montecarlodata.com/blog-what-is-data-observability/

(8) Sigelman, Ben. “How deep systems broke observability.” Lightstep, 14 October 2019, https://lightstep.com/blog/how-deep-systems-broke-observability-and-what-we-can-do-about-it?utm_source=thenewstack&utm_medium=website&utm_campaign=platform

(9) Sigelman, Ben. “Observability will never replace Monitoring (… because it shouldn’t).” Medium, 26 March 2021, https://medium.com/lightstephq/observability-will-never-replace-monitoring-because-it-shouldnt-eeea92c4c5c9
