Observability n.0?

Sahil Patwa
Published in The Thesis
9 min read · Jan 22, 2024


Source: DALL-E

Introduction

Software Observability is serious business. The average cost of downtime for large companies often runs upwards of $5 million/hour; Delta famously lost $150 million during just five hours of downtime. No wonder SaaS giants like DataDog and Dynatrace bring in billions of dollars of revenue each year! However, not all is rosy in the land of Observability.

Almost every day, there is a new Twitter/X thread about engineering teams unhappy with their Observability stack.

For starters, it is expensive. Companies sometimes spend as much on Observability tooling as they do on the underlying software infrastructure they are observing! There is, of course, the legendary thread about DataDog sending a crypto exchange a bill of $65M; but even in general, annual Observability bills running into a few million dollars are not abnormal.

What’s more, despite the expensive tooling, the experience of DevOps teams and SREs is complex, clunky, and often fragmented across multiple tools and panes. Users also need to invest significant effort in learning each tool’s proprietary query language (QL).

So, is this space ripe for disruption? Could new challengers entering the market gain sizable share from incumbents? Can this space support a few more unicorns, and what would a future winner in Observability look like today?

In this article, I explore answers to these questions.

Note: this article focuses only on Observability of software infrastructure and applications; it does not cover Data Observability or MLOps Observability.

Key components of the Observability stack

The diagram below outlines the broad segments within Observability. The stack begins with monitoring the performance of applications and of the underlying network and infrastructure that host them; it then extends into alerting and incident management in case of anomalies, and finally into communicating status to your stakeholders.

These systems create four categories of Observability data, roughly classified as Metrics, Events, Logs, and Traces (MELT).

  1. Metrics are quantifiable measurements providing an indication of the health of a system. Because they are aggregated data, they are often the fastest and cheapest way to understand how the system is performing.
  2. Events describe all information about what it took for a service to perform a particular job. They are often confused with Logs, but logs are usually only a portion of events. A group of logs could compose an event.
  3. Logs provide a descriptive record of the system’s behavior at a given time, serving as an essential tool for debugging. They are essentially lines of text a system produces when certain code gets executed.
  4. A Trace is the complete journey of a request or workflow as it moves from one part of the system to another. It is achieved by propagating a standard trace ID with the request/action as it flows through all the hops.

This article by ChaosSearch is a good “explained simply” resource for a bit more detail on MELT.
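To make the four MELT data types concrete, here is a minimal Python sketch of a request handler that produces all four. Every name in it (the handler, the field names) is illustrative and not taken from any vendor’s SDK:

```python
import json
import time
import uuid

def handle_request(path: str) -> dict:
    # Trace: one ID is generated at the edge and would be propagated
    # with the request through every downstream hop.
    trace_id = uuid.uuid4().hex
    start = time.monotonic()

    # Log: a descriptive line of text emitted as the code executes,
    # tagged with the trace ID so it can be correlated later.
    log_line = json.dumps(
        {"level": "info", "msg": "handling " + path, "trace_id": trace_id}
    )

    # Metric: a cheap, aggregatable measurement of system health.
    latency_ms = (time.monotonic() - start) * 1000
    metric = {"name": "request_latency_ms", "value": latency_ms}

    # Event: everything it took to perform this job -- note that the
    # logs are only a portion of the event, as described above.
    return {"trace_id": trace_id, "path": path, "logs": [log_line], "metrics": [metric]}

event = handle_request("/checkout")
```

In a real system the metric would be aggregated server-side and the trace ID would be carried in request headers across services, but the relationship between the four types is the same.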

Market Size

TAM estimates for Observability range between $20B and $40B, of which the top public companies already account for ~$8B in revenue. The top companies in this space have been growing at 20%+ YoY, even at upwards of a billion dollars in revenue.

Market Map and Key Players in the space

There are several ways to segment the players: scope of product offering, open/closed source, SMB/mid-market/enterprise etc.

In the market map below, I’ve segmented them based on their product offering. It’s noteworthy that most of the players in the map already offer a full suite of products:

(Tail)Winds of Change

The Observability space has become interesting to me because of the confluence of five key trends:

  1. Consolidation of observability tooling: Companies are moving toward bundled Observability platforms rather than several point solutions. DataDog’s impressive 30%+ growth in Q3’23 points to this trend in the short term, but I believe that even in the long term, companies will look to maintain an Observability stack of 1–3 solutions. This shift is currently driven by the ballooning cost of Observability tooling, but another important reason is the difficulty of maintaining a fragmented Observability stack, which ultimately leads to high Mean Time to Detect/Remediate (MTTD/MTTR). ~60% of the companies I spoke to are looking to consolidate at least some part of their stack. New companies might have an edge here if they can provide this end-to-end offering without the incumbents’ hefty price tags.
  2. The rise of completely cloud-native tech-stacks: Many companies that began in the past 8–10 years have built completely cloud-native tech-stacks, based on containers and microservices architecture. These companies require a very different observability solution compared to tools built for the on-prem, or hybrid cloud world.
    a) Complexity: From thousands of VMs to millions of ephemeral containers, the complexity of infrastructure and applications to be monitored has exploded
    b) Cardinality: This increased scale and complexity also produces higher cardinality data. Observability solutions now need databases which can handle this data effectively as well as cost-efficiently
    c) Flexibility: Developers now require greater choice and control over which metrics/data they collect
  3. OpenTelemetry’s coming of age: OpenTelemetry (OTel) is an open standard for telemetry data, enabling customers to use the same underlying data and structures regardless of which Observability solutions consume them. Further, OTel agents can direct this standardized data to any destination. Because of this, OTel helps companies avoid getting locked into a single Observability vendor, and it is becoming quite popular as a result. Tracing especially is seeing rapid OTel adoption: it is the newest of the data types and hence carries the least historical baggage, and it fundamentally requires the same standards to be applied across all microservices. While we have been hearing of the promise of OTel for a few years, I believe it is now at its inflection point: client libraries across languages are now instrumented, collection agents are mostly available, and there is a push from both customers and vendors, who are contributing to the standard and providing support for it. Yes, there are still some issues with it, but 40% of the companies I spoke to expressed an interest in adopting it, and I expect this figure to only go higher.
  4. M&A activity: 2023 saw three large M&A deals in this space: Sumo Logic was acquired by Francisco Partners; six months later, New Relic was acquired by the same firm; and finally, Cisco acquired Splunk. These deals have fueled several concerns around vendor lock-in. The acquisitions again signal the need for unified Observability solutions (e.g., combining Sumo Logic’s logs with New Relic’s traces/metrics).
  5. Innovations in databases: The advent of distributed columnar databases like ClickHouse has changed the game by making ingestion and aggregation of high-cardinality, high-dimensionality data lightning fast and cost-effective. They are turbocharging the next generation of Observability tools (a significant part of Observability is the cost and speed of log/data management).
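The vendor-neutrality argument behind trend 3 can be sketched in a few lines of Python: instrumentation code depends only on an abstract exporter interface, so switching backends means swapping the exporter rather than re-instrumenting every service. This is a toy illustration of the idea, not the real OpenTelemetry API; all class and function names here are invented for the example:

```python
class SpanExporter:
    """Abstract destination for finished spans (console, vendor A, vendor B, ...)."""
    def export(self, span: dict) -> None:
        raise NotImplementedError

class InMemoryExporter(SpanExporter):
    """Stand-in backend that simply collects spans in a list."""
    def __init__(self) -> None:
        self.spans: list[dict] = []

    def export(self, span: dict) -> None:
        self.spans.append(span)

def charge_card(exporter: SpanExporter, order_id: str) -> None:
    # The instrumented code never names a vendor -- it only knows the
    # exporter interface, so the destination is fully pluggable.
    span = {"name": "charge-card", "attributes": {"order.id": order_id}}
    exporter.export(span)

backend = InMemoryExporter()
charge_card(backend, "order-1234")
```

Replacing `InMemoryExporter` with an exporter that ships spans to a different vendor requires no change to `charge_card` itself; that decoupling is what makes OTel-standardized data an antidote to vendor lock-in.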

What do winning solutions look like?

  1. Enterprise focus: While software builders of any scale need Observability, I believe most problems worth solving in this space (i.e., those with significant potential for value capture) come with scale. High-cardinality data, exploding data volumes, a complex labyrinth of metrics, and the need for distributed tracing all require robust tooling at enterprise scale. Case in point: 86% of DataDog’s ARR comes from just its top ~3,100 customers (~10% of its customer base). A demonstrated ability to win over large enterprise clients (Fortune 500) is therefore a critical signal.
  2. OpenTelemetry-native: Supporting open standards is no longer a choice; it’s a must-have. I am convinced that the next breakout Observability company will be OTel-native and, with high likelihood, also open source. Buyers won’t trust you until they can look under the hood and get comfort that they could maintain the source code if your company ceased to exist. Essentially, enterprises are unlikely to replace DataDog with another DataDog-like closed system.
  3. Comprehensive offering: I believe winning solutions will need to support all of Metrics, Events, Logs, and Traces on a single pane. Not just that: ideally, the platform architecture should be able to support additional capabilities like alerting, incident management, and status pages down the road. Most enterprise buyers are demanding respite from maintaining complex integrations between multiple solutions, and companies that can provide an end-to-end solution will have a higher chance of winning. Jamin Ball describes DataDog’s expanding product suite as a revenue-resilience strategy, but for new providers it’s more than that: a comprehensive offering is imperative just to get a foot in the door. One caveat: a few companies I spoke to mentioned that they want to engage at least two vendors in their Observability stack to avoid over-dependence on one (e.g., what happens if that vendor’s infra is down?). This points toward a need to be comprehensive, yet modular, as a product suite.
  4. “Cut through the crowd”: Enterprise Observability stack migrations often happen when there’s a “trigger”: a major downtime that cost millions of dollars in revenue, a disruption that caused a big account to churn, or a $3M Observability bill. It is essential that when the trigger point does occur, you are saliently present in the consideration set as THE top-of-mind challenger (e.g., synonymous with OTel-native Observability).
  5. Migration support: 90%+ of the enterprise customers I spoke to were unhappy with their existing Observability stack, but >80% of them had not actually migrated away. Migration is a multi-quarter, resource-intensive undertaking that requires a massive change-management process. It is not just about replicating log schemas and dashboards, but also about rebuilding users’ muscle memory from scratch. (A CTO I spoke to confessed that he is very unhappy with the cost-performance equation of his existing Observability stack, but is afraid his SREs will be up in arms if they migrate to a new vendor.) Companies can improve their chances of winning a customer by making this migration process as seamless as possible. With OTel, redirecting metrics and traces to a new destination is table stakes. But can companies do more? A major part of the migration effort is recreating dashboards on the new platform: can companies provide scripts that enable one-click replication of dashboards? Can they reduce onboarding and learning effort by staying compatible with the most popular query languages and interfaces (e.g., PromQL)?
  6. Opinionated tooling with a seamless user experience: Ultimately, the goal of all Observability tooling is to identify potential irregularities that could lead to downtime before it happens and, in case of downtime, to reduce MTTD/MTTR. To that end, the SRE and on-call engineer experience matters a lot. Winning solutions will need to go beyond providing a single pane for MELT to really supercharging SREs with an easy-to-navigate, Figma-like collaborative UX. AI copilots could be an interesting add-on opportunity here (right now a good-to-have, not a must-have). Another area I see some companies working toward is automatically fixing issues rather than just detecting them. If we enter a world where OTel makes basic Observability tooling (detect/identify) commoditized, this could be a big differentiating factor.

Note: I don’t mention “cost” here because I believe it will be table stakes rather than a winning hand. New companies won’t survive unless they optimize data-storage costs (fast DBs, potentially object storage, deep research on compression) and pass those cost advantages on to customers in the form of transparent, predictable, usage-based pricing.

The Observability space is going to be incredibly exciting in the coming years, and I am keen to meet companies building in it! To chat about all things Observability, reach me at sahil@un-bound.com or on Twitter/X @sahilpatwa.


Sahil Patwa
Investor @ un-bound.com // previously @ Moonfire Ventures, Swiggy, BCG // IIT Bombay, LBS, IIM Ahmedabad