The Future of Observability

Bogomil Balkansky
Sequoia Capital Publication
Oct 13, 2020 · 6 min read

One of the luxuries of my job as a VC is the opportunity to meet exceptional people and to reconnect with old friends. I recently met with a friend who is an engineering leader at one of the high-flying tech companies in San Francisco, and I asked them one simple question: “What’s difficult in your world?” Without skipping a beat their answer was “Observability is a s*** show.” (“Observability” is a set of tools for tracking the health of software environments, and for troubleshooting when things go wrong.)

These words fell on eager ears; I have been tracking the observability space for a few years. Some of my good friends from VMware built SignalFx, and I was an angel investor in Omnition; both companies were acquired by Splunk last year. And at Sequoia we have partnered with Sumo Logic, Lightstep and Wavefront (acquired by VMware).

The bear view on observability is that it will be a quiet space for some time following the recent wave of value creation (Datadog going public, acquisitions by Splunk, Datadog, VMware and others). But I beg to differ. I believe observability is an evergreen area whose collective revenue will only grow over time as subsequent waves of value creation emerge. There are two drivers behind my hypothesis:

  1. Every company has become a software company, and observability is how you keep that software on track. The former CEO of VMware — the incomparable Paul Maritz — used to say that stuff only gets into IT, it never gets out. IT environments are becoming ever more complex, which means that more stuff needs to be monitored.
  2. Despite all the tools available, troubleshooting is incredibly hard. You will never meet an SRE or DevOps person who will tell you that troubleshooting is easy and under control. This means that a company with a better approach in some aspect of observability has a chance of building a sizable business.

Every platform shift — whether at the hardware, infrastructure software or application layer — creates the need and the opportunity to rethink, and likely merge, the observability pillars of monitoring, logging, and tracing. The shifts going on today have already given rise to new types of observability tools:

The shift to microservices created the conditions for companies like SignalFx, Omnition and Lightstep that specialize in monitoring the new generation of cloud-native applications based on microservices. The recent emergence and success of OpenTelemetry makes it easier to instrument apps to collect more data. The combination of an ever-growing number of microservices and the ease of instrumenting apps causes the amount of observability data to explode, which requires new ways of storing and processing it. New time series databases like M3, TimescaleDB and InfluxDB are some of the new ways to tackle that challenge. Cribl is a company pioneering another innovative approach to dealing with the deluge of observability data: the observability pipeline, which decouples the collection of data from its ingestion into various destinations.

This approach is similar to how Segment.io centralized the collection and transformation of all customer data from any source so that it can be consumed by any destination — be it another operational or analytical system. Cribl is collecting, filtering and enriching data from the various sources, and making intelligent judgments where data is best stored depending on its value.
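The pipeline idea described above can be sketched in a few lines: events are collected once, then enriched and routed to different destinations based on their value. This is a minimal illustration of the decoupling concept, not any vendor's actual API; all names (`route_event`, the destination labels, the routing rule) are invented for the example.

```python
# Hypothetical observability-pipeline sketch: collect once, then filter,
# enrich, and route each event to a destination chosen by its value.

def enrich(event):
    """Attach metadata that downstream consumers expect (illustrative)."""
    event = dict(event)
    event["env"] = event.get("env", "prod")
    return event

def route_event(event):
    """Send high-value events to the (expensive) search index,
    the rest to cheap object storage for later replay."""
    event = enrich(event)
    if event["level"] in ("error", "warn"):
        return ("search-index", event)    # full-text searchable, costly
    return ("object-storage", event)      # cheap archival, replayable

events = [
    {"level": "info",  "msg": "health check ok"},
    {"level": "error", "msg": "payment service timeout"},
]
for dest, ev in (route_event(e) for e in events):
    print(dest, ev["msg"])
```

The design choice to separate collection from routing is what lets the same raw stream feed a costly index, a cheap archive, or both, without re-instrumenting the applications.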

My crystal ball says that the most interesting next-generation opportunities in observability will be in one of these areas:

1/ Deep integration with CI/CD

New code merges are by far the most frequent source of trouble. Admins have a plethora of tools to monitor infrastructure and application performance, but failures most often stem from shipping problematic code or data into production: a regression, a software bug, a bad config change, or a data experiment that should not have been shipped. Integrating observability tools with CI/CD and chaos engineering tools would make it much easier to measure and debug software.

Some observability tools already provide information on how code-merge events correlate with application performance. We partnered with Lightstep, which is very active in this area; much of their future roadmap centers on visibility into software deployments. The next evolution of this approach will be to identify exactly which code was problematic. Meanwhile, GitHub and others are working on semantic code understanding. These efforts are currently focused on code search and code security, but if and when semantic code understanding matures, it will open very powerful possibilities: understanding which particular piece of newly merged code caused a regression or performance degradation. That development would get us closer to finding the proverbial needle in the haystack, or to very quickly identifying what needs to be fixed when something breaks.
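The deploy-correlation idea above can be made concrete with a toy example: given a list of deploy timestamps and an error-rate time series, find the most recent deploy before the metric first crossed a threshold. The function names, the threshold, and the data are all invented for illustration; real tools do this with richer statistics.

```python
# Toy sketch: correlate deploys with a metric regression.

def first_breach(series, threshold):
    """Return the timestamp at which the metric first exceeds the threshold."""
    for ts, value in series:
        if value > threshold:
            return ts
    return None

def suspect_deploy(deploys, series, threshold):
    """Return the commit of the most recent deploy before the breach."""
    breach = first_breach(series, threshold)
    if breach is None:
        return None
    candidates = [(ts, sha) for ts, sha in deploys if ts <= breach]
    return max(candidates)[1] if candidates else None

deploys = [(100, "a1b2c3"), (220, "d4e5f6")]                    # (unix time, commit)
error_rate = [(90, 0.01), (150, 0.012), (230, 0.19), (240, 0.21)]
print(suspect_deploy(deploys, error_rate, threshold=0.05))      # -> d4e5f6
```

Semantic code understanding would go one level deeper than this timestamp join, pointing not at a deploy but at the specific changed function inside it.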

2/ Data observability

Many modern apps are increasingly data-driven and dependent on machine learning (ML) models. Examples of such data- and ML-driven apps are e-commerce recommendation engines, social-network feeds, credit-scoring applications, etc. ML models depend on massive data pipelines, which need to be monitored to ensure that data is flowing through them correctly. In addition, the ML models themselves need to be monitored to ensure they are still doing what they are supposed to do — e.g., still producing valid recommendations.
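One simple form of the model monitoring described above is a drift check: compare the recent distribution of a model's scores against a training-time baseline and alert when the mean strays too far. The 3-standard-error rule and all the data below are illustrative choices, not an industry standard.

```python
# Minimal model-output drift check: flag when the recent mean of a
# model's scores strays from the baseline mean by more than `sigmas`
# standard errors. Illustrative only.
import statistics

def mean_drifted(baseline, recent, sigmas=3.0):
    mu = statistics.mean(baseline)
    stderr = statistics.stdev(baseline) / (len(recent) ** 0.5)
    return abs(statistics.mean(recent) - mu) > sigmas * stderr

baseline_scores = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
healthy = [0.50, 0.49, 0.51, 0.52]
shifted = [0.80, 0.82, 0.79, 0.81]
print(mean_drifted(baseline_scores, healthy))   # -> False
print(mean_drifted(baseline_scores, shifted))   # -> True
```

Production systems track many such signals at once (input schema, feature distributions, prediction distributions, downstream business outcomes), but the shape of the check is the same: a baseline, a recent window, and an alert rule.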

3/ End-to-end unification

The hardest, but potentially most valuable, opportunity in observability would be to connect the dots up and down the stack — from the business analytics tool all the way down to the lowest-level infrastructure — in order to answer the question "what happened?" Consider this familiar scenario: the CFO comes to work in the morning and sees that a key business metric dropped 2% overnight. They send an e-mail to a (random) list of people asking what's happening, which sets off a flurry of activity. Three out of five times the negative trend reverses itself and no one ever learns what happened or why. But two out of five times the trend persists, and it takes days or weeks to get to the bottom of it. There are only three to five people in the company with enough institutional knowledge of the systems, processes and people to do so. They go around to different teams in an ad hoc manner, have many conversations, formulate hypotheses, and comb through a myriad of monitoring dashboards to find the data that proves or disproves them.

The observability tool everyone needs in this scenario starts with the business-level metrics (from whatever BI dashboard sits on the CEO's or CFO's screen) and traces how those metrics are affected by application code and the various tiers of infrastructure.

One of the foundational changes that needs to take place in order to enable end-to-end observability is to unify the observability data and the business data. Currently these are two completely separate worlds: the observability data lives in Splunk, Sumo Logic, or M3, while the business data lives in a data warehouse on prem (Teradata) or in the cloud (Snowflake, Redshift, BigQuery). As a result, it is very hard to tell if a change in a business metric was caused by a technical problem or by some shift in the business.
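The unification argument above boils down to a join that today almost never happens: put the business metric and the observability metric on the same timeline so a dip in one can be read against a spike in the other. The schemas, numbers, and function name below are invented to illustrate the idea.

```python
# Toy sketch of unifying the two data worlds: join orders per minute
# (as a warehouse might hold it) with error rate per minute (as an
# observability store might hold it) on timestamp.

orders_per_min = {1000: 120, 1001: 118, 1002: 64, 1003: 61}        # business data
error_rate     = {1000: 0.01, 1001: 0.01, 1002: 0.22, 1003: 0.25}  # observability data

def joined_view(business, technical):
    """One row per minute with both metrics side by side."""
    return [(ts, business[ts], technical.get(ts)) for ts in sorted(business)]

for ts, orders, errs in joined_view(orders_per_min, error_rate):
    flag = " <- orders dip coincides with error spike" if errs and errs > 0.1 else ""
    print(ts, orders, errs, flag)
```

In this toy data, the orders drop at minute 1002 lines up with an error-rate spike, which is exactly the kind of answer the CFO scenario above needs in seconds rather than weeks.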

The next big challenge would be to unify the separate worlds of observability and business analytics tools — at the end of the day they are all about slicing and dicing data in a visual way in order to understand it. I hope in the coming years someone will take on the daring challenge of unifying these two domains.


Partner at @Sequoia investing in enterprise software. 20+ yrs product and marketing leadership @VMware, @GoogleCloud. Diver, cook, photographer.