ML Observability — Hype or Here to Stay?

Ruth
Published in At the Front Line
Oct 17, 2022 · 13 min read

Machine Learning Operations (MLOps) infrastructure is evolving at a phenomenal pace, and ML Observability is becoming a critical component of the stack. But an explosion of tooling has led to a lack of consensus in the space — making it both a challenging and interesting one to explore. I hope this article will provide a useful framework for other investors who venture down the ML Observability rabbit hole.

Below, we’ll delve into:

  • What is ML Observability and why do we need it?
  • How Observability tooling has the power to unlock a market.
  • Why vendor solutions are in pole position to capture the value.
  • The supporting tailwinds and hurdles we have yet to overcome.
  • The challenges of an overload in tooling.
  • Building a moat in a noisy market.

THE CONTEXT: WHY WE NEED ML OBSERVABILITY

ML Training Data is Imperfect

The lifecycle of a Machine Learning project is familiar to many. Scope out the project, collect the relevant data, train the model and deploy that model into production (to extremely over-simplify). Deploying a model into production, however, is not the end of the road.

ML models are highly dependent on the data on which they were trained, and training data is by its very nature ‘imperfect’ — it’s bound by time, compute and label availability, and subject to bias as a result of human blind spots. Josh Wills describes this shortcoming pretty succinctly¹:

“By definition, when we’re deploying models in production, we cannot properly anticipate and specify all of the behavior that we expect to see. If we could do that, we would just write code to do it, we wouldn’t bother doing machine learning, so, the unexpected is a given”

This unexpected behavior, caused by the messy and infinite nature of real-world data, means that the performance of models out in the wild will inevitably degrade over time. Monitoring and Observability in production are therefore a necessity, not an option, and an iterative approach to ML is required.

Source: Bridging AI’s Proof-of-Concept to Production Gap — Andrew Ng

Model Performance Degrades in Production

There are a number of reasons ML models can fail in production — from Training-Prod Skew, where a model that made accurate predictions as a proof-of-concept performs poorly as soon as it’s put into production, to Degenerate Feedback Loops, where your ML system’s outputs are subsequently used as inputs to the same system, causing unintended consequences that magnify bias in the model.

I find these concepts easier to comprehend via examples so I’ve included a few simple ones below ⬇
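To make the second of these concrete, here’s a toy sketch in Python of a degenerate feedback loop in a recommender. The item names, click counts and probabilities are entirely made up; the point is simply that the system’s own outputs become its future inputs, so a negligible early advantage compounds.

```python
# Toy illustration of a degenerate feedback loop (all numbers are invented).
# The recommender always surfaces the historically most-clicked item, and only
# surfaced items can accumulate new clicks, so the model's outputs feed straight
# back into its own input data.
import random

random.seed(0)
clicks = {"item_a": 10, "item_b": 9}  # the two items start out almost identical

for _ in range(1_000):
    recommended = max(clicks, key=clicks.get)  # model output: show the 'popular' item
    if random.random() < 0.3:                  # users sometimes click what they're shown
        clicks[recommended] += 1               # ...and that click becomes new input data

print(clicks)
# item_a ends up with hundreds of clicks while item_b stays at 9: the loop has
# magnified a tiny initial difference into an apparent strong preference.
```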

Silent Model Failures Have Huge Commercial Impact

To clarify, the failure-modes listed above are all ‘ML-specific’. They differ from traditional software system failures which tend to focus more on a system’s operational metrics (eg. latency). These traditional system failures are easier to detect as generally they’ll be triggered by a breakage (eg. 404 error). A bunch of APM (application performance monitoring) companies have also emerged over the past decade to help detect these kinds of issues (eg. Datadog, New Relic).

ML models, on the other hand, tend to fail silently. They don’t throw an ‘error’ and their failures won’t be caught by traditional APM systems. Although ML-specific failures might, in some cases, actually account for a smaller portion of ML-application failures², they are arguably more dangerous because they can have a huge commercial impact. Lina Weichbrodt³ illustrates this well — she describes an incident involving a fraud detection model built for a client, where the unit of measure of an important feature changed from seconds to milliseconds. This change had a significant, immediate impact on prediction quality — fraud went undetected, causing serious commercial and reputational damage.
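A guard as simple as a range check against the values seen at training time can catch this kind of silent failure. The sketch below is hypothetical (the feature name and bounds are invented), but it shows why the failure is ‘silent’: nothing in the pipeline ever raises an exception, so only an explicit check would fire.

```python
# Hypothetical sketch: a feature that was expressed in seconds at training time
# silently switches to milliseconds upstream. Nothing errors out, so only an
# explicit comparison against the training-time range surfaces the problem.
TRAINING_RANGE = {"time_since_last_transaction": (0.0, 86_400.0)}  # seconds in a day

def check_feature(name: str, value: float) -> None:
    low, high = TRAINING_RANGE[name]
    if not (low <= value <= high):
        print(f"ALERT: {name}={value} is outside the training range [{low}, {high}]")

check_feature("time_since_last_transaction", 4_200.0)      # seconds: looks healthy
check_feature("time_since_last_transaction", 4_200_000.0)  # same event in milliseconds: flagged
```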

ML Observability as a Solution

ML Observability — the set of processes and tools required to maintain a healthy model in production — has emerged as a discipline to help ML practitioners with the challenges outlined above. There are four types of signals you can monitor to determine the health of your model, each coming with a trade-off in terms of how informative the metric is and how easy it is to obtain⁴.

Source: Full Stack Deep Learning — Deployment & Monitoring
  1. Model Metrics: Measuring prediction vs actual user outcome. These are the ‘gold standard’ of performance metrics. However, ground truth labels (ie. actual outcomes) are often absent or delayed, making these metrics challenging to measure. For most use cases, there’s a lag between prediction and outcome. This could be days or in some cases months/years. For example, consider a model that is built to predict the creditworthiness of a customer. The ground truth in this case (ie. whether the person pays back the loan or not) will only be available at the end of the loan period — years after the prediction is made. Human labeling of outcomes can also lead to quality control issues. If ground truth labels are not available, the next best signal to monitor is outputs via proxy metrics.
  2. Business Metrics: The purpose of a model is to produce some business value — as such, it is important to capture how model performance impacts business KPIs. Lina³ suggests you should track fear signals by asking business stakeholders what the worst case scenario is and converting these fears into metrics (eg. loan prediction model: stakeholder fear is unfairly rejected applications → set alert for when precision <95%)
  3. Model Inputs: Model inputs are easier to monitor, but drift in an input doesn’t necessarily mean that your model is failing. In a study conducted by Shreya Shankar⁹, one of the most common pain points among the 18 MLEs surveyed was ‘alert fatigue’ — that is, a surplus of alerts triggered when overall model performance was actually adequate. Separating the signal from the noise can be challenging. For this reason, inputs should be monitored for debugging and outputs for alerting (see the sketch after this list).
  4. System Performance: This is essentially traditional software monitoring (discussed above).
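As referenced above, here’s a minimal, hedged sketch of what output and model-metric monitoring can look like in code. It assumes you can pull a reference window and a recent window of prediction scores, and that ground truth labels eventually arrive; the thresholds (a PSI of 0.2, precision of 0.95) are illustrative rules of thumb, not universal constants.

```python
# Minimal sketch of output (proxy) monitoring plus a business-driven alert.
# Window sources, thresholds and label arrays are hypothetical stand-ins.
import numpy as np
from sklearn.metrics import precision_score

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score distributions (a common drift proxy)."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) in sparse bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Proxy metric on model outputs (the fallback while ground truth is delayed):
# compare prediction scores from a reference window against the last 24 hours.
reference_scores = np.random.beta(2, 5, size=10_000)
current_scores = np.random.beta(5, 2, size=2_000)
if psi(reference_scores, current_scores) > 0.2:  # 0.2 is a common rule-of-thumb threshold
    print("ALERT: prediction score distribution has drifted")

# Model + business metrics, once the (possibly delayed) labels arrive: track the
# 'fear signal' agreed with stakeholders, e.g. precision dropping below 0.95.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])
if precision_score(y_true, y_pred) < 0.95:
    print("ALERT: precision below the agreed business threshold")
```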

THE INSIGHTS

1. Observability Tooling Unlocks a Market

ML Observability is still an extremely nascent discipline and therefore [insert trivial Gartner market size here] is likely to be an inaccurate and irrelevant representation of the size of this market. I believe that the tooling in this space has the power to unlock a huge market over time via the following flywheel:

ML Observability helps maintain long-term model quality → more models are therefore deployed into production → this creates a greater need for Observability tools.

Although it’s clear that Observability helps maintain model quality, if that quality can’t be communicated to business teams in terms of demonstrating ROI, what use is it?

The purpose of an ML model is not to generate a top-tier accuracy score, but rather to generate some business value, whether that’s driving efficiencies or improving products. ML models are only as useful as the business value they create, and it’s clear that executives are struggling to quantify this value. A survey carried out by Arize⁵ found that 54% of ML teams report that business execs are often unable to quantify the ROI of AI. This presents a challenge in terms of justifying continued investment, particularly against the backdrop of the current macro climate.

Business KPIs are unique, and so tooling can only go so far in helping to connect them to model metrics. However, a robust Monitoring and Observability suite should enable teams to detect model issues proactively, before KPIs are adversely impacted.

2. Undifferentiated Heavy Lifting + Scarce Machine Learning Engineers (MLEs) = Vendor Opportunity

Among teams now productionizing ML models, some may still be ‘flying blind’, but most clearly understand the importance of Observability in maintaining a performant, healthy model in production. Although there is consensus that this is a critical step in the ML process, there is still very much a divergence of opinion on how it should be achieved.

In O’Reilly’s 2021 AI Adoption Report⁶, 46% of practitioners report that they have built their own tools and pipelines for deployment and monitoring as opposed to adopting a 3rd party tool. These hacked-together solutions, however, generally start to show cracks at scale, with 26% of ML practitioners admitting that it takes them one week or more to detect and fix an issue⁵.

Home-grown Observability tooling also requires a multi-disciplinary team to build and maintain. One of the greatest challenges in ML adoption is currently the talent shortage. Although the number of Machine Learning Engineers (MLE) is growing, demand is clearly outstripping supply. O’Reilly’s survey cited a lack of skilled people and difficulty hiring as the top challenge for AI adoption⁶. 3rd party vendor solutions can help plug this gap in the short term — speeding up model deployment whilst reducing the operational burden on scarce MLEs.

When every team starts building the same internal infrastructure tooling that is non-core to their business, and those with the skills to build it are in short supply, I see this as a prime opportunity for a vendor solution to eliminate the burden of undifferentiated heavy lifting.

3. We’re At The Beginning of a Long Road

MLOps (the set of practices that aims to deploy & maintain ML models in production reliably and efficiently) more broadly has attracted a lot of attention (+ VC $$) over the past 5 years. It’s certainly maturing but it will take time, education and executive buy-in to get to mainstream adoption.

  • Data Maturity as a Precursor to ML: For ML-powered products, code is becoming a commodity and data the differentiator; it is the data that dictates the value of the model. As such, mature data infrastructure and processes are a precursor to companies productionizing ML. Over the past decade, we’ve seen an explosion of companies making up the modern data stack (eg. Snowflake, dbt, Fivetran, Monte Carlo). This infrastructure, and arguably more importantly the concurrent ramp-up in focus on data and executive buy-in within businesses, illustrates that companies are on their way to understanding their data and how to extract value from it (albeit we’re still early in this process).

“Data is the new code. If you compare traditional software versus AI software, in traditional software, the lifeblood is really the code. In AI/ML, the lifeblood is really the data” — Alexandr Wang

  • An Incremental Progression to Productionizing ML: If you’ve spent any time down the MLOps rabbit hole, you’ll likely recognize the canonical stat that ‘87% of ML models never make it to production’. Since 2017, this statement has offered the perfect marketing hook to answer the ‘why now’ question for MLOps tooling. Unfortunately, this stat is not only outdated, but was in fact never rooted in any form of research⁷. So it begs the question — where are we in the maturity of MLOps adoption? Recent studies suggest that the share of companies with 50+ models in production sits around 35% (Algorithmia quotes 34%⁸ whilst Arize quotes 36%⁵). We’re certainly moving in the right direction, and the explosion of tooling across the MLOps lifecycle has aided this progress. Interestingly, these stats can still be somewhat misleading and could actually underestimate the maturity of MLOps today⁹:

“ML engineering, as a discipline, is highly experimental and iterative in nature, especially compared to typical software engineering. Contrary to popular negative sentiment around the large numbers of experiments and models that don’t make it to production, we found that it’s actually okay for experiments and models not to make it to production. What matters is making sure ideas can be prototyped and validated quickly — so that bad ones can be pruned away immediately”.

  • People + Process + Technology: The progress made to date in the world of ML has primarily been driven by the development of enabling tools, be it Data or MLOps. This infrastructure is a necessary foundation. We’re now reaching a point in time however, where a lack of tooling is no longer the main barrier to progress. The fundamental hurdle we have yet to overcome is obtaining and maintaining organization-wide buy-in. Solving the [people + process] aspect of the equation will require: 1) Educating the entire organization so they understand what it means to work with ML¹⁰ and 2) Creating an org structure that maximizes the potential for ROI on ML projects. This could mean embedding data scientists/MLEs within business teams or ensuring the voice of ML is represented at the C-Suite level.

Cultural and organizational change takes time. The slow burn of DevOps adoption over the past decade illustrates this. The abundance of MLOps tooling creates a somewhat deceptive picture of where we’re at. Bottom line — it’s early days for MLOps.

Source: Another tool won’t fix your MLOps problems — David Hershey

4. Mo’ Tools, Mo’ Problems!

Three years ago, companies that wanted to implement Observability had no choice but to build a tool from the ground up. Today, there’s a swath of vendor solutions available. This overload of tooling creates some challenges:

  • There’s a lack of consensus on which is the ‘right’ tool for the job and the high cost of evaluating these tools from the outside creates some adoption friction.
  • There’s a risk that these products become somewhat commoditized, leading to a race to the bottom on price.
  • In the current macro-climate, budget for non-critical tooling will shrink and so some consolidation of tooling is likely.
Link to an Airtable with more details. Although all companies listed offer monitoring/observability, each may have a slightly different core focus (eg. some offer explainability or continuous learning capabilities).

I’m bullish, however, that ML Observability will be considered a critical part of the stack and that some companies will endure, retaining a long-term place in the market.

5. Finding the Needle in the Haystack

In such a crowded, noisy market, the question thus becomes: how, as a vendor, can you build a moat around your business? In a panel hosted by DeepLearning.AI¹¹, Chip Huyen points out that:

“The requirement for MLOps tooling depends a lot on company size, use case and maturity”

This holds true for Observability tooling. I find it useful to break down the customer landscape into four main categories (inspiration here from Leigh-Marie Braswell¹²). Each category varies in terms of the level of tooling sophistication required, the maturity of its MLOps practices and its willingness to spend. To build a truly defensible product, companies should consider how they can build a wedge by delighting one of these segments and becoming critical ‘painkiller’ infrastructure for it.

Source: Me!

For Leaders (eg. Uber, Tesla), ML is mission-critical. They have a large number of models in production with mature MLOps practices. They require sophisticated tooling and willingness to spend is high.

For Innovators (eg. Santander, Allianz), ML is still mission-critical, but they generally have fewer use cases and therefore fewer models in production. This lack of scale means the need for Observability tooling is lower, along with willingness to spend. Legacy data and infrastructure tooling is a barrier to this category progressing from an ML-maturity standpoint. AI regulation is the main potential driver for this group to adopt tooling.

The Experimenters (eg. Sephora, Zalando) represent the ML’ification of everything. There are a large number of potential use cases, so the need for Observability tooling exists at scale. Willingness to spend is lower because, although ML improves the product, it isn’t ‘mission-critical’. An inability to attract high-caliber ML talent is a barrier to this category progressing from an ML-maturity standpoint. As more AI-native startups emerge, the baseline expectations of customers will rise, which may force companies in this category to shift and become Leaders.

The Laggards are generally not using ML and would be slow to adopt tooling.

THE CONCLUSION

The Hard Thing about the Hard Things

Maintaining a healthy model in production is hard. There’s a multitude of questions to consider — What metrics can be relied upon? What criteria should trigger alerts? How to respond fast to drops in performance? How to know when and on what data a model should be retrained?

Observability tooling will certainly augment ML practitioners in some of these areas; however, these tools shouldn’t be viewed as a magic bullet. Processes must mature in parallel, and commitment is needed from the higher echelons of the organization. Education will be critical — both internal (teaching non-technical functions) and external (sharing best practices across organizations — Demetrios Brinkmann and the MLOps Community¹³ are contributing greatly to this effort).

I’m excited to see the progress that will be made in Enterprise ML adoption in the coming years and the foundational role that Observability will play. If you’re building in this space, I’d love to hear from you at ruth@frontline.vc or @ruth_sheridan_ 👋

Thanks to Aashay, Adele, Alex, David, Fergal, Jamie, Lina, Mark, Neil, Paul, Priyanka, Sarah, Stephen for all your thoughts & feedback!
