Delivering machine learning at scale: Lessons from TWIMLCon 2021

Kyle Hannon
Acrisure Technology Group
Mar 1, 2021 · 8 min read

In January, the AI Engineering team at Acrisure Technology Group attended TWIMLCon, a leading MLOps and enterprise AI virtual conference. Speakers from AI leaders such as Spotify, Google, and Netflix, as well as MLOps vendors Tecton and Fiddler, shared their insights on building and deploying AI systems efficiently and at scale. As an AI-first company that is building a business around the concept of an “AI Factory,” we found the conference highly informative.

Here are our three favorite talks, along with key takeaways.

How Spotify does ML at scale

Spotify led a massive shift in a “sleepy” industry (music distribution) and continues to leverage machine learning extensively to improve its products. Josh Baer, Product Lead for Spotify’s ML platform, and Aman Khan, a Product Manager at Spotify, presented on building an ML platform that serves 50 distinct teams, trains 82 models every day, and handles 300k prediction requests every second. They built this with 30 people over 3–4 years.

The problem they identified was that ML researchers and engineers spend 75% of their time learning how a certain product feature was built, understanding measurements, and then fighting with Bigtable. This leaves 25% (or less) of their time to actually do the “fun” part: measurably improving the user experience. In an ideal world, the ML platform team abstracts away parts of the system and makes it easy to build impactful, reliable ML products. The core metric for the ML platform team is “number of model iterations per week,” which they call “ML Productivity.”

Overview of Spotify’s ML Platform

In the process of building their platform, Spotify’s product team learned three key things:

  1. Build infrastructure alongside the people who will use it. In other words, solve the correct problem. Spotify has a dedicated ML engagement team to educate, train, and consult with other teams on how to best use their platform. This has helped many people familiarize themselves with the ML Platform and iterate more quickly. Spotify would rather have a team innovate and build something outside their platform and then productize it than the other way around.
  2. Be opinionated. This creates a more cohesive experience.
  3. Make difficult tradeoffs. For instance, Spotify’s Product Team optimized for the production use case, and let ML engineers handle R&D themselves.

Machine learning experiments at Yelp

In this talk, Justin Norman, VP of Data Science and Data Products at Yelp, shared insights on how an established, data-focused company makes decisions at scale.

The talk made a key distinction between testing and experimentation. In Yelp’s terminology, “experimentation is used to test a hypothesis or make a discovery, where testing is what’s done before widespread use.”

Testing

At a high level, Yelp leverages off-the-shelf solutions whenever possible for their ML Platform, as there is already a mature toolset in the open source community.

Yelp’s machine learning (ML) testing stack

Yelp’s testing stack captures the key stages of the ML development cycle. They claim that:

“ML Developers need to:

  • Create a snapshot of model code, dependencies, and configuration necessary to train the model.
  • Keep track of the current and past versions of data, associated images and any other artifacts.
  • Build and execute each training run in an isolated environment (prob. container).
  • Track specified model metrics, performance, and model artifacts.
  • Inspect, compare and evaluate prior models.”
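These requirements map closely to what general-purpose experiment trackers provide out of the box. As a hedged illustration (not Yelp’s actual stack), here is a minimal sketch using the open-source MLflow tracking API to record configuration, code and data versions, metrics, and the trained model for a single run. The dataset, tags, and paths are hypothetical, and the “isolated environment” requirement would be handled by the surrounding build system (e.g., a container image) rather than by this code.

```python
# A minimal run-tracking sketch with MLflow; dataset, tags, and paths are hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

params = {"n_estimators": 200, "max_depth": 8}

with mlflow.start_run(run_name="example_ranker_v2"):
    mlflow.log_params(params)                                          # configuration
    mlflow.set_tag("git_commit", "abc1234")                            # code version (hypothetical)
    mlflow.set_tag("training_data", "s3://example-bucket/2021-02-01")  # data version (hypothetical)

    model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)

    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", val_auc)                              # tracked metric
    mlflow.sklearn.log_model(model, "model")                           # model artifact for later comparison
```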

Experimentation

Yelp has a rigorous process that begins by defining an experiment, continues to production testing of the hypothesis, and ultimately links outcomes to business-level KPIs. Yelp defines two primary experiment types, based on “how the experiment will impact the ship/no ship decision”:

  • Multivariate Experiment (a.k.a. A/B): “I expect the feature to have an impact on our users — I’ll ship in the event of success and won’t ship in the event of failure.” (most common)
  • Roll-out: “I do not expect the feature to have an impact on our users, but I’ll ship either way as long as I don’t break anything major.” (rarer)

Both experiment types are run through the same framework, which Yelp calls Bunsen. Bunsen tracks the experiment through its lifecycle, determining how frequently to serve new models based on previous performance and user context. Unfortunately, Bunsen is not available publicly, and it’s unclear if Yelp plans to make the tool open-source in the future.

To ensure models are maximizing business KPIs, all experiments are run based on hypotheses using a standard form: “If we [build X], it will provide [Y meaningful/measurable change].” The hypothesis and measurement metrics are developed in collaboration between product managers and data scientists.

  • “Product managers choose decision metrics: Your primary metric should be stated in your hypothesis. Secondary metrics are other non-guardrail metrics of interest.
  • Data scientists consult and sign-off on metrics: Kick off the conversation with feature developers to articulate the data needed for metric computation and understand which events will be logged.”

Once the hypothesis is defined and the experiment launched, Bunsen manages the experiment, outputs a scorecard reporting metrics (e.g., model lift), verifies the model doesn’t break any internal policies, and compares the results to other experiments. Bunsen uses multi-armed bandits to algorithmically determine how often a given model is served over the course of an experiment.
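Bunsen’s internals are not public, so the multi-armed bandit idea is easiest to show with a hedged sketch: Thompson sampling over Beta posteriors, where each arm is a model variant and the reward is a binary conversion. The variant names and simulated feedback below are hypothetical and do not reflect Bunsen’s actual interface.

```python
import random

class ThompsonSamplingRouter:
    """Allocate traffic between model variants using Thompson sampling.

    Each variant keeps a Beta(successes + 1, failures + 1) posterior over
    its conversion rate; better-performing variants are served more often.
    """

    def __init__(self, variants):
        self.stats = {v: {"successes": 0, "failures": 0} for v in variants}

    def choose_variant(self):
        # Sample a plausible conversion rate for each variant and pick the best.
        samples = {
            v: random.betavariate(s["successes"] + 1, s["failures"] + 1)
            for v, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def record_outcome(self, variant, converted):
        key = "successes" if converted else "failures"
        self.stats[variant][key] += 1

# Hypothetical usage: route requests between a control and a candidate model.
router = ThompsonSamplingRouter(["model_control", "model_candidate"])
for _ in range(1000):
    variant = router.choose_variant()
    # Simulated user feedback; in practice this comes from logged outcomes.
    converted = random.random() < (0.12 if variant == "model_candidate" else 0.10)
    router.record_outcome(variant, converted)
print(router.stats)
```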

Yelp has a disciplined approach to model testing and experimentation. Their data-first philosophy and their experimentation framework, Bunsen, allow them to run hundreds of experiments simultaneously and to select only the new models and features that positively impact the bottom line.

Using Fiddler for monitoring AI

Fiddler is a tool that “continuously monitor[s] the key operational challenges in AI: data drift, outliers and model decay.” After highlighting the most error-prone components in deployed AI systems (data, features, models, and business impact), the Fiddler team discussed how a degradation in any of these components can be monitored and addressed.

Data

The two primary threats to deployed AI systems from data are “data drift” and “data bias.” Fiddler addresses these threats by monitoring data distributions, triggering alerts when observed values in out-of-sample data begin to violate expectations based on in-sample distributions.
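Fiddler’s implementation details weren’t shown, but the core idea of comparing in-sample (training) distributions against out-of-sample (production) data can be sketched with the Population Stability Index (PSI), one of the drift metrics mentioned later in this post. The feature values and the 0.2 alert threshold below are illustrative (a common rule of thumb), not a Fiddler recommendation, and the same check applies equally to transformed features and to model predictions, as the next sections discuss.

```python
import numpy as np

def population_stability_index(expected, observed, n_bins=10):
    """PSI between a reference (training) sample and a production sample.

    Bins come from the reference distribution's quantiles; a small epsilon
    keeps the log well-defined when a bin is empty.
    """
    eps = 1e-6
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    observed = np.clip(observed, edges[0], edges[-1])  # fold out-of-range values into edge bins
    expected_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    observed_frac = np.histogram(observed, bins=edges)[0] / len(observed) + eps
    return float(np.sum((observed_frac - expected_frac) * np.log(observed_frac / expected_frac)))

# Hypothetical feature: training data vs. a drifted production batch.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
prod_feature = rng.normal(loc=0.5, scale=1.2, size=2_000)

psi = population_stability_index(train_feature, prod_feature)
if psi > 0.2:  # a common rule-of-thumb alert threshold, not a Fiddler-specific value
    print(f"ALERT: feature drift detected (PSI={psi:.3f})")
```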

Features

Feature processing can pose a threat to deployed AI systems if data transformations behave unexpectedly. The Fiddler team recommends monitoring distributions of the transformed feature values in addition to the raw input data. This will identify unexpected behavior if it occurs anywhere in an AI pipeline. Fiddler points out that “data pipeline issues” can cause problems downstream, but that these issues are typically addressed by monitoring (e.g., schema changes to input tables), or by including metrics related to nullness in the data-distribution monitoring.
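As a hedged sketch of the “nullness” idea, one can track per-column null rates in each production batch of transformed features and alert when they deviate meaningfully from the rates seen at training time. The column names and tolerance below are hypothetical.

```python
import pandas as pd

def null_rate_alerts(reference: pd.DataFrame, batch: pd.DataFrame, tolerance: float = 0.05):
    """Flag columns whose null rate in a production batch exceeds the
    training-time null rate by more than `tolerance` (absolute)."""
    ref_rates = reference.isna().mean()
    batch_rates = batch.isna().mean()
    return {
        col: {"train_null_rate": float(ref_rates[col]),
              "batch_null_rate": float(batch_rates.get(col, 1.0))}
        for col in reference.columns
        if batch_rates.get(col, 1.0) - ref_rates[col] > tolerance
    }

# Hypothetical transformed-feature batches (a missing column also triggers an alert).
reference = pd.DataFrame({"age_scaled": [0.1, 0.4, None, 0.9], "region_idx": [1, 2, 3, 1]})
batch = pd.DataFrame({"age_scaled": [None, None, 0.2, None], "region_idx": [2, 1, 1, 3]})
print(null_rate_alerts(reference, batch))
```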

Models

Model performance decay is the best indicator of issues like concept drift. However, gathering labels introduces a delay, and by the time poor performance is detected, it may have already caused damage. Monitoring the predicted values themselves can provide an early indication of possible upcoming performance decay. It is important to note that drift in predicted values does not necessarily mean that something is wrong, as changes may be caused by seasonality or other expected data changes. Further “drill-down analyses” are available in Fiddler to determine whether the causes of a prediction shift are likely to harm the model’s performance.

Model bias can be partly mitigated during the process of building a model. However, the data may change once the model is deployed, resurfacing earlier issues. This can be addressed by performing cross-sectional analyses of model performance or model predictions and monitoring those cross-sections over time.
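A minimal sketch of such a cross-sectional analysis, assuming a hypothetical segment column and scored predictions with ground-truth labels: compute the same performance metric per segment and watch for large gaps, or for a single segment degrading over time.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def performance_by_segment(scored: pd.DataFrame, segment_col: str = "customer_segment") -> pd.DataFrame:
    """Compute AUC per cross-section; large gaps between segments, or a single
    segment degrading over time, can indicate bias or localized drift."""
    rows = []
    for segment, group in scored.groupby(segment_col):
        if group["label"].nunique() < 2:
            continue  # AUC is undefined for a single-class slice
        rows.append({"segment": segment, "n": len(group),
                     "auc": roc_auc_score(group["label"], group["score"])})
    return pd.DataFrame(rows)

# Hypothetical scored predictions with ground-truth labels and a segment column.
scored = pd.DataFrame({
    "customer_segment": ["small_biz"] * 4 + ["enterprise"] * 4,
    "label": [0, 1, 0, 1, 0, 1, 1, 0],
    "score": [0.2, 0.9, 0.3, 0.7, 0.6, 0.5, 0.4, 0.8],
})
print(performance_by_segment(scored))
```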

Business impact

In most instances, it is the degradation of a model’s performance that poses a threat to business impact, rather than the other way around. However, model-performance metrics can also fall out of sync with the business metrics they are meant to drive. In that case, even without an observable change in any of the technical metrics discussed above (data, features, model predictions, or performance), a model may still fail to deliver adequate business value. This underscores the need to track the relationship between model performance and an associated business KPI.
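One hedged way to track that relationship is to log an offline model metric and a business KPI on the same cadence and watch their rolling correlation; a sustained drop suggests the offline metric is no longer a good proxy for business value. The column names and figures below are purely illustrative.

```python
import pandas as pd

# Hypothetical weekly log of an offline model metric and a business KPI.
weekly = pd.DataFrame({
    "week": pd.date_range("2021-01-03", periods=12, freq="W"),
    "model_auc": [0.81, 0.82, 0.81, 0.83, 0.82, 0.84, 0.84, 0.85, 0.85, 0.86, 0.86, 0.87],
    "quote_conversion_rate": [0.041, 0.043, 0.042, 0.044, 0.043, 0.044,
                              0.043, 0.042, 0.041, 0.040, 0.039, 0.038],
})

# Rolling correlation between the model metric and the KPI: a sustained drop
# toward zero (or below) means offline gains are no longer translating into
# business value, and the metric-KPI relationship should be re-examined.
weekly["metric_kpi_corr"] = weekly["model_auc"].rolling(window=6).corr(weekly["quote_conversion_rate"])
print(weekly[["week", "metric_kpi_corr"]].round(2).tail())
```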

Fiddler’s platform

Key insights:

  • Model performance is typically the clearest indicator of a deployed machine learning model’s health.
  • Monitoring the model’s input data (both raw and transformed), as well as the model’s predictions, can provide advance notification of impending performance degradations. Strive to detect drift as early as possible.
  • Combining feature importance metrics (e.g., SHAP values) with data drift metrics (e.g., Population Stability Index) can highlight which specific features pose the greatest threat to a deployed model (see the sketch after this list).
  • Quantify the relationship between model performance and business impact such that the relationship can be tracked at useful intervals.
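As a hedged sketch of the third insight above, one can combine a per-feature importance score (e.g., mean absolute SHAP value, assumed precomputed here) with a per-feature drift score (e.g., PSI, as in the earlier sketch) and rank features by their product, so the features that are both important and drifting surface first. The feature names and values are hypothetical.

```python
# Assumed precomputed inputs: mean |SHAP| per feature and PSI per feature
# (e.g., from the PSI sketch above). Feature names and values are hypothetical.
feature_importance = {"premium_amount": 0.42, "industry_code": 0.31, "years_in_business": 0.12}
feature_drift_psi = {"premium_amount": 0.03, "industry_code": 0.27, "years_in_business": 0.35}

# Rank features by importance x drift: a heavily drifted but unimportant
# feature matters less than a moderately drifted, highly important one.
drift_risk = {name: feature_importance[name] * feature_drift_psi[name] for name in feature_importance}
for name, score in sorted(drift_risk.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: importance-weighted drift = {score:.3f}")
```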

Top takeaways

MLOps is really hard

Building ML-powered products is hard even for established players with massive resources. TWIMLCon presented an opportunity to learn from others’ experience.

Challenges discussed at TWIMLCon included:

  • Maintaining multiple overlapping legacy systems.
  • Gaining adoption across a large team.
  • The difficulty of learning in real time.
  • Good models can go bad: static models can go out of date, re-trained models can cause feedback loops, and non-ML code can have a bug that makes predictions completely incorrect.
  • Training and inference pipelines can diverge.
  • There are a multitude of different capabilities people want from an “ML Platform,” and there isn’t a clear winner yet. Individual tools may have strong adoption (e.g., Kubeflow), but the complexity of piecing together multiple tools to satisfy the overall use case will be an ongoing consideration for the next few years.

Principles for building an ML platform

There were several talks at TWIMLCon about the journey a company takes to get to a successful ML platform. While specific tools and decisions are discussed elsewhere in this blog, we’ve gathered key principles below:

The goal of a platform should be to maximize actual research time

  • There should be a single source of truth for features.
  • Model deployment should be easy.
  • Researchers should not have to spend much time transitioning models to production.

Start from a simple solution that solves a problem, and iterate to more general solutions

  • “Don’t abstract things too early” — Shopify. Early iterations of a platform should be shaped by the specific problems to be solved. This requires solving a few problems to get a sense of where to go next.
  • “Be opinionated” — Spotify. In learning about the problem space, tradeoffs can be made to optimize the problems being solved. Spotify recommends giving the most flexibility where there is room for model improvement, and with the strongest opinions in data and experiment management.

“You can never over-invest in guardrails” — Netflix

  • Focusing on rock-solid guardrails has given Netflix confidence to proceed with new models and features. These guardrails encompass structural, unit, and data tests (e.g., watching for drift and ensuring ethical model behavior).

TWIMLCon has earned a place on the ATG calendar for 2022. We validated many of our current approaches and learned about exciting new tools that can help us focus on the important business problems we’re tackling within the insurance industry. The MLOps and enterprise AI spaces are rapidly maturing, and ATG is looking forward to helping them grow as we build the foremost AI Factory in the world for risk.

Contributing Authors

Kyle Hannon is a Staff AI Engineer at Acrisure Technology Group and leads the AI Engineering team. Previously, he worked at Jump Trading, building automated trading strategies across a variety of asset classes.

Samuel Taylor helps scientists and researchers make real business impact through good engineering. His interests include trustworthy models, imbalanced data, and (of course) putting machine learning into production.

Kelly Bean is an AI Engineer at Acrisure Technology Group. Previously, he’s helped several organizations develop and implement their first AI capabilities, from research to production. He has an M.S. in Data Science from Southern Methodist University.
