Machine Learning model governance at scale

Peter Saddow
Data Science at Microsoft
Nov 13, 2020

Author Peter Saddow is joined for this article by co-author Daniel Yehdego.

Today, predictive models help many companies run their business-critical processes. These predictive models are probabilistic — they involve chance variation — and they depend on underlying assumptions to work properly. Chief among these are 1.) that the data flowing into them is distributed in a certain way, and 2.) that the underlying business scenarios are unchanged from the time the models were created. As a result, it’s important for companies to monitor changes to business scenarios and data quality; otherwise, their models will no longer perform as intended.

Additionally, it’s important that these models comply with applicable laws and regulations once they are in production (active use). Because their deployment and improvement can have far-reaching impacts on users, business partners, and investors, transparent practices must be in place to manage the models efficiently and mitigate risk. A defined governance process is necessary to set out rules and controls for production models, including controlling access, testing, validating, logging changes and access, and tracing model results.

Models based on machine learning (ML) need governance just like other models, but perhaps even more so. This is because ML models are typically designed to improve automatically through experience. Although their ability to “learn” enables greater accuracy and predictability, it can also increase the types of risk mentioned earlier, as well as result in unintended biases. This dynamic nature of ML models means they require more frequent performance monitoring, constant data review and benchmarking, better contextual model inventory understanding, and actionable contingency plans.

It’s also important for teams to understand how machine learning is implemented and applied in their organizations so they can govern their models effectively and scale them with ease. A canonical model catalog is a necessity — without it, many companies do not know how their models work or even what they are used for. This leads to many problems, including duplicated work, incompatibility among models, unnecessary recalculations, and more.

As a result, it’s essential to establish rigorous governance processes that can quickly identify when an ML model begins to fail, complete with defined operating controls on inputs (data) and output (model results). The time to implement effective ML model governance is now — before today’s global, multidimensional marketplace and massive data volumes overload traditional model risk guardrails and governance practices. With the right governance in place, organizations can implement and use ML models — and the big data that typically accompanies them — securely and efficiently.

Microsoft Cloud+AI Customer Growth Analytics (CGA) Model Governance Framework: Machine Learning Governance Model

ML model development lifecycle and roles and responsibilities

Here in the Customer Growth Analytics (CGA) team in the Cloud+AI division at Microsoft, we follow a process to develop and deploy our models that includes the following stages:

1. Conception
2. Prototype/model evaluation
3. Production ready
4. Deployment
5. Production monitoring
6. Deprecation

These stages may vary depending on an organization’s culture and maturity level, but the process we’ve outlined should be sufficient to move from an ad hoc approach to a well-established one. Although these stages are typically sequential, multiple stages often happen in parallel.

ML model development stages

Details of these stages

  1. Conception: This is when it’s determined that a machine learning model is needed. The idea may come from the model owner, leadership, or a program manager. During this initial phase, all parties might not be fully committed and some necessary parties might not even yet be involved.
  2. Prototype/model evaluation: During this stage, the model owner works on building the model to validate the initial idea and the assumptions behind the model. This phase helps define the scope and expectations of the project and the impact of the model to formalize the plan moving forward. It’s important during this stage to secure the support of all team members who might be involved. If, however, during this stage the prototype does not validate initial assumptions, it’s important to seriously consider whether to continue.
  3. Production ready: This stage is dedicated to getting the model ready to run in a production environment by meeting and exceeding compliance requirements and any privacy criteria that apply. Model code and all dependent components should be fully automated to ensure the model can run with no manual intervention. The plan for how the model will be supported in production should be documented, including who will be responsible for addressing issues as they arise, the expected timeframe for addressing them, and who else should be contacted or notified, including the responsible owner.
  4. Deployment: In this phase the model is moved to the production environment and set to execute according to a defined schedule. Data sources are also onboarded and refreshed according to a set schedule. The engineering team or another team might be responsible for the actual deployment and requires support from the model owner to resolve any deployment issues. It is important to establish a data contract with upstream data providers and stakeholders to ensure expectations are documented.
  5. Production monitoring: In this stage, the model is running according to the defined schedule and is being monitored. Any issues encountered during execution are addressed according to the supportability plan already established. The root cause of any failures should be investigated and understood to ensure they don’t repeat. In this stage it is important to continuously improve the infrastructure and supportability of the model to improve its quality going forward.
  6. Deprecation: This stage applies when a decision is made to no longer support the model because the cost is too high, a newer model exists, or adoption isn’t meeting expectations. Prior to stopping a model running in production, it is important to notify stakeholders and other users and provide a path forward for them.

Other widely used ML model lifecycle processes include the Cross Industry Standard Process for Data Mining (CRISP-DM), the Team Data Science Process (TDSP), and Knowledge Discovery in Databases (KDD).

Communicating with stakeholders and leadership

Throughout the ML model lifecycle phases, it is important to provide communication on a regular basis to stakeholders, leadership, the core team, and other affected teams. This keeps them apprised and engaged. In the early stages, communication includes explanations of project status, risks, and expected delivery date to production. When the model is in production, communication focuses on the availability of model output, any delays in model output, and changes to model output or backwards compatibility. Automating communications and status reports as much as possible helps to ensure frequency and transparency.

In CGA, we use a single distribution list (DL), which includes all our end users and stakeholders, to disseminate the following communications. You may also want to consider whether you want to have different DLs for different consumers of the information.

Communication examples

Our communications fall into three categories: action notifications, alerts, and model status reports.

  • Action notifications indicate when users must make a decision or change. The following example illustrates a notification of a new field being added, which might affect users.
  • Alerts let stakeholders know about a model delay. In these situations, it is often necessary to send an initial communication when the delay becomes known and another when it is resolved.
  • Model status reports present the run status of an ML model owned by the team. Information includes model assets such as model owner, model run frequency, model run status, and external communication. The report also includes a stakeholder filter to enable stakeholders to quickly see the models they are dependent on.

Model metadata and execution status

The ability to send regular automated communications depends not only on having deployment status, but also on having related metadata in place to record status and other model properties. Start with basic questions, such as:

  • How are machine learning models defined?
  • How many models do you have in your platform?
  • Which of the models are currently being used by stakeholders?
  • When was the last time the models ran?
  • Who are the primary people to answer model-related questions, such as associated data scientists, data engineers, program managers, or other subject matter experts?

These questions might seem simple and obvious — but many teams have trouble answering them, or even coming up with them.

Teams can track these by incorporating them into their ML model governance process. Having a robust model metadata inventory system helps in many scenarios — here is a sample of model metadata the CGA team tracks, along with associated use cases:
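To make this concrete, here is a minimal sketch of what a model metadata record might look like in Python; the field names (owner, run_frequency, last_run_status, and so on) are illustrative assumptions rather than the CGA team's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class ModelMetadata:
    """Illustrative model inventory record; the fields are assumptions, not CGA's actual schema."""
    model_name: str
    owner: str                     # primary data scientist or SME to contact
    engineering_contact: str       # who resolves production issues
    lifecycle_stage: str           # e.g., "prototype", "production", "deprecated"
    run_frequency: str             # e.g., "daily", "weekly", "monthly"
    stakeholders: List[str] = field(default_factory=list)
    last_run_at: Optional[datetime] = None
    last_run_status: Optional[str] = None   # e.g., "succeeded", "failed", "delayed"

# Example: answering "when did the model last run, and who do I contact?"
churn_model = ModelMetadata(
    model_name="customer-churn",
    owner="alice@example.com",
    engineering_contact="data-eng@example.com",
    lifecycle_stage="production",
    run_frequency="weekly",
    stakeholders=["finance-team", "field-sales"],
    last_run_at=datetime(2020, 11, 1),
    last_run_status="succeeded",
)
```

A record like this, kept current for every model, is what makes the automated status reports and stakeholder-filtered views described above possible.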

Model performance monitoring (MPM)

Monitoring the performance of a model involves two key aspects: concept drift and data drift.

Concept drift

Concept drift occurs when the statistical properties of our model output — or, more specifically, of the value we are trying to predict — change over time. This can happen because a business scenario changes or the business itself is changing. It can also happen because the training data we used to train the model does not represent the entire population.

For example, let’s pretend we can go back in time and apply a model to predict churn in the earlier days of the telecom industry. At first, phone numbers were bound to a service provider, such as AT&T, T-Mobile, and so on. Customers who were not happy with their service were reluctant to make a change because they did not want to change their phone number. Imagine developing a churn model at that time, when there was no number portability, and that it predicts churn well in that environment. Now move forward to the time when number portability was introduced, and many customers who were unhappy with their service started moving to different service providers while keeping their phone number. With this change in business scenario, uncontemplated when the model was created, the prediction is no longer valid: A key assumption it was based on has changed. A new business scenario has come into existence that can cause a higher amount of churn.

As a result, the model must be recalibrated. Because we will not have much data for some time to use for recalibration, we may need to consider alternative approaches, such as a different probabilistic model or perhaps a deterministic model — one in which no randomness is involved — for deciding whether customer incentives can be given.

It’s important to note that continuous monitoring, as we’ve advocated earlier, may not alert or notify us about this sort of change. This doesn’t mean that monitoring is not valuable, but it does mean that even thoughtful automated monitoring may not detect concept drift.

Data drift

Data drift occurs when a particular feature or the data used in the model changes. Because the model relies on the data it receives, its predictive power can be affected by changes in the data or in its quality. For example, the upstream process that is sending data might be changed, affecting data downstream and causing the model to fail. Or, data coming from a third-party vendor might undergo a change in logic that affects the model. In these cases, the model will still try to make predictions based on its initial training, but because of the data issues, it cannot succeed as intended.

Additionally, business scenarios might change, causing new or additional data elements or categories to be introduced. For example, if a categorical variable changes or a new product category is introduced that the model has not yet encountered, it may be necessary to recalibrate and retrain the model so that it predicts correctly on these new instances.
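As a hedged illustration of that second case, a simple pre-scoring check can flag categorical values that never appeared in the training data; the product categories below are made up for the example.

```python
import pandas as pd

def find_unseen_categories(training: pd.Series, incoming: pd.Series) -> set:
    """Return categorical values present in incoming data but never seen during training."""
    return set(incoming.dropna().unique()) - set(training.dropna().unique())

# Hypothetical example: a new product category appears after the model was trained
train_products = pd.Series(["A", "B", "C"])
live_products = pd.Series(["A", "B", "C", "D"])  # "D" is new

unseen = find_unseen_categories(train_products, live_products)
if unseen:
    print(f"Unseen categories detected: {unseen} — consider recalibrating or retraining the model.")
```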

In CGA, we have automated the process of model performance monitoring with the following capabilities:

  1. Check to see whether the data inputs to the model are similar to those seen during training.
  2. Check to see whether model prediction distribution is similar to what was seen during model training or model validation.
  3. Ensure models are performing as expected over time, per model-level agreements.
  4. Alert on data or model drift to drive action as soon as the model violates some of the key underlying assumptions.
  5. Immediately take the model out of decision making for critical business processes when warranted (a sketch of how capabilities 4 and 5 might be wired together follows this list).
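The following sketch shows how capabilities 4 and 5 might be wired together: raising an alert when a drift metric crosses a threshold and flipping a serving flag so the model is pulled out of critical decision making. The drift metric (a two-sample Kolmogorov–Smirnov test), the threshold value, and the flag mechanism are all assumptions for illustration, not CGA's actual implementation.

```python
import logging
from scipy import stats

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-governance")

DRIFT_P_VALUE_THRESHOLD = 0.01   # illustrative threshold, not a CGA standard
model_serving_enabled = True     # stand-in for however serving is actually gated

def check_prediction_drift(training_scores, production_scores):
    """Compare the production prediction distribution to the training baseline (capability 2)."""
    statistic, p_value = stats.ks_2samp(training_scores, production_scores)
    return statistic, p_value

def enforce_governance(training_scores, production_scores):
    global model_serving_enabled
    statistic, p_value = check_prediction_drift(training_scores, production_scores)
    if p_value < DRIFT_P_VALUE_THRESHOLD:
        # Capability 4: alert as soon as a key assumption is violated
        logger.warning("Prediction drift detected (KS=%.3f, p=%.4f); notifying model owner.", statistic, p_value)
        # Capability 5: take the model out of critical decision making
        model_serving_enabled = False
    else:
        logger.info("Prediction distribution consistent with training (p=%.4f).", p_value)
```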

Model validation

The goal of model validation is to identify shifts in ML model system behavior that conflict with expectations. There are two main ways an ML model system can go wrong:

  1. Data science issues (data monitoring, scoring monitoring)
  2. Operations issues (system monitoring)

ML Monitoring workflow (source: Breck et al., 2017)

Naturally, we are interested in the accuracy of our model(s) running in production. Yet in many cases it is not possible to know the accuracy of a model immediately. Consider a fraud detection model: Its prediction accuracy can be confirmed on new live cases only if a police investigation occurs or some other checks are undertaken (such as cross-checking customer data with known fraud perpetrators). Similar challenges apply in many other areas where we don’t get immediate feedback (e.g., disease risk prediction, credit risk prediction, future property values, long-term stock market prediction, and so on). Given these constraints, it is logical to monitor proxy values for model accuracy in production, specifically:

  • Model prediction distribution (regression algorithms) or frequencies (classification algorithms).
  • Model input distribution (numerical features) or frequencies (categorical features), as well as missing value checks.

Model input monitoring

Given a set of expected values for an input feature, we can check that a.) the input values fall within an allowed set (for categorical inputs) or range (for numerical inputs), and b.) the frequencies of each respective value within the set align with what we have seen in the past.

Depending on our model configuration, we allow certain input features to be null or not. This is something we can monitor. If features that we generally expect not to be null start coming in as null, that could indicate a data skew or a change in consumer behavior, both of which would be cause for further investigation.
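A minimal sketch of these input checks, assuming a pandas DataFrame of incoming scoring data and baseline statistics captured at training time; the thresholds and column names are illustrative.

```python
import pandas as pd

def check_numeric_range(incoming: pd.Series, train_min: float, train_max: float) -> pd.Series:
    """Flag numerical values outside the range observed during training."""
    return (incoming < train_min) | (incoming > train_max)

def check_null_rate(incoming: pd.Series, baseline_null_rate: float, tolerance: float = 0.05) -> bool:
    """Return True if the null rate has drifted beyond the training baseline by more than the tolerance."""
    return incoming.isna().mean() > baseline_null_rate + tolerance

def compare_category_frequencies(train: pd.Series, incoming: pd.Series) -> pd.DataFrame:
    """Compare categorical value frequencies between training and incoming data."""
    return pd.DataFrame({
        "train_freq": train.value_counts(normalize=True),
        "incoming_freq": incoming.value_counts(normalize=True),
    }).fillna(0.0)

# Hypothetical usage against a scoring batch
scoring_batch = pd.DataFrame({"tenure_months": [3, 18, 250, None], "segment": ["SMB", "ENT", "SMB", "EDU"]})
out_of_range = check_numeric_range(scoring_batch["tenure_months"], train_min=0, train_max=120)
null_drift = check_null_rate(scoring_batch["tenure_months"], baseline_null_rate=0.01)
```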

Here is an example of how we configured dataset monitors in Azure Machine Learning (AML) to enable this support:

Drift magnitude trend
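For reference, a roughly equivalent monitor can be set up with the azureml-datadrift SDK. The sketch below is an assumption-laden illustration: the workspace, dataset, feature, and compute names are placeholders, and parameter names may differ slightly across SDK versions.

```python
# A sketch using the azureml-datadrift package; names and parameters are illustrative
# and may differ slightly across SDK versions.
from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector, AlertConfiguration

ws = Workspace.from_config()                                   # assumes a local config.json
baseline = Dataset.get_by_name(ws, "churn-training-data")      # placeholder dataset names
target = Dataset.get_by_name(ws, "churn-scoring-data")

monitor = DataDriftDetector.create_from_datasets(
    ws,
    name="churn-input-drift-monitor",
    baseline_data_set=baseline,
    target_data_set=target,
    compute_target="cpu-cluster",                              # placeholder compute name
    frequency="Week",
    feature_list=["tenure_months", "monthly_spend"],           # illustrative features to watch
    drift_threshold=0.3,                                       # alert when drift magnitude exceeds this
    alert_config=AlertConfiguration(email_addresses=["model-owner@example.com"]),
)
monitor.enable_schedule()                                      # start scheduled drift runs
```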

Model output monitoring

In either an automated or a manual process, we can compare our model prediction distributions using basic summary statistics such as the median, mean, standard deviation, and maximum/minimum values.

One simple test is to compare whether the mean values fall within the standard error of the mean interval, under the assumption of normally distributed variables. In addition to these types of basic statistical approaches, more advanced tests can be implemented to compare the distributions of the variables. Implementing advanced statistical tests, such as the t-test, ANOVA, or the Kolmogorov–Smirnov test, in a monitoring system can be difficult, though it’s possible.

Here is an example in which we compare the distribution of counts month over month.
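Here is a minimal sketch of both ideas: checking whether this month's mean prediction falls within the standard error interval of last month's mean, and a chi-square comparison of month-over-month count distributions. The numbers are fabricated for illustration.

```python
import numpy as np
from scipy import stats

def mean_within_sem(previous: np.ndarray, current: np.ndarray) -> bool:
    """Check whether the current mean falls within the previous mean +/- its standard error."""
    sem = stats.sem(previous)
    return abs(current.mean() - previous.mean()) <= sem

def compare_count_distributions(previous_counts: np.ndarray, current_counts: np.ndarray) -> float:
    """Chi-square test comparing month-over-month counts per prediction bucket; returns the p-value."""
    # Rescale expected counts so both totals match, as the chi-square test requires.
    expected = previous_counts * current_counts.sum() / previous_counts.sum()
    _, p_value = stats.chisquare(f_obs=current_counts, f_exp=expected)
    return p_value

# Fabricated example: churn scores for October and November
october_scores = np.random.default_rng(0).beta(2, 5, size=1_000)
november_scores = np.random.default_rng(1).beta(2, 5, size=1_000)
print("Mean stable:", mean_within_sem(october_scores, november_scores))

october_counts = np.array([500, 300, 200])    # counts per prediction bucket (low/medium/high)
november_counts = np.array([480, 310, 210])
print("Count-distribution p-value:", compare_count_distributions(october_counts, november_counts))
```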

Model dependency

Models do not exist in isolation. Often, data flows into models, generating output data, which can then be fed into other models. For example, if you have a model that projects future churn probability, it will likely have output that feeds many downstream models taking churn probability into consideration. There may be unintended consequences because of these interdependencies — for instance, the downstream models can be disrupted when data or models are changed upstream.

Models may be dependent on variables that are created or stored by other ML systems (internal or external). These systems may change the way they produce the data, and sadly, it’s common for this not to be communicated clearly. The knock-on effect is that today’s variables are not identical to those created a couple of years ago: The code of a function may change, the results may be slightly different, or a feature definition may change. For example, an external system may adjust the driving age from 16 to 18. If driving age is a significant feature in the model, this will change its predictions.

ML models may also depend on various packages, and each one of these packages will have multiple versions throughout its lifetime. When someone outside your team changes a package under their responsibility, whether it occurs upstream or downstream, it may ultimately change or break the expected functionality of your model. If it breaks, it is a matter of figuring out where the exception came from, and this can be time-consuming, especially when it’s external to the codebase. Sometimes you will not be aware there is a problem until a client complains.

In these scenarios, monitoring model dependencies in each environment saves expensive hours by shortening time to detection, time to response and resolution, and most importantly, shortening the time that customers and stakeholders are affected.

CGA is building a model dependency asset inventory that makes it easy to identify model dependencies and track them over time. Dependency version history is mapped to our ML model predictions so that we can figure out what went wrong, when it went wrong, and where it went wrong, and deal with it quickly and easily.
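One lightweight way to build such an inventory, as a hedged sketch: capture the versions of key packages at scoring time and store them alongside each prediction run, so that a later change in behavior can be traced back to a dependency change. The package list and storage format below are illustrative assumptions.

```python
import json
from datetime import datetime, timezone
from importlib import metadata

# Packages the model is known to depend on (illustrative list)
TRACKED_PACKAGES = ["numpy", "pandas", "scikit-learn"]

def snapshot_dependencies(model_name: str, run_id: str) -> dict:
    """Record dependency versions for a scoring run so behavior changes can be traced to a package change."""
    return {
        "model_name": model_name,
        "run_id": run_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "dependencies": {pkg: metadata.version(pkg) for pkg in TRACKED_PACKAGES},
    }

# Append the snapshot to a simple JSON-lines inventory (a real system might use a database)
with open("model_dependency_inventory.jsonl", "a") as inventory:
    inventory.write(json.dumps(snapshot_dependencies("customer-churn", "2020-11-01-weekly")) + "\n")
```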

Model usage telemetry and stakeholder tracking

To validate the impact of a model, it is important to establish metrics such as revenue affected or customer adoption. After the metrics are agreed upon, it is important to automate them through telemetry. At a minimum, model telemetry should measure customer adoption, feedback from users, and model execution status. Model telemetry should also validate the assumptions made during development about which stakeholders were expected to adopt the model. In CGA, we manage more than 60 ML models and numerous stakeholders. Being able to track the relationship between the models and stakeholders is critical to ensure that we are providing accurate communications to the appropriate stakeholders, and that our analysis of model impact is accurate.
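As a hedged sketch, usage telemetry can be captured as simple structured events that tie a model to the stakeholder consuming it; the event fields and logging destination are assumptions for illustration, not the telemetry system CGA uses.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
telemetry_logger = logging.getLogger("model-telemetry")

def log_model_usage(model_name: str, stakeholder: str, event_type: str, detail: str = "") -> None:
    """Emit a structured usage event (adoption, feedback, or execution status) for downstream aggregation."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": model_name,
        "stakeholder": stakeholder,
        "event_type": event_type,   # e.g., "output_consumed", "feedback", "run_succeeded"
        "detail": detail,
    }
    telemetry_logger.info(json.dumps(event))

# Hypothetical events for a churn model
log_model_usage("customer-churn", "field-sales", "output_consumed")
log_model_usage("customer-churn", "finance-team", "feedback", "scores look too low for enterprise accounts")
```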

Conclusion

In this article, we walked through several aspects of model governance, including scenarios that CGA has implemented, and presented the need for model governance at scale. We provided the fundamentals of model governance and described the ML development lifecycle, including roles and responsibilities. By utilizing the best practices discussed above — communication, ML model metadata status, ML model output and performance validation, recalibration using automated feedback, and interpretability — teams and organizations can better satisfy increased regulatory demands when implementing complex, robust, reliable and automated ML model system infrastructures.

We would like to thank Ron Sielinski and Casey Doyle for helping review the work.

Related article: “MLOps: Model management, deployment, and monitoring with Azure Machine Learning.”
