Incident Management in Machine Learning Systems

How to respond to and fix ML systems when they break

Vimarsh Karbhari
Acing AI
6 min read · Apr 30, 2020


Incident management starts with realizing that there is an active incident in one of the ML application systems. The critical areas for ML systems are the model, the service, and the infrastructure. The right systems, team, and process are the key to responding to and fixing faulty ML systems.

Photo by Sincerely Media on Unsplash

Alert Systems

Alert systems help the team monitor ML systems and receive alerts to orchestrate incident response workflows. Teams should also use the AWS Personal Health Dashboard, which provides alerts and remediation guidance when AWS is experiencing events that may impact them. Each cloud provider offers a similar health dashboard that should be leveraged.

These alerts may notify the team about usage, flag changes, application slowness, service disruption, infrastructure downtime, or model anomalies. Based on the nature of the alerts, there should be a triage process that classifies them into categories. Each category has a group of people who get alerted based on the severity of the category and the type of alert. Because each team has a different tolerance for availability and persistence issues, the team should set up a triage process that matches its tolerance level.
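To make the triage idea concrete, here is a minimal sketch of how alert categories and severities might be routed to on-call groups. The category names, severity levels, and group names are illustrative assumptions, not a prescription for any particular tool.

```python
# A minimal triage sketch: map (category, severity) pairs to the on-call groups
# that get paged. Category names and group names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Alert:
    category: str   # e.g. "model_anomaly", "service_disruption", "infra_downtime"
    severity: str   # e.g. "low", "high"
    message: str

# Tune this routing table to your team's tolerance for availability
# and persistence issues.
ROUTING = {
    ("model_anomaly", "high"): ["ml-oncall", "data-engineering"],
    ("model_anomaly", "low"): ["ml-oncall"],
    ("service_disruption", "high"): ["app-oncall", "mlops-oncall"],
    ("infra_downtime", "high"): ["mlops-oncall"],
}

def route_alert(alert: Alert) -> list:
    """Return the groups to page; unknown combinations land in a default triage queue."""
    return ROUTING.get((alert.category, alert.severity), ["triage-queue"])

alert = Alert("model_anomaly", "high", "Prediction distribution shifted vs. baseline")
print(route_alert(alert))  # ['ml-oncall', 'data-engineering']
```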

An AI/ML model can make the difference between life and death when deployed in a critical scenario. IBM Watson for Oncology, which promised to revolutionize cancer treatment, was cancelled after the University of Texas MD Anderson Cancer Center spent $62 million while Watson was still making poor treatment recommendations. Such recommendations should be logged, and reporting and alerting on them are also critical.

Tools

From a tooling perspective, PagerDuty, Opsgenie, Wavefront, Datadog, and Splunk work well for alerting. Each team will use more than one tool for different parts of its stack. Splunk/Datadog logging can be used to log model inputs, outputs, and metrics and helps measure model behavior in different areas. Wavefront can help with application analysis, and PagerDuty/Opsgenie can be used for infrastructure monitoring and alerting.
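Exact integrations differ per tool, but the common thread is emitting structured, queryable records for every prediction. Below is a hedged sketch that writes one JSON log line per prediction using Python's standard logging module; the field names and metric choices are assumptions, not any vendor's API.

```python
# A sketch of structured model logging that a Splunk/Datadog agent could ingest.
# Field names are assumptions; this uses only Python's standard library.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_serving")

def log_prediction(model_name: str, model_version: str,
                   features: dict, prediction: float, latency_ms: float) -> None:
    """Emit one JSON log line per prediction so incidents can be traced later."""
    record = {
        "event": "prediction",
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model_name,
        "version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))

log_prediction("churn_model", "1.4.2", {"tenure_months": 3}, 0.82, 12.5)
```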

Response team

However strong a data science team is, no one can eliminate edge cases entirely. The real world is a series of endless edge cases. It is important to have a specialized team that can fix AI/ML systems fast when they stumble.

This is the response team, which sits on the other side of the alerting systems. Once they receive an alert, they take the next steps to mitigate the issue. Sometimes these teams are referred to as on-call teams. The team should have a mix of engineers who know the engineering applications, data engineers who understand the ML system/model components, and MLOps/DevOps engineers who can help with mitigation.

The job of the response team is to think of everything that can and will go wrong with AI/ML systems. The team has three major jobs:

  • Triage short-term problems within the ML application
  • Find solutions for long-term problems, like drift and hidden bias
  • Build unit tests and design end-to-end machine learning pipelines so that issues do not recur in any model or model version that passes those tests on the way to production (a minimal gating check is sketched below).
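As one hedged illustration of the third job, a promotion gate can block any model version that regresses against the production baseline or fails previously reported edge cases. The metric, thresholds, and toy models below are assumptions for illustration.

```python
# A sketch of a promotion gate: a candidate model version must pass these checks
# before reaching production. Metric choice and thresholds are assumptions.
from typing import Callable, Sequence, Tuple

Example = Tuple[dict, int]  # (features, expected label)

def accuracy(predict: Callable[[dict], int], examples: Sequence[Example]) -> float:
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)

def promotion_gate(candidate, baseline, holdout: Sequence[Example],
                   edge_cases: Sequence[Example]) -> bool:
    """Pass only if the candidate does not regress and still handles known edge cases."""
    if accuracy(candidate, holdout) < accuracy(baseline, holdout) - 0.01:
        return False          # regression versus the current production model
    if accuracy(candidate, edge_cases) < 0.95:
        return False          # fails edge cases collected from past incidents
    return True

# Toy usage with trivial stand-in models.
holdout = [({"x": 1}, 1), ({"x": 0}, 0)]
edge_cases = [({"x": -1}, 0)]
baseline = lambda f: int(f["x"] > 0)
candidate = lambda f: int(f["x"] >= 1)
print(promotion_gate(candidate, baseline, holdout, edge_cases))  # True
```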

There should be an additional communication liaison on this team. The role of the communication liaison becomes more important the higher the visibility of the ML model. This could be the engineering manager or a PR communication executive or anyone in between who knows how to give the right kind of updates.

Incident Management Process

It is important to have an incident management process that runs from receiving an alert, through a root cause analysis (RCA), to releasing a long-term fix.
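One hedged way to make this process explicit is to track every incident through the same stages, from alert to long-term fix. The stage names below mirror the phases described in this article; everything else is an illustrative sketch.

```python
# A sketch of the incident lifecycle as explicit stages, from alert to long-term fix.
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    ALERT_RECEIVED = auto()
    TRIAGED = auto()
    SHORT_TERM_FIX = auto()
    RCA_COMPLETE = auto()
    LONG_TERM_FIX_RELEASED = auto()

@dataclass
class Incident:
    title: str
    stage: Stage = Stage.ALERT_RECEIVED
    notes: list = field(default_factory=list)

    def advance(self, next_stage: Stage, note: str) -> None:
        """Move the incident forward and record what was done at each step."""
        self.notes.append(f"{self.stage.name} -> {next_stage.name}: {note}")
        self.stage = next_stage

incident = Incident("Model anomaly: prediction drift on churn_model v1.4.2")
incident.advance(Stage.TRIAGED, "Fault located in the feature pipeline")
incident.advance(Stage.SHORT_TERM_FIX, "Rolled back to model v1.4.1")
print(incident.stage.name, incident.notes)
```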

Communication responses

Communication runs alongside the entire process. Each phase should have appropriate communication go out to different parts of the organization and beyond. Communications to internal parts of the organization help create alignment, secure resources, and set expectations about timelines.

In terms of external communications, teams should proceed with caution and always have PR and legal liaisons run a check before those communications go out. This group will need to spend time with engineers and data scientists to understand AI/ML decision-making at a high level, with a special emphasis on the kinds of errors it makes versus the kinds of errors a human makes. They will also need to develop effective ways to describe AI/ML models in simple, straightforward language for non-tech people to understand.

Triage: ML, MLOps, and full-stack application teams

This is a crucial part of the process. The first step is to analyze all the alerts and logs to locate the problem in the system. The problem could stem from the software application, the backend infrastructure, the ML system or model, or the intersection of any of these areas. A divide-and-conquer approach works best for narrowing down the problem, as sketched below.
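A hedged sketch of divide and conquer in code: run cheap health checks layer by layer and report the first layer that fails. The individual checks here are illustrative stubs; real ones would probe your API, hosts, feature pipeline, and prediction distributions.

```python
# Divide-and-conquer triage: run cheap checks layer by layer and report the
# first layer that fails. The checks below are illustrative stubs.
from typing import Callable, Dict

def locate_fault(checks: Dict[str, Callable[[], bool]]) -> str:
    """Return the first layer whose health check fails, or 'unknown' if all pass."""
    for layer, check in checks.items():
        if not check():
            return layer
    return "unknown"

layer_checks = {
    "application": lambda: True,      # e.g. probe request to the API returns 200
    "infrastructure": lambda: True,   # e.g. hosts healthy, queue depth normal
    "data_pipeline": lambda: False,   # e.g. today's feature table arrived on time
    "model": lambda: True,            # e.g. prediction distribution near baseline
}

print(locate_fault(layer_checks))  # -> "data_pipeline"
```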

Issue resolution and RCA

Once the problem is located, the team should decide on strict timelines for the short-term fix, the long-term fix, the RCA, and any rollbacks. Usually the first step is to assess the risk associated with each of these options. A botched ML system/model deployment can be rolled back to prevent further damage (see the sketch below). If a newly discovered issue with the ML system/model did not exist or was not known previously, the team should evaluate both a short-term and a long-term fix and decide how to address it. Issues with the implementation code can be addressed the same way. Sometimes a short-term fix is followed by a long-term fix that the RCA highlights.
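As a minimal sketch of the rollback decision, the team can compare the new model's live error rate with the previous version's baseline and roll back when it is materially worse, while a longer-term fix is evaluated. The tolerance value and the rollback hook are assumptions.

```python
# A sketch of the rollback decision for a botched model deployment.
# The tolerance threshold and the rollback hook are illustrative assumptions.
def should_roll_back(new_error_rate: float, baseline_error_rate: float,
                     tolerance: float = 0.02) -> bool:
    """Roll back if the new model is worse than the baseline by more than `tolerance`."""
    return new_error_rate > baseline_error_rate + tolerance

def rollback(previous_version: str) -> None:
    # In practice this would call your deployment system, e.g. re-point the
    # serving endpoint at the previous model artifact.
    print(f"Rolling back to model version {previous_version}")

if should_roll_back(new_error_rate=0.11, baseline_error_rate=0.05):
    rollback("v1.4.1")
```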

Regardless of the approach, an RCA should be done for the issue. The RCA provides a closer look at the issue and outlines the plan and action items needed to add testing and bring closure to the issue.

In terms of issue resolution, common solutions include one or more of the following:

Dataset upgrades:

  • Expanding the dataset to include more varied inputs. This could be done either by buying another dataset or creating another one internally.
  • Augmenting the existing dataset with additional inputs.
  • Creating a new synthetic dataset (modeled on existing datasets to avoid introducing new issues); a simple augmentation sketch follows this list.
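A simple, hedged sketch of augmentation modeled on the existing dataset: add noisy copies of existing rows, scaling Gaussian noise by each feature's standard deviation so the synthetic points stay close to the real distribution. The noise scale is an assumption, and augmented data should be validated before training on it.

```python
# Synthetic augmentation modeled on the existing dataset: jitter numeric features
# with Gaussian noise scaled by each feature's standard deviation.
# The noise scale is an assumption; validate augmented data before training on it.
import numpy as np

def augment_with_noise(X: np.ndarray, y: np.ndarray,
                       n_copies: int = 1, noise_scale: float = 0.05,
                       seed: int = 0):
    """Return the original data plus noisy copies that keep the original labels."""
    rng = np.random.default_rng(seed)
    feature_std = X.std(axis=0)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, noise_scale * feature_std, size=X.shape)
        X_parts.append(X + noise)
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)

X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])
y = np.array([0, 1, 0])
X_new, y_new = augment_with_noise(X, y, n_copies=2)
print(X_new.shape, y_new.shape)  # (9, 2) (9,)
```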

ML System/Model updates:

  • A botched new ML system deployment could be mitigated with a canary setup that compares the new model version against the older version (see the sketch after this list).
  • Adding a new model to form a composite (ensemble) model that improves overall performance.
  • Training a Generative Adversarial Network (GAN) to make the existing model more robust by generating spoofed/fake values.
  • Building a complementary rule-based system to augment the ML model.
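To illustrate the canary idea, here is a hedged sketch in which a small fraction of requests also runs through the candidate model, the production answer is always served, and disagreements are logged for review before the new version is promoted. The split ratio and the toy models are assumptions.

```python
# A canary/shadow sketch: a fraction of traffic also runs through the candidate
# model; the production answer is always served and disagreements are logged.
# The split ratio and the toy models are illustrative assumptions.
import random

def serve(request: dict, prod_model, candidate_model,
          canary_fraction: float = 0.05, disagreements: list = None):
    prod_pred = prod_model(request)
    if disagreements is not None and random.random() < canary_fraction:
        cand_pred = candidate_model(request)
        if cand_pred != prod_pred:
            disagreements.append((request, prod_pred, cand_pred))
    return prod_pred  # users always get the production model's answer

# Toy models standing in for the old and new versions.
prod_model = lambda r: int(r["score"] > 0.5)
candidate_model = lambda r: int(r["score"] > 0.4)

log = []
for s in [0.3, 0.45, 0.6, 0.9]:
    serve({"score": s}, prod_model, candidate_model, canary_fraction=1.0, disagreements=log)
print(log)  # requests where the two versions disagree
```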

Testing

It is not enough to just test the accuracy of your models. When we deploy newer systems/models, we will also discover new edge cases, and those edge cases can have real-world consequences. The team will need to investigate its data pipelines and add automated unit tests to make sure the model can handle those cases. When your models retrain on new data or the algorithms get updated, the team should make sure the issue does not happen again on the same model version or on any other model; a minimal regression test is sketched below. This is described in detail in QA in Data Science.
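One hedged way to make "the issue does not happen again" enforceable is a regression test that replays the inputs from the incident against every new model version before it ships. The toy model and hard-coded cases below stand in for real artifacts and recorded incident data.

```python
# A sketch of an incident regression test (pytest style): replay the inputs that
# triggered the incident against every new model version before it ships.
# The toy model and hard-coded cases stand in for real artifacts.

# Cases recorded from the incident: inputs that previously produced bad predictions.
INCIDENT_CASES = [
    ({"age": 0, "income": 0.0}, 0),    # degenerate input that broke the old version
    ({"age": 120, "income": 1e9}, 1),  # extreme-but-valid input that was misclassified
]

def candidate_model(features: dict) -> int:
    """Stand-in for the new model version under test."""
    return int(features["income"] > 50_000)

def test_incident_cases_do_not_regress():
    for features, expected in INCIDENT_CASES:
        assert candidate_model(features) == expected, f"Regression on {features}"
```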

Release long term fix

Usually, a short-term fix should be followed by a time-bound long-term fix. The QA process for the long-term fix and release is covered in QA in Data Science.

Recap and Recommendations

Incident management is a very important part of modern software systems, and now of ML systems as well. In my experience, having the right people design and re-evaluate the incident management process, and having the right team to execute that process whenever new incidents arise, is the secret to success. There is no silver bullet, but calculated risk and the deployment of the right human capital have consistently solved these problems.


Subscribe to our Acing Data Science newsletter for more such content.

Thanks for reading! 😊
