Operationalizing AI — Managing the End-to-End Lifecycle of AI

John Thomas
Inside Machine Learning
Jul 19, 2019

As they journey toward AI, most organizations establish data science teams staffed with people skilled in ML/DL algorithms, frameworks and techniques. Yet many of those organizations struggle to make their AI projects truly relevant to the business: the projects never reach full production or get integrated with existing applications and processes. That is why so many line-of-business stakeholders consider only a small percentage of AI projects to be true successes.

We’re seeing clients across industries quickly recognizing that they need a systematic approach to “operationalizing” AI in order to drive AI success. That approach means managing the complete end-to-end lifecycle of AI. Some refer to this set of considerations as “AI Ops,” but we will stay away from that term because it is also sometimes used to refer to AI for IT Operations.

Considerations for “Operationalizing AI” include:

  1. Scope: How can a data science team scope the deluge of use case requests and decide on priority? What role do business KPIs play on the road to success?
  2. Understand: What datasets does the team need to implement the use case?
  3. Build: What tools and techniques are relevant for data preparation, feature engineering, model training and building?
  4. Run: How does the team deploy and track versions of models and make them available for scoring, whether online or batch?
  5. Manage: How does the team ensure high performance via monitoring and retraining? How do they establish trust and transparency by attending to bias detection, fairness, explainability, etc.?

We know that the Build-Run-Manage paradigm needs to integrate with existing CI/CD mechanisms to shepherd data science assets from Dev to QA to Production, and that it needs to work with popular platform technologies like Docker and Kubernetes/OpenShift. We also know that the Manage stage needs to establish correlations between model metrics and business KPIs.

Let us take a closer look at each of the five areas.

1. Scope

A data science team is a precious resource for any organization, but it is often swamped with use case requests from across the company. How can the team scope and prioritize those requests so that its capacity is put to the best possible use?

They start by exploring business ideas in detail, insisting on clarity around the business KPIs. Without that clarity, a team is far too likely to judge the success of a project by the performance of the model — rather than by its impact on the business. Consider an example. A company’s HR department wants to use AI to predict which managers will be high performers. In this case, the business metric might be something outside the feature set and model output — say, a business KPI like employee attrition rate. How will the data science team capture this metric if it’s neither a model input nor a model output? How will they hope to correlate model performance metrics like precision/recall to this external business KPI?

Too often, a business captures its KPIs in emails, slides, or meeting notes, where they’re impossible to track, especially as they change. Insist on capturing and understanding business KPIs up front (even if they change later). Doing so allows the data science team to prioritize and create swim lanes for each use case they undertake. Later in the lifecycle, these KPIs will need to be evaluated and correlated to model performance.
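One lightweight way to make those KPIs trackable is to capture them as structured records tied to each use case rather than leaving them in slides or email. A minimal Python sketch; the fields, names and values below are purely illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class BusinessKPI:
    name: str          # e.g. "employee attrition rate"
    baseline: float    # value before the AI project
    target: float      # value the business wants to reach
    unit: str = "%"    # how the KPI is measured

@dataclass
class UseCase:
    title: str
    sponsor: str                                # line-of-business owner
    kpis: list = field(default_factory=list)    # list of BusinessKPI records
    priority: int = 0                           # assigned during scoping
    captured_on: date = field(default_factory=date.today)

# The HR example from above, captured up front so it can be revisited later
hr_use_case = UseCase(
    title="Predict high-performing managers",
    sponsor="HR",
    kpis=[BusinessKPI(name="employee attrition rate", baseline=12.0, target=9.0)],
    priority=1,
)
```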

2. Understand

We all know the adages, “There is no Data Science without Data” and “Garbage In, Garbage Out”. It’s clear that good data stands at the heart of any successful AI project. But it’s still worth asking whether the data science team fully understands the data sets they’re dealing with. Do they understand metadata and how it maps to a business glossary? Do they know the lineage of the data, who owns the data, who has touched and transformed the data, and what data they’re actually allowed to work with?

On one occasion, we were working with a client data science team on a fraud detection use case. Our models performed well on hold-out data but performed poorly in production. We considered whether we needed to re-train the model to catch new fraud patterns, but no: something else was fundamentally wrong. Much to the surprise of all, we learned that we’d been working with data generated using rules, not ground truth, despite assurances from the data provider. It was a failure to understand and insist on data lineage — and a demonstration of how important it can be to work with a data steward.

Empower your data science team to “shop for data” in a central catalog — using either metadata or business terms to search, as if they were shopping for items online. Once they get a set of results, give them the ability to explore and understand the data — including its owner, its lineage, its relationship to other datasets, and so on. Based on that exploration, they can request a data feed. Once approved (if approval is needed), make the datasets available via a remote data connection in your data science development environment. As much as possible, respect data gravity and avoid moving data. If the team needs to work with PII or PHI data, establish the appropriate rules and policies to govern access. For example, you can anonymize/tokenize sensitive data. This stage is really about understanding the data you need for your AI initiative within a context of smooth data governance.
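As one illustration of the anonymize/tokenize step, sensitive identifier columns can be replaced with stable, non-reversible tokens before a dataset is exposed in the development environment. A minimal pandas sketch, where the column names and the salt handling are assumptions made for illustration:

```python
import hashlib

import pandas as pd

def tokenize(value: str, salt: str) -> str:
    """Replace a sensitive value with a stable, non-reversible token."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()[:16]

# Hypothetical customer extract containing PII columns
df = pd.DataFrame({
    "customer_id": ["C001", "C002"],
    "email": ["a@example.com", "b@example.com"],
    "balance": [1200.50, 87.25],
})

SALT = "rotate-me-per-environment"   # in practice, manage this as a governed secret
for col in ["customer_id", "email"]:
    df[col] = df[col].apply(lambda v: tokenize(v, SALT))

print(df)   # analytics-friendly columns remain; direct identifiers are tokenized
```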

3. Build

Most data scientists relish the build phase of the AI lifecycle — where they can explore the data to understand patterns, select and engineer features, and build and train their models. This is where a myriad of tools and frameworks come together:

  • Open languages — Python is the most popular, with R and Scala also in the mix.
  • Open frameworks — Scikit-learn, XGBoost, TensorFlow, etc.
  • Approaches and techniques — Classic ML techniques from regression all the way to state-of-the-art GANs and RL
  • Productivity-enhancing capabilities — Visual modeling, AutoAI to help with feature engineering, algorithm selection and hyperparameter optimization
  • Development tools — DataRobot, H2O, Watson Studio, Azure ML Studio, SageMaker, Anaconda, etc.

Of course, the challenge is to select the right options to support a smooth pipeline for data preparation, feature engineering, and model training.
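Whatever combination of tools the team settles on, the outcome is usually some form of reproducible pipeline that chains those steps together. A minimal scikit-learn sketch, using a tiny made-up dataset loosely modeled on the HR example (all column names and values are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative dataset; in practice this comes from the governed
# datasets identified in the Understand stage.
df = pd.DataFrame({
    "tenure_months":   [12, 48, 7, 30, 60, 3],
    "projects_led":    [1, 5, 0, 3, 6, 0],
    "department":      ["sales", "eng", "sales", "hr", "eng", "hr"],
    "high_performer":  [0, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns=["high_performer"]), df["high_performer"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
     ["tenure_months", "projects_led"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
])

model = Pipeline([
    ("prep", preprocess),                   # data preparation / feature engineering
    ("clf", GradientBoostingClassifier()),  # the estimator itself is easy to swap
])
model.fit(X, y)
print(model.predict(X.head(2)))
```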

4. Run

Let’s say the team has built a model. In fact, suppose they’ve built multiple versions of the model. How do they actually deploy them? After all, there is no point in the models staying inside the data scientist’s workbench.

The team should be able to publish metadata around model versions into a central catalog where application developers can “shop for models”, much as the data science team shopped for data. The actual invocation of the models depends on the use case, and what it looks like to access them can vary:

  • Developers might invoke a model over a REST API call from a web/mobile application or process (see the sketch after this list).
  • They might use a model to score a million records in batch mode at the start of day for a wealth management advisor.
  • They might use a model to perform inline scoring in near real-time (less than a few milliseconds) on a backend mainframe system in order to authorize a credit card transaction.
  • They might use a model to execute a series of classifiers in real time to detect intent (and change in intent) as a call center agent interacts with a customer.
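Here is a minimal sketch of the online (REST) mode, assuming the trained pipeline from the Build sketch has been saved with joblib and is served with Flask; the file name, route and payload shape are assumptions rather than a prescribed stack:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")   # the versioned artifact produced during Build

@app.route("/v1/score", methods=["POST"])
def score():
    """Score one or more records posted as a JSON list of feature dicts."""
    records = request.get_json()
    features = pd.DataFrame(records)
    predictions = model.predict(features)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

An application developer could then POST feature records to /v1/score from a web or mobile application; the batch and near-real-time modes swap the serving layer but reuse the same versioned model artifact.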

Regardless of the mode of invocation and latency requirements, you see a common paradigm: deploying model versions into an appropriate runtime. Most popular development environments have a runtime component that facilitates the Run concept, albeit with differing levels of sophistication. The data science team needs to work with business and IT stakeholders to understand what Run option makes sense.

But how did the team go from Build to Run? If they were doing a skunkworks science project, it might be fine to simply hit the “Deploy” button. Not so in an enterprise setting. The data science team needs to apply the same rigor as application development when promoting assets through each of these stages. To do so, they need to think about how assets move from Development to QA/Staging to Production. In many companies, moving a set of assets out of Dev requires a sequence of steps: code review, third-party oversight, a series of unit tests (often run with different data sets than the developer used), approval, etc. Thinking of data science development in similar terms brings us into a Continuous Integration/Continuous Deployment (CI/CD) paradigm, supported by a CI/CD pipeline.

Most enterprises already have CI/CD mechanisms in place — GitHub Enterprise or BitBucket as the source repository, Jenkins or Travis as the pipeline, Nexus for binaries, etc. The data science pipeline needs to find ways to fit into these existing enterprise mechanisms. Ideally, when the data science team finishes the Build phase, they can tag, commit and push all assets (Jupyter notebooks, pipelines for data prep, the actual models, evaluation scripts, etc.) into the appropriate repositories. In particular, the commit and push can initiate a CI/CD pipeline that follows a sequence of steps (test, review, approve etc.), and then creates a set of deployments in the Run environment. We can think of this happening between Dev, QA/Staging and Production, with slightly different steps for each transition.
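What those pipeline steps actually run can be as simple as a scripted evaluation gate executed against a hold-out dataset the developer never touched. A minimal pytest-style sketch; the file paths, metric and threshold are assumptions:

```python
# test_model_gate.py -- run by the CI/CD pipeline before assets are promoted
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

AUC_THRESHOLD = 0.80   # agreed with the business during the Scope stage

def test_model_meets_auc_threshold():
    model = joblib.load("artifacts/model.joblib")     # artifact committed/tagged in Build
    holdout = pd.read_csv("qa_data/holdout.csv")      # data the developer did not use
    X, y = holdout.drop(columns=["label"]), holdout["label"]
    scores = model.predict_proba(X)[:, 1]
    assert roc_auc_score(y, scores) >= AUC_THRESHOLD  # fail the pipeline if the gate is not met
```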

5. Manage

Once the models are in QA or Production, it’s crucial to think about both performance and trust/transparency.

Regardless of how well a model performs when the team first deploys it, the model will almost certainly degrade over time. New fraud patterns, new customer intents, and other changes in the environment that weren’t present in the training data mean that the data science team needs ways to monitor the model metrics and evaluate them against thresholds on a schedule. If the metrics breach a threshold, the system can alert the team or even initiate an automated process for retraining the model.
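Conceptually, the monitoring loop is straightforward: compute the model metrics on recently labeled production data, compare them against the agreed thresholds, and alert or trigger retraining when a threshold is breached. A minimal sketch; the metrics, floors and callback hooks are assumptions:

```python
from sklearn.metrics import precision_score, recall_score

PRECISION_FLOOR = 0.70   # thresholds agreed with the business
RECALL_FLOOR = 0.60

def evaluate_deployed_model(y_true, y_pred, alert, retrain):
    """Scheduled job: compare recent production metrics against thresholds."""
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    if precision < PRECISION_FLOOR or recall < RECALL_FLOOR:
        alert(f"Model degraded: precision={precision:.2f}, recall={recall:.2f}")
        retrain()   # e.g. open a ticket or kick off the CI/CD retraining pipeline
    return {"precision": precision, "recall": recall}
```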

As they continue to operationalize AI, stakeholders are increasingly concerned about trust and transparency. They want to know whether they can explain model behavior. They want to know which features, changed by what margin, would have changed the model outcome. And they want to know that the model is behaving fairly within a range the business has specified. If the model recommends rejecting a loan, can the data science team play that specific transaction back and explain to a regulator why it was rejected? Can they prove that the model is being fair within a given acceptable range of fairness? Can they prove that the model is not incorrectly or inappropriately biased?

Note that these requirements become especially challenging when the model doesn’t actually use the protected features. Consider the earlier HR application example. If the business decides not to use race, gender and age as features in the model, but still wants to understand how those attributes play a role, they can take the output of the model and compare against those attributes external to the model. But without the features in the model itself, it’s a challenge to evaluate model explainability, fairness, indirect bias, etc.
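Checking fairness against attributes held outside the model can work exactly as described above: join the model’s outputs back to the protected attribute and compare favorable-outcome rates across groups. A minimal disparate impact sketch, where the column names, sample values and the commonly cited 0.8 review threshold are illustrative assumptions:

```python
import pandas as pd

def disparate_impact(scored: pd.DataFrame, protected_col: str, favorable_col: str) -> float:
    """Ratio of favorable-outcome rates between the lowest- and highest-rate groups."""
    rates = scored.groupby(protected_col)[favorable_col].mean()
    return rates.min() / rates.max()

# Model outputs joined, outside the model, with a protected attribute
scored = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "predicted_high_performer": [1, 1, 0, 1, 1, 1],
})

ratio = disparate_impact(scored, "gender", "predicted_high_performer")
print(f"Disparate impact ratio: {ratio:.2f}")   # ratios well below ~0.8 often warrant review
```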

Another challenge is correlating model performance metrics and trust/transparency elements to the business KPIs that we captured in the Scope stage. Without that correlation, it’s difficult for the business to tell whether an AI project is actually succeeding.
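One pragmatic way to build that view is to log model metrics and the business KPI on the same timeline and examine them side by side. A minimal sketch; the numbers are placeholders purely to show the mechanics:

```python
import pandas as pd

# Monthly snapshots: a model metric next to the business KPI it is meant to move
history = pd.DataFrame({
    "month": ["2019-01", "2019-02", "2019-03", "2019-04", "2019-05"],
    "model_recall": [0.81, 0.79, 0.74, 0.70, 0.66],
    "attrition_rate": [10.8, 11.0, 11.6, 12.1, 12.9],   # KPI captured in Scope
})

correlation = history["model_recall"].corr(history["attrition_rate"])
print(f"Correlation between model recall and attrition rate: {correlation:.2f}")
```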

IBM has a comprehensive point of view and an integrated set of offerings around managing the complete lifecycle of AI. We invite our enterprise clients to reach out and collaborate with us to elevate AI from the “art of the possible” to “driving digital transformation”.

(Thank you Matthew Walli, Bill Mathews, Kevin Hall and other IBMers who are collaborating with me on this topic)
