Using Kubernetes Operators to Manage the Lifecycle of AI Applications

Benjamin Herta
4 min read · Sep 29, 2020


Over the past year, we in IBM Research have been considering the difficulties around the care and feeding of AI applications and their associated AI models. There has been great progress over the past several years in improving model accuracy across many domains, and while a lot can be gained by adopting new model architectures, the process of retraining models, validating them, getting proper sign-offs, and so on adds cost and risk that can make it difficult to bring these improvements into production. Even the best models can go stale over time, as customer preferences change, assumptions built into the models change, regulations change, and new data becomes available. The easier it is for your organization to keep your models fresh, the more value you will get from your AI applications. This requires serious attention to automation, and visibility into what the automation does.

One of the first things we did was to look at all the steps that data and models go through in the AI lifecycle. These run the gamut, from dataset topics such as selection, access controls, suitability analysis, and cleaning, to model building, bias analysis, validation, deployment, and production monitoring. We looked at which tools are used at each step, which steps may require sign-off, and how we could put all the pieces together at an API level. IBM’s Cloud Pak for Data collection of services, including Watson Knowledge Catalog, Watson Studio, Watson Machine Learning, and Watson OpenScale, provides tools that cover different portions of this lifecycle. You may have seen demos of the web interface for each of these products and how data scientists and operations teams use them, and there are example notebooks and other documentation for each that show how to use them at an API level. But if our goal was to make it easy to keep AI models fresh, we needed code that knows how to perform every step, including checks and error handling, code to orchestrate those steps, and code to trigger execution of that orchestration.
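
To make that last point concrete, here is a minimal sketch, in Go, of what “code to perform every step plus code to orchestrate those steps” can look like. The stage names and their no-op bodies are illustrative stand-ins, not actual product APIs; in practice each step would call into a service such as Watson Studio or Watson OpenScale.

```go
package main

import (
	"fmt"
	"log"
)

// stage is one step of the AI lifecycle: a name plus the code that knows how
// to perform it, including its own checks and error handling.
type stage struct {
	name string
	run  func() error
}

func main() {
	// Illustrative stand-ins: each run function would really call out to a
	// service such as Watson Knowledge Catalog (data selection, access
	// controls), Watson Studio / Watson Machine Learning (model building,
	// validation, deployment), or Watson OpenScale (bias, monitoring).
	pipeline := []stage{
		{"select-and-clean-data", func() error { return nil }},
		{"build-model", func() error { return nil }},
		{"analyze-for-bias", func() error { return nil }},
		{"validate-model", func() error { return nil }},
		{"deploy-model", func() error { return nil }},
		{"configure-monitoring", func() error { return nil }},
	}

	// Orchestration: run the stages in order and stop on the first failure,
	// so a broken step never pushes a half-finished model toward production.
	for _, s := range pipeline {
		if err := s.run(); err != nil {
			log.Fatalf("stage %q failed: %v", s.name, err)
		}
		fmt.Printf("stage %q complete\n", s.name)
	}
}
```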

Working closely with the product teams, we built up a series of components to handle each of the lifecycle stages, added some simple orchestration, and put together a number of example workloads to test it all with. Our access to product developers ensured we were benefiting from the latest APIs and client code, and where there were gaps in functionality, our research team contributed code back to the product teams. With the recently released cpdctl command-line utility, and a pipeline editing and execution interface coming soon to orchestrate the steps that use it, you too can compose and automate the execution of most AI lifecycle stages, and these products will continue to gain new capabilities over time.

The next chapter for us in research is to shift from the individual steps that make up the AI lifecycle to a more sophisticated orchestration of those steps. To do this, we are leveraging the capabilities of OpenShift Operators to enable a Kubernetes cluster to understand and manage the lifecycle of AI applications, in the same way it can manage the lifecycle of distributed databases, Apache Spark jobs, or any other Kubernetes-native application.

Our operator supports a custom resource definition (CRD) called simply “aiapp”, which represents one or more AI models that belong to a single AI application, and optionally the application itself. The purpose of this operator is to manage the lifecycle of the models associated with the application, from data evaluation to model building, model evaluation, deployment, and monitoring. The operator is responsible for deploying and maintaining the deployed models, potentially taking action whenever enough newly labeled data becomes available for retraining, when monitors indicate the models have drifted because production data no longer matches the data they were trained on, whenever code changes are made to any of the executing components, when policies related to the data the models were trained on change, and so on.
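
The post above doesn’t publish the aiapp schema, so the following kubebuilder-style sketch of its Go types is purely illustrative; every field name (TargetStage, Models, TrainingRef, Monitors, CurrentStage) is an assumption, chosen to match the lifecycle described here.

```go
// Package v1alpha1 sketches hypothetical Go types for the aiapp custom
// resource. (kubebuilder’s code generation would add the deepcopy methods
// omitted here.)
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ModelSpec describes one AI model that belongs to the application.
type ModelSpec struct {
	Name string `json:"name"`

	// TrainingRef points at a user-supplied model architecture and
	// parameters; if empty, the operator could fall back to AutoAI.
	TrainingRef string `json:"trainingRef,omitempty"`

	// Monitors selects OpenScale monitors for the deployed model,
	// e.g. "quality", "drift", "fairness".
	Monitors []string `json:"monitors,omitempty"`
}

// AIAppSpec is the desired state: which models the application owns and how
// far through the lifecycle the operator should drive them.
type AIAppSpec struct {
	TargetStage string      `json:"targetStage"` // e.g. "deployed", "monitored"
	Models      []ModelSpec `json:"models"`
}

// AIAppStatus reports how far the operator has actually gotten.
type AIAppStatus struct {
	CurrentStage string `json:"currentStage,omitempty"`
}

// AIApp is the top-level "aiapp" custom resource the operator manages.
type AIApp struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   AIAppSpec   `json:"spec,omitempty"`
	Status AIAppStatus `json:"status,omitempty"`
}
```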

[Figure: Progress of the deployment of a new aiapp instance]

Today we are in Phase 1: creating an operator that can handle basic deployment and removal of models and bring an application to a particular stage in the lifecycle, automating all the steps to get from the current state to the desired state. We will continue to leverage Cloud Pak for Data components such as Watson Studio and OpenScale for each step. We support the ability to plug the details of each unique AI application into the jobs that the operator runs, so that you have control over what happens at each stage in the lifecycle of your aiapp when you need it. For example, at the model build stage, you may want to provide the operator with your own AI model architecture and parameters, or you can let the operator handle this for you by utilizing AutoAI. You can choose which OpenScale monitors make sense for your application, or you can allow the operator to select the most common ones for you.
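
A controller-runtime reconcile loop is a natural way to express “automate all the steps to get from the current state to the desired state”. The sketch below is an assumption-laden outline, not our actual implementation: it reuses the hypothetical types from the earlier snippet plus two hypothetical helpers, nextStage and runStage, that stand in for launching the real Cloud Pak for Data jobs.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical module path for the types sketched above.
	aiv1alpha1 "example.com/aiapp-operator/api/v1alpha1"
)

// AIAppReconciler moves an aiapp one lifecycle stage closer to its target
// stage on each reconciliation.
type AIAppReconciler struct {
	client.Client
}

func (r *AIAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var app aiv1alpha1.AIApp
	if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
		// The aiapp was deleted; nothing to reconcile.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Already at the desired stage: nothing to do until something changes.
	if app.Status.CurrentStage == app.Spec.TargetStage {
		return ctrl.Result{}, nil
	}

	// Work out and run the next step on the path from the current stage to
	// the desired one (data evaluation, model build, validation, ...).
	next := nextStage(app.Status.CurrentStage, app.Spec.TargetStage)
	if err := runStage(ctx, &app, next); err != nil {
		return ctrl.Result{}, err // requeued with backoff by controller-runtime
	}

	app.Status.CurrentStage = next
	if err := r.Status().Update(ctx, &app); err != nil {
		return ctrl.Result{}, err
	}
	// Requeue so the following reconciliation advances the remaining stages.
	return ctrl.Result{Requeue: true}, nil
}

// nextStage is a hypothetical helper: a real implementation would encode the
// full lifecycle graph rather than jumping straight to the target.
func nextStage(current, target string) string { return target }

// runStage is a hypothetical helper standing in for launching the matching
// Cloud Pak for Data job (Watson Studio training, OpenScale setup, and so on).
func runStage(ctx context.Context, app *aiv1alpha1.AIApp, stage string) error { return nil }
```

Advancing one stage per reconciliation keeps each pass short and idempotent, and writing the reached stage into the resource’s status is what lets the cluster surface deployment progress like the figure above.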

As we move into Phase 2, we will add new triggers so that the operator can rebuild and update the deployed models whenever appropriate, improve efficiency by skipping stages when possible, and become more interactive by providing better visibility and control to customers. We will continue to add new capabilities over time, bringing in support for more error detection and handling, metrics, backup/recovery, and scaling for AI applications.

If you’re interested in automation for all the stages of the AI lifecycle, we’d love to hear from you.

This work is being done by Benjamin Herta, Gaodan Fang, Punleuk Oum, Archit Verma, Debashish Saha, and others at IBM Research AI.
