Using Automation to Choose the Right Machine Learning Model for Your Production
A machine learning model, like any other software deliverable, has a lifecycle. The model owner proposes the problem statement for the model predictions, the model developer designs, develops, and deploys the model, and the model validator then tests it. The model approver reviews the model validation outcomes and decides whether to approve or reject the model for use. All of these steps normally happen in a sandbox, or pre-production, environment. Only after the model is approved is it promoted to production.
In this story, we take a step-by-step look at how IBM Watson OpenScale can be used to assess the risk involved in selecting a model for a production environment.
There are multiple environments in which to develop and deploy models, but for the purposes of this document we will use the IBM AI ecosystem: IBM Watson OpenScale, IBM Watson Studio AutoAI, and IBM Watson Machine Learning in an IBM Cloud Pak for Data environment.
IBM Watson Studio AutoAI can generate models for us from training data. From the generated models, a user selects two to deploy to the IBM Watson Machine Learning runtime, and then configures the deployed models with IBM Watson OpenScale. Using the new Model Risk Management functionality in IBM Watson OpenScale, the user assesses the risk and determines which model performs better and can therefore be used in the production environment.
Let’s get going …
Model creation and Deployment
Step 1: Create a project in IBM Cloud Pak for Data, followed by the creation of a deployment space.
Step 2: Create an AutoAI experiment with your training data to generate a set of candidate models.
Step 3: Select two models from the list and save one as a “Test” model (a Gradient Boosting based model) and another as a “Pre-production” model (a Random Forest based model).
Step 4: Promote both models to the deployment space that we created earlier.
Step 5: In the deployment space, create a deployment for each of the models.
As a next step, we’ll evaluate which model deployment is better by using the IBM Watson OpenScale Model Risk Management functionality to assess the risk associated with these two models.
Configuring IBM Watson OpenScale
Launch the OpenScale console from your Cloud Pak for Data cluster. (Hint: the console link is of the form https://<cp4d cluster>/aiopenscale/insights.)
Please follow these topics to configure drift, quality and fairness monitors in OpenScale:
Once the monitors are configured, we are all set to evaluate the model for risk by running the monitors against some test data.
Evaluate the Challenger model
At this point, we have come to the main purpose of our workflow: to evaluate the model risk for the gradient boosting based model.
Let’s use the Fairness monitor to examine how model risk is evaluated. Say that the fairness threshold for the monitored group, Female, is set to 98%. This means that the Female population should receive at least 98% of the favourable outcomes that their counterparts, in this case Male, receive. Any fairness outcome below 98% for Female is considered a breach, meaning that the model does not provide fair outcomes for the monitored group. For quality measures, if the Area under ROC for the model is computed to be less than the specified threshold, the model is performing poorly in terms of its quality. Likewise, using the drift monitor, OpenScale computes model drift and data drift.
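To make the fairness arithmetic concrete, here is a minimal, hypothetical sketch in plain Python (not an OpenScale API): the favourable-outcome rate of the monitored group is divided by that of the reference group, and the result is compared against the 98% threshold. The record layout and numbers are illustrative only.

```python
def favourable_rate(records, group):
    """Fraction of records in `group` that received the favourable outcome."""
    outcomes = [r["favourable"] for r in records if r["sex"] == group]
    return sum(outcomes) / len(outcomes)

def fairness_score(records, monitored="Female", reference="Male"):
    """Favourable-outcome rate of the monitored group relative to the
    reference group, expressed as a percentage."""
    return 100.0 * favourable_rate(records, monitored) / favourable_rate(records, reference)

# Toy scored payload: 10 records per group (illustrative numbers only).
records = (
    [{"sex": "Female", "favourable": i < 8} for i in range(10)]  # 8/10 favourable
    + [{"sex": "Male", "favourable": True} for _ in range(10)]   # 10/10 favourable
)

score = fairness_score(records)   # 80.0
breached = score < 98.0           # True: below the configured 98% threshold
print(f"fairness = {score:.1f}%, breach = {breached}")
```

With 80% of the favourable rate of the reference group, this toy model would breach the 98% threshold and be flagged by the monitor.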
To evaluate the monitors, click the “Go to model summary” link on the model configuration page, which brings you to the following page. Since no evaluations have yet run for this model (we have only just configured it!), the page is empty. Otherwise, it displays the latest evaluation that has been done.
On this page, click the “Actions” menu, select the “Evaluate now” option, choose a CSV file of test data, and then click “Upload and evaluate.”
This step does the following:
- Scores the CSV content against the underlying machine learning model deployment and loads the scored output to the underlying OpenScale DataMart, specifically into the payload logging table. These scored records are used by the OpenScale Fairness monitoring and Drift monitoring to evaluate model fairness and model/data drift.
- Stores the CSV, which contains the ground-truth labels, into the Feedback table belonging to the same OpenScale DataMart. These ground-truth records are used by the Quality monitoring service to evaluate the OpenScale Quality metrics.
- Evaluates the monitors by comparing the model evaluation against the threshold that is set as part of the monitor configuration.
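The threshold comparison in the last step can be pictured with a small, self-contained Python sketch. The metric names, values, and dictionary layout here are hypothetical (not the OpenScale schema): each monitor carries a threshold and a direction, and a metric breaches when it falls on the wrong side.

```python
# Each monitor: (threshold, direction). "min" means the metric must stay at
# or above the threshold; "max" means it must stay at or below it.
THRESHOLDS = {
    "fairness_sex":        (98.0, "min"),  # % favourable outcomes vs. reference group
    "quality_auc":         (0.7,  "min"),  # Area under ROC
    "drift_drop_accuracy": (5.0,  "max"),  # % drop in accuracy
}

def evaluate(metrics):
    """Return the set of metric names that breach their configured threshold."""
    breaches = set()
    for name, value in metrics.items():
        threshold, direction = THRESHOLDS[name]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            breaches.add(name)
    return breaches

# Illustrative evaluation outcome for a "Test" model.
measured = {"fairness_sex": 72.4, "quality_auc": 0.78, "drift_drop_accuracy": 7.97}
print(evaluate(measured))  # fairness and drift breach; quality passes
```

The direction flag matters: fairness and quality are lower-bounded (higher is better), while the drop in accuracy is upper-bounded (lower is better).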
Allow some time for the evaluations to complete. Once they do, the following dashboard displays the latest results.
Notice from the image above that the fairness for both the Age and Sex attributes is well below the set threshold of 98%. The model drift, specifically the drop in accuracy, is 7.97%, which is above the set threshold of 5%. The model quality, on the other hand, is reasonably good at 0.78, above the set threshold of 0.7. So, indeed, our Gradient Boosting “Test” model is not faring well.
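For intuition about a quality figure such as 0.78, the Area under ROC can be computed offline from ground-truth labels and model scores. This is a minimal rank-based (Mann-Whitney) sketch in plain Python, unrelated to OpenScale internals, with toy data:

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a randomly chosen positive outranks a
    randomly chosen negative (ties count half): the Mann-Whitney formulation."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy ground truth and model scores (illustrative only).
labels = [1, 1, 0, 0]
scores = [0.9, 0.5, 0.5, 0.2]
print(area_under_roc(labels, scores))  # 0.875
```

A value of 0.5 means the model ranks positives no better than chance; 1.0 means it ranks every positive above every negative.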
Before evaluating the other model, download the risk evaluation report for this model, a summarized report of the evaluation. To download it, click the same “Actions” menu on the model risk management summary dashboard and then select “Download report PDF.” A sample evaluation report can be found here.
Evaluate the Pre-Production model
For the Random Forest based model, which is yet to be configured with OpenScale and evaluated, follow the same set of steps that we performed above for the Gradient Boosting based model.
Once you have run the risk evaluation for this model, you will see the results shown below. Notice that the fairness metric, although still below the threshold, is much better than for the Gradient Boosting based model. Quality is also better at 0.97, and there is no drop in accuracy.
This seems to be the better model to promote to production.
Now let’s compare these results with the Gradient Boosting based “Test” model. To do so, click the “Actions” menu on the model summary page for the Random Forest model, and then click “Compare.” In the drop-down list, select the Gradient Boosting based “Test” model deployment. This compares the metrics of both models and shows which one is faring better, with a green colour-coded indicator marking the better metric. Here you can clearly see that the Random Forest based model performs better on the attributes we care about.
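The side-by-side comparison the UI performs amounts to picking, per metric, the model on the better side of the scale. Here is a small hypothetical Python sketch of that logic; the model names and metric values are illustrative stand-ins, not data read from OpenScale.

```python
HIGHER_IS_BETTER = {
    "fairness_sex": True,          # closer to parity with the reference group
    "quality_auc": True,           # larger Area under ROC
    "drift_drop_accuracy": False,  # smaller drop in accuracy
}

def compare(model_a, model_b):
    """For each metric, return the name of the better-scoring model."""
    winners = {}
    for metric, higher_better in HIGHER_IS_BETTER.items():
        a, b = model_a["metrics"][metric], model_b["metrics"][metric]
        a_wins = a > b if higher_better else a < b
        winners[metric] = model_a["name"] if a_wins else model_b["name"]
    return winners

test_model = {"name": "gb-test",
              "metrics": {"fairness_sex": 72.4, "quality_auc": 0.78,
                          "drift_drop_accuracy": 7.97}}
preprod = {"name": "rf-preprod",
           "metrics": {"fairness_sex": 91.0, "quality_auc": 0.97,
                       "drift_drop_accuracy": 0.0}}

print(compare(test_model, preprod))  # the pre-production model wins every metric here
```

With numbers like these, the Random Forest pre-production model wins on every metric, which mirrors the green indicators in the comparison view.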
Based on the results above, let’s promote this model for production usage.
Doing so displays the confirmation in the Summary page:
Configuring and evaluating the Production model
In the preceding steps, we identified the pre-production subscription whose monitor outcomes are better. The same model can now be used in the production environment.
For this, we need to perform the preceding set of steps again, specifically the creation of a production project, to create the production model and promote it to the production deployment space.
Once the model is deployed in the production space, switch back to the OpenScale UI to bind that production deployment space with OpenScale. While binding the deployment space, make sure you select the machine learning provider and mark its space as “Production” this time, as in the following screen capture.
From this machine learning provider, select the production model to be subscribed to OpenScale.
Here comes the key part:
Apply the configuration of the pre-production subscription to the production subscription. There is no need to configure the production model again.
To do this, in the model configuration page, click “Import settings.”
In the “Import configuration settings” page, select the Random Forest pre-production deployment and click “Configure.”
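Conceptually, “Import settings” copies the monitor configuration (thresholds, monitored groups, and so on) from the pre-production subscription onto the production one. This plain-Python sketch of that idea is purely illustrative; these dictionaries are not the real OpenScale subscription objects.

```python
import copy

# Hypothetical pre-production subscription with its configured monitors.
preprod_subscription = {
    "deployment": "rf-preprod",
    "monitors": {
        "fairness": {"attribute": "Sex", "monitored": "Female", "threshold": 98.0},
        "quality":  {"metric": "area_under_roc", "threshold": 0.7},
        "drift":    {"drop_in_accuracy_threshold": 5.0},
    },
}

# Freshly created production subscription, not yet configured.
prod_subscription = {"deployment": "rf-prod", "monitors": {}}

# "Import settings": deep-copy the monitor configuration so the production
# subscription is evaluated against exactly the same thresholds.
prod_subscription["monitors"] = copy.deepcopy(preprod_subscription["monitors"])
print(prod_subscription["monitors"]["fairness"]["threshold"])  # 98.0
```

The deep copy matters: later tweaks to the production thresholds should not silently change the pre-production configuration, and vice versa.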
Importing the configuration from a pre-production subscription to a production subscription at the click of a button … pretty cool, right?
We are all set to evaluate the production model. To do so, score the production model, which adds the payload logging data, and then add the feedback data. Go to the “Actions” menu on the Model Summary page of the production subscription and click “Evaluate now.”
Evaluation takes some time to complete, and then you are presented with the model risk summary page for the production model:
Now we are all set!
You can try this for yourself with your own data sets by following the steps detailed above:
- Create a pre-production project and a pre-production deployment space.
- Use training data to create an AutoAI experiment, which generates a set of models.
- Select two models and save them as a “Test” model and a “Pre-Production” model.
- Promote these models to the pre-production deployment space and create online deployments for them.
- Bind the pre-production deployment space with OpenScale.
- Subscribe the “Test” model and perform an evaluation.
- Subscribe the “Pre-Production” model and perform an evaluation.
- Compare both evaluations and notice that the “Pre-Production” subscription is the better candidate for promotion to production.
- Create a production project and a production space, and run the AutoAI experiment in the project.
- Select a production model and create a deployment.
- Onboard this production space with OpenScale.
- Import the settings from the pre-production subscription to the production subscription.
- Score the production model and perform an evaluation.
The list of models and various notebooks to configure IBM Watson OpenScale can be found here.
Great job! That’s all for now!