How to Perform Semi-AutoML When AutoML is Not Enough

Yuce Dincer
IBM Data Science in Practice
5 min read · Jul 13, 2021

This post is the first part of a two-part series on how data scientists and data enthusiasts can leverage semi-automated machine learning capabilities to boost the productivity of their day-to-day work.

“Machine Learning”, “Artificial Intelligence”, “Data Science”, “Digital Transformation”, etc…

We hear these words almost daily from our clients, some of the largest financial institutions in the world. Their interest is in infusing machine learning into every possible corner of their business, and at the center of that interest is making data more accessible. Businesses are seeking an alternative universe where their employees are empowered to utilize data, citizen data scientists can get started with it quickly, and expert data scientists can speed up their experimentation time.

AutoML solutions have been gaining popularity for exactly that reason. In this post, I will show how you can interact with Watson Studio’s AutoML solution, AutoAI, to further customize your models beyond what AutoAI has done for you.

Before I get into it, I should mention that the code you’ll see below assumes the reader has access to Watson Studio (you can sign up here), has run an AutoAI experiment, and has exported the notebook of the final machine learning pipeline. If you’re not familiar with this process, please refer to this post or the documentation to learn how to get to that point in the experiment and export the pipeline source code.

Within AutoAI, technical users can take what AutoAI created for them to the next level by interacting with it through the AutoAI Lale library, and that’s what we will be using.

For this article, I’m using the Health Insurance Cross Sell Prediction dataset to build my AutoAI experiment, and it can be found here. Here’s a preview of the dataset:

With this dataset, we’ll be building a machine learning model for a binary classification problem. In particular, we will create a model to predict whether clients of an insurance company who hold a health insurance policy would also be interested in vehicle insurance.

Pipeline from the AutoAI experiment with a column selector at the root. One path goes numpy-column-selector to compress-strings to numpy-replace-missing-values to numpy-replace-unknown-values to boolean2float to cat-imputer to cat-encoder to float32-transform; the other path goes numpy-column-selector to float-str2-float to numpy-replace-missing-values to num-imputer to opt-standard-scaler to float32-transform. The paths converge at concat-features and end at XGB-classifier.
The current pipeline from our AutoAI experiment

What you see above is a visualization of the pipeline flow generated by the AutoAI experiment. In case you are not already aware, this flow chart is also present in your notebook, and clicking on its nodes takes you to their respective sections, as you can see in the official documentation.

Scenario

Let’s think of a scenario where the user ran the experiment but wants to perform feature selection to optimize the model. AutoAI provides the option to do feature selection in the UI, but only before the experiment runs. You can re-run the experiment from the UI with the features you’d like to include in your model, but that may not be time-efficient if you’re working with a dataset with many features.

This is where AutoAI shines: it lets you re-run the AutoAI experiment right from the notebook after you make the changes you want to the machine learning pipeline, bypassing the need to go back to the UI and restart the process.


Pipeline Source Code

When you run the code below in your notebook, it will inject the source code of your chosen machine learning pipeline into a new cell, where you will see the details of the modules from the pipeline visualization I shared above:
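Something like the snippet below is what that cell amounts to. This is a minimal sketch: it assumes pipeline_model is the Lale pipeline object that the exported notebook has already retrieved for your chosen pipeline, and it uses Lale’s pretty_print() together with IPython’s set_next_input() to drop the generated source into a fresh cell.

```python
# Minimal sketch: `pipeline_model` is assumed to be the Lale pipeline object the
# exported AutoAI notebook has already retrieved for the chosen pipeline.
from IPython import get_ipython

source_code = pipeline_model.pretty_print()   # Python source of the pipeline as a string
get_ipython().set_next_input(source_code)     # inject it into a new notebook cell
```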

The code above will inject the pipeline source code into a new cell in the notebook.
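The exact operators and hyperparameters in that injected code come from your own experiment, so I won’t reproduce them here. To give a sense of its shape, here is a heavily abridged sketch that uses generic scikit-learn stand-ins (wrapped by Lale) instead of the autoai_libs transformers from the figure: two preprocessing branches run in parallel, their outputs are concatenated, and the result feeds an XGBoost classifier.

```python
# Abridged sketch of the injected pipeline source. The real generated code uses
# autoai_libs transformers and experiment-specific hyperparameters; these sklearn
# operators are stand-ins to show the overall shape only.
from lale.lib.lale import ConcatFeatures
from lale.lib.sklearn import OneHotEncoder, SimpleImputer, StandardScaler
from lale.lib.xgboost import XGBClassifier

categorical_branch = (
    SimpleImputer(strategy="most_frequent") >> OneHotEncoder(handle_unknown="ignore")
)
numeric_branch = SimpleImputer(strategy="mean") >> StandardScaler()

# Run both branches in parallel, concatenate their features, and classify.
pipeline = (categorical_branch & numeric_branch) >> ConcatFeatures() >> XGBClassifier()
```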

This long notebook cell ends with a combined pipeline made up of the steps of the pipeline you selected from the AutoAI experiment. That pipeline is the last variable, named “pipeline”, in the code block above. This final variable represents the pipeline that produces the predictions we’re looking for, so once we add our own changes to it and re-run it, we will have accomplished our semi-automated AutoAI experiment.

Transformed Data

For this example, we’re going to work on feature selection before we re-run our experiment. Our first step is to bring back the data from the experiment and reproduce the transformed dataset:
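Here is a sketch of that first step, assuming the Health Insurance Cross Sell Prediction CSV is available in the notebook environment as train.csv; the target column Response, the holdout size, and the random seed are assumptions chosen for illustration and should be matched to your own experiment settings.

```python
# Sketch: re-create the raw train/test split used by the experiment.
# File name, target column, split size, and seed are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                 # Health Insurance Cross Sell Prediction data
X = df.drop(columns=["Response"])             # "Response" is the binary target
y = df["Response"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=33
)
```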

The code block above helps us retrieve the raw training data we used to train our model and the raw testing data we used to validate our results.

With the next code block, we retrieve the transformed train and test data from our pipeline by exporting it as an sklearn pipeline, which makes the data easy to access. The optimizer outputs a Lale pipeline by default, which makes retrieving the data harder. Because the transformation drops the column names, we restore them using the column names from the feature transformation module TA1.
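A sketch of that retrieval might look like the following. It assumes pipeline_model is the trained Lale pipeline, that scikit-learn is recent enough (0.21+) for pipeline slicing, and that the second-to-last step of the exported pipeline is the TA1 feature-engineering transformer; the column_names_ attribute is a hypothetical name, so the code falls back to generic positional names if it is absent.

```python
# Sketch: export the Lale pipeline to sklearn, transform the data, and rebuild
# a DataFrame with readable column names.
import pandas as pd

sklearn_pipeline = pipeline_model.export_to_sklearn_pipeline()

preprocessing = sklearn_pipeline[:-1]              # everything except the final estimator
X_train_transformed = preprocessing.transform(X_train)
X_test_transformed = preprocessing.transform(X_test)

# Try to recover column names from the feature-engineering (TA1) step; the
# attribute name below is hypothetical, so fall back to positional names.
ta1_step = sklearn_pipeline.steps[-2][1]
if hasattr(ta1_step, "column_names_"):
    feature_names = list(ta1_step.column_names_)
else:
    feature_names = [f"f{i}" for i in range(X_train_transformed.shape[1])]

pd_output = pd.DataFrame(X_train_transformed, columns=feature_names)
pd_output.head()
```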

Feature Importance

Once we have brought the transformed dataset back from our pipeline and put it into the same format we originally fed to AutoAI, we can compute the feature importance (this could also be done at the very beginning):
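A minimal sketch of that step, assuming pd_output holds the transformed training features from the previous block and that the final step of the exported sklearn pipeline is the trained XGBoost classifier exposing feature_importances_:

```python
# Sketch: build a sorted feature-importance table from the trained estimator.
import pandas as pd

estimator = sklearn_pipeline.steps[-1][1]          # the trained XGB classifier

feature_importance = (
    pd.DataFrame(
        {"feature": pd_output.columns, "importance": estimator.feature_importances_}
    )
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
feature_importance.head(20)
```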

Once you have the feature importance (which is very useful for model explainability), you can make an educated decision about which features to select. Feature selection is done to eliminate irrelevant or noisy features, reduce complexity, or improve the model’s performance. This is use-case specific, so I will keep all the features for this demonstration.

As you can see from the feature importance dataframe we just created, 10 of the top 20 features are AutoAI-engineered features, which helped improve model accuracy.

Updating the Pipeline

Since we now know the feature importance scores of our features and have the AutoAI-processed dataframe with the new features, we can use the column selector operator of the Lale library to declare which columns we want to include in the pipeline we’re updating. For this example, I’m including all features with list(pd_output.columns), but you can include or exclude any columns you wish:
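A small sketch of that selection, assuming pd_output is the transformed-feature DataFrame built earlier; since the data flows between pipeline steps as a numpy array, the positional indices of the selected columns are kept as well:

```python
# Keeping every column for this walkthrough; swap in any subset you like.
selected_columns = list(pd_output.columns)

# The transformed data moves between steps as a numpy array, so keep the
# positional indices of the selected columns for the column selector.
selected_indices = [pd_output.columns.get_loc(col) for col in selected_columns]
```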

Once the columns are selected, we just need to update our original pipeline by adding the column selector module at the very end, right before our estimator, which in this case is an XGB Classifier. That way, once our data has gone through the pipeline steps and is ready to be consumed by the estimator, only the selected features of the transformed dataset are passed along:
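Here is a sketch of the updated pipeline. It assumes pipeline_model is the trained Lale pipeline, uses Lale’s remove_last() and freeze_trainable() to keep the preprocessing prefix as-is, and uses Lale’s Project operator as the column selector (assuming it accepts a list of positional indices); the estimator is re-declared so it can be tuned in the next step.

```python
# Sketch: keep the trained preprocessing prefix, add a column selector, and
# put a fresh (tunable) XGBoost classifier at the end.
from lale.lib.lale import Project
from lale.lib.xgboost import XGBClassifier

preprocessing_prefix = pipeline_model.remove_last().freeze_trainable()

new_pipeline = (
    preprocessing_prefix
    >> Project(columns=selected_indices)   # select features right before the estimator
    >> XGBClassifier()
)
```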

Refitting the Model

With our pipeline updated, we can now use the Hyperopt library to fit our data, perform automatic hyperparameter optimization, and finally get our predictions:
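A sketch of that final step, using Lale’s Hyperopt wrapper; the number of folds, evaluations, and the scoring metric are arbitrary choices for illustration:

```python
# Sketch: tune and refit the updated pipeline, then score the holdout data.
from lale.lib.lale import Hyperopt

optimizer = Hyperopt(estimator=new_pipeline, cv=3, max_evals=10, scoring="roc_auc")
trained_pipeline = optimizer.fit(X_train.values, y_train.values)

predictions = trained_pipeline.predict(X_test.values)
```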

Fin!

With this example, we learned how to modify a machine learning pipeline generated by Watson Studio’s AutoAI. We ran the AutoAI experiment, exported our pipeline, exported our transformed dataset, selected the features we wanted to include in our new pipeline, and re-trained our model. As mentioned in the beginning, in situations where we need flexibility around the model we’re building, the AutoAI Lale library helps us get the results we’re looking for.

For the reasons I shared above, automated machine learning model building solutions have become an area of interest for many. These solutions might be enough for someone with little to no experience in data science, but for someone who’s looking to squeeze every ounce of performance out of a machine learning model, the ability to refine a model becomes an important feature when evaluating an AutoML solution.

Part 2!

In part 2 of this post we will showcase another semi-automated approach to AutoML. Stay tuned!

Special thanks to Catherine Cao (Data Scientist, IBM) for her contributions to the code.
