How to Perform Semi-AutoML When AutoML is Not Enough
This post is the first part of a two-part series on how data scientists and data enthusiasts can leverage semi-automated machine learning capabilities to boost the productivity of their day-to-day work.
“Machine Learning”, “Artificial Intelligence”, “Data Science”, “Digital Transformation”, etc…
We hear these words almost daily from our clients, who include some of the largest financial institutions in the world. They are interested in infusing machine learning into every possible corner of their business, and making data more accessible to use sits at the center of this interest. Businesses are seeking an alternative universe where their employees are empowered to utilize data, where citizen data scientists can quickly get started with it, and where expert data scientists can speed up their experimentation time.
AutoML solutions have been gaining popularity for exactly that reason. In this post, I will show how you can interact with Watson Studio’s AutoML solution, AutoAI, to further customize your models beyond what AutoAI has done for you.
Before I get into it, I should mention that for the code you’ll see below, I will assume the reader has access to Watson Studio (you can sign up here), has run an AutoAI experiment, and has exported the notebook of the final machine learning pipeline. If you’re not familiar with this process and want to learn how to get to this point in the experiment and export the pipeline source code, please refer to this post or the documentation.
Within AutoAI, technical users have the option to take what AutoAI created for them to the next level by interacting with it through the Lale library, and that is what we will be using.
For this article, I’m using the Health Insurance Cross Sell Prediction dataset to build my AutoAI experiment, and it can be found here. Here’s a preview of the dataset:
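To give a sense of what the data looks like, here are a few illustrative rows mirroring the schema of the Kaggle Health Insurance Cross Sell Prediction dataset (the column names come from the dataset; the values below are made up for this sketch):

```python
import pandas as pd

# Illustrative rows following the dataset's schema; values are invented here.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [44, 23, 47],
    "Driving_License": [1, 1, 1],
    "Previously_Insured": [0, 1, 0],
    "Vehicle_Age": ["> 2 Years", "< 1 Year", "1-2 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
    "Annual_Premium": [40454.0, 27496.0, 38294.0],
    "Response": [1, 0, 1],  # target: interested in vehicle insurance?
})
print(sample.head())
```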
With this dataset, we’ll be building a machine learning model for a binary classification problem. In particular, we will create a model that predicts whether clients of an insurance company who hold a health insurance policy would also be interested in insurance for a vehicle.
What you see above is the visualized version of the pipeline flow generated by the AutoAI experiment. In case you are not already aware, this flow chart is also present in your notebook, and clicking on its nodes will take you to their respective sections, as shown in the official documentation.
Scenario
Let’s consider a scenario where the user has run the experiment but wants to perform feature selection to optimize the model. AutoAI provides the option to do feature selection in the UI, but only before the experiment runs. You can re-run the experiment from the UI with the features you’d like to include in your model, but doing so may not be time-efficient if you’re working with a dataset that has many features.
This is where AutoAI shines: it lets you re-run the AutoAI experiment right from the notebook after you make the changes you want to the machine learning pipeline, bypassing the need to go back to the UI to restart the process.
Pipeline Source Code
When you run the code below in your notebook, it will inject the source code of your chosen machine learning pipeline into a new cell, where you will see the details of the modules in the pipeline visualization I shared above:
This long notebook cell ends with a combined pipeline made up of the steps of the pipeline you selected from the AutoAI experiment, held in the last variable, named “pipeline”. This final variable represents the pipeline that produces the predictions we’re looking for, so once we add our own changes to it and re-run it, we will have accomplished our semi-automated AutoAI experiment.
Transformed Data
For this example, we’re going to work on feature selection before we re-run our experiment. Our first step is to bring back the transformed dataset from the experiment:
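As a runnable illustration of this step, here is the same idea with a plain scikit-learn pipeline (a Lale pipeline can similarly drop its final step, e.g. via `remove_last()`; the pipeline and step names below are illustrative, not the ones AutoAI generates). Everything except the final estimator is the preprocessing prefix, and calling `transform` on that prefix recovers the transformed training data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A stand-in for the exported AutoAI pipeline: preprocessing + estimator.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("estimator", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X, y)

# Slice off the final estimator; what remains is the transformer prefix.
prefix = pipeline[:-1]
pd_output = pd.DataFrame(prefix.transform(X))
print(pd_output.shape)  # one row per training sample
```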
Feature Importance
Once we pull the transformed dataset back out of our pipeline and put it in the original format we fed to AutoAI, we can compute the feature importance from there (this could also have been done at the beginning).
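A minimal sketch of this step, assuming `pd_output` is the transformed dataframe from above (synthesized here so the snippet is self-contained): fit the final estimator on the transformed features and read its importance scores. The article’s experiment uses an XGBClassifier; a scikit-learn gradient-boosted model stands in here to keep the sketch dependency-free.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the transformed AutoAI output; column names are illustrative.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
pd_output = pd.DataFrame(X, columns=[f"NewFeature_{i}" for i in range(6)])

# Fit an estimator on the transformed data and rank its importance scores.
est = GradientBoostingClassifier(random_state=0).fit(pd_output, y)
fi = (
    pd.DataFrame({"feature": pd_output.columns,
                  "importance": est.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(fi)
```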
Once you have the feature importance (which is very useful for model explainability), you can make an educated decision about which features to select. Feature selection is done to eliminate irrelevant or noisy features, to reduce complexity, or to improve the model’s performance. This is use-case specific, so I will keep all the features for this demonstration.
As you can see from the feature importance dataframe we just created, 10 of the top 20 features are AutoAI-engineered features, which helped improve model accuracy.
Updating the Pipeline
Since we now know the feature importance scores of our features, and have the AutoAI-processed dataframe above with the new features, we can utilize the column selector operator of the Lale library and declare which columns we want to include in the pipeline we’re updating. For this example, I’m including all features by saying list(pd_output.columns), but you can include or exclude any columns in the code that you wish.
Once the columns are selected, we just need to update our original pipeline by adding the column selector module at the very end, right before our estimator, which in this case is an XGBClassifier. That way, once our data has gone through the steps of the pipeline and is ready to be consumed by the estimator, only the selected features of the transformed dataset are passed along.
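In the exported notebook this composition is done with Lale’s column selector (the `Project` operator, chained with Lale’s `>>` combinator). As a self-contained stand-in, the same idea with scikit-learn, using an illustrative `ColumnTransformer` as the selector right before the estimator (again with a gradient-boosted sklearn model in place of the XGBClassifier):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Stand-in for the transformed AutoAI output; column names are illustrative.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
pd_output = pd.DataFrame(X, columns=[f"NewFeature_{i}" for i in range(6)])

# Keeping all columns here, as in the article; swap in any subset you want.
selected = list(pd_output.columns)

new_pipeline = Pipeline([
    # The column selector sits right before the estimator.
    ("select", ColumnTransformer([("keep", "passthrough", selected)])),
    ("estimator", GradientBoostingClassifier(random_state=0)),
])
new_pipeline.fit(pd_output, y)
print(new_pipeline.score(pd_output, y))
```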
Refitting the Model
With our pipeline updated, we can now use the Hyperopt library to fit our data, perform automatic hyperparameter optimization, and finally get our predictions.
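In the AutoAI notebook this step uses Lale’s Hyperopt optimizer (along the lines of `pipeline.auto_configure(X, y, optimizer=Hyperopt, ...)`). As a runnable, library-agnostic sketch of the same idea, here is a randomized hyperparameter search with scikit-learn; the parameter grid is illustrative, not AutoAI’s actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Randomized search standing in for Lale's Hyperopt-based auto_configure.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.05, 0.1, 0.2],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# The refit best model yields our final predictions.
predictions = search.predict(X)
print(search.best_params_)
```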
Fin!
With this example, we learned how to modify a machine learning pipeline generated by Watson Studio’s AutoAI. We ran the AutoAI experiment, exported our pipeline, exported our transformed dataset, selected the features we wanted to include in our new pipeline, and re-trained our model. As mentioned in the beginning, in situations where we need flexibility around the model we’re building, the Lale library helps us get the results we’re looking for.
For the reasons I shared above, automated machine learning solutions have become an area of interest for many. These solutions might be enough for someone with little to no data science experience, but for someone looking to squeeze every ounce of performance out of a machine learning model, the ability to refine a model becomes an important criterion when evaluating an AutoML solution.
Part 2!
In part 2 of this post we will showcase another semi-automated approach to AutoML. Stay tuned!
Special thanks to Catherine Cao (Data Scientist, IBM) for her contributions to the code.