How to Perform Guided ML When AutoML is Not Enough Part 2

Catherine Cao
IBM Data Science in Practice
7 min read · Aug 18, 2021

This post serves as a follow-up to our discussion of AutoAI, which provides a fully automated approach to building Machine Learning (ML) pipelines.

Photo by Ulises Baga on Unsplash

In our previous post, we discussed how to create and customize the notebook generated by AutoAI using the Insurance Cross Sell Prediction dataset (see Part 1 of the series by Yuce Dincer). The hypothetical scenario we are continuing with in this post is that an insurance company wants to identify which of its current health insurance customers would also be interested in the vehicle insurance policy it offers.

In this post, I will use the same dataset from the previous post and showcase a guided ML approach with Modeler Flow. Modeler Flow is a visual modeling tool that, on its own, allows users to perform data analysis and build ML models end to end. I will also discuss how users can flexibly use some of its functionality to make the AutoAI workflow more efficient.

the source node connects to the type node, which then goes to a table, 12 fields, and two response nodes. One response node connects to an auto data prep node that connects to a table and a partition node. This partition node connects to multiple response nodes that eventually all produce an analysis node.
High-level view of the Modeler Flow end-to-end

For this overview, I am using Modeler Flow in IBM Cloud Pak for Data V3.5. If you would like to follow along, I assume you already have an analytics project created in your Cloud Pak for Data account. Basic modeling knowledge is recommended but not required.

A shot of choosing an asset type in Cloud Pak for Data that includes Modeler Flow, Connection, Dashboard, Metadata Import, Connected data, Model from file, data refinery flow, autoAI experiment, federated learning, and decision optimization.
Modeler Flow is now a part of Cloud Pak for Data

Import training data

Modeler Flow allows users to quickly develop machine learning models by dragging and dropping “nodes” onto a canvas. Each node represents a type of data operation or a machine learning technique. When connected together, the nodes form a stream. When users run or execute the stream, data flows from one node to the next, and each operation is applied to the data.

In Modeler Flow, nodes are classified into different tabs based on their functionality. I used the Data Asset node under the Import tab and pointed it to train_sample.csv, a random sample of 10k records from the original Train.csv. I trimmed down the data size for a quicker turnaround during this experimentation.
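
If you would like to create a similar sample yourself, a minimal pandas sketch could look like the following; the file names mirror the ones mentioned above, and the random seed is arbitrary.

```python
import pandas as pd

# Load the full Kaggle training file and draw a reproducible 10k-row sample.
train = pd.read_csv("Train.csv")
train_sample = train.sample(n=10_000, random_state=42)

# Save the trimmed file so it can be added to the project as a data asset.
train_sample.to_csv("train_sample.csv", index=False)
print(train_sample.shape)
```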

a table with the columns “AutoAI — insurance” and “Data assets”. The AutoAI column has the items “Assets(2)”, “Connections”, and “Data assets”. The “Data assets” column has “data assets(2)”, “AutoAI Notebook Customization”, and “train.csv”. “connections” points to “autoAI Notebook customizations” and the “data assets” in the “AutoAI-Insurance” column points to “train.csv”

Defining Data Types

The first node past the Source node is the Type node. It is a perfect opportunity to take a first glance at the columns in your dataset. Here, you can review the column names, measures (data types), each column’s role in the model, and typical values in each column.

Modeler Flow automatically profiles the data types based on the column values it sees. But you can change the data types if needed in the Type node.

screenshot of what shows when you look at the type node. it shows settings such as default mode and type operations.
Type Node

Defining the right data type for each column is critical because it affects downstream steps such as handling missing values, feature scaling, and more. For example, a reasonable strategy for handling missing numerical values is to impute the mean, but the same strategy does not make sense for categorical variables. We therefore need to differentiate features by their measurement level and data type so that features of the same type can be handled properly as a batch in the pipeline.
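
Modeler Flow handles this internally, but purely to illustrate the idea, here is a minimal scikit-learn sketch of treating the two measure levels as separate batches; the column names come from the cross-sell dataset and the file is the sample created earlier, so treat both as illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.read_csv("train_sample.csv")

numeric_cols = ["Age", "Annual_Premium", "Vintage"]
categorical_cols = ["Gender", "Vehicle_Age", "Vehicle_Damage"]

imputer = ColumnTransformer([
    # Continuous fields: replace missing values with the column mean.
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    # Categorical fields: replace missing values with the most frequent category.
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

clean = imputer.fit_transform(df[numeric_cols + categorical_cols])
```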

Feature Importance

Selecting a subset of features based on their importance towards predicting the target variable is one method of feature selection. To learn more about this, check out this post by Machine Learning Mastery for a great summary of Feature Selection techniques.

Feature selection can be done very quickly using the Feature Selection node, which ranks all the input variables by their importance.

feature selection output table. the features it lists as important are vehicle_damage, previously_insured, policy_sales_channel, vehicle_age, region_code, age, and gender.
Feature Selection Node Output

According to the output, seven out of ten input variables are identified as important, and we will select those for the modeling.
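
Outside of Modeler Flow, you could approximate this kind of ranking with a quick sketch like the one below; it uses a random forest’s feature importances rather than the Feature Selection node’s own statistics, and the column names (including the id column) are assumptions based on the dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("train_sample.csv")
y = df["Response"]
X = df.drop(columns=["id", "Response"], errors="ignore")

# Simple numeric encoding for string columns so the forest can consume them.
X = X.apply(lambda col: pd.factorize(col)[0] if col.dtype == "object" else col)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
ranking = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking)
```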

Auto Data Prep

The Automated Data Preparation node is one of my favorite nodes in Modeler Flow. It analyzes the data and, based on that analysis, it will:

  • Filter out non-informative fields,
  • Derive new attributes when appropriate, for example by extracting new features from date or time fields,
  • Prepare input and output fields for modeling, including adjusting data types, handling missing values and outliers, and feature scaling,
  • Perform feature engineering such as merging sparse categories, binning continuous fields, and feature selection.

The dataset we are working with is quite clean, so we just need the Auto Data Prep node to rescale the continuous features and encode the categorical variables.

the dataset refactored: gender is changed from strings of “male” and “female” to 1 and 0, age is transformed from actual ages to scaled real numbers with several decimal places, and region codes are reindexed
Before vs After Auto Data Prep: string categorical variables (Gender) are encoded, continuous variables (Age) are rescaled and numeric categorical variables (Region) are reindexed.
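
For reference, a rough scikit-learn equivalent of the three transformations shown above might look like this; the column names follow the screenshots, and the specific encoder, scaler, and reindexing trick are my choices rather than what Auto Data Prep uses internally.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.read_csv("train_sample.csv")

# Gender: "Male"/"Female" strings -> integer codes.
df["Gender"] = OrdinalEncoder().fit_transform(df[["Gender"]]).ravel().astype(int)

# Age: rescale to zero mean and unit variance.
df["Age"] = StandardScaler().fit_transform(df[["Age"]]).ravel()

# Region_Code: reindex sparse numeric codes to consecutive integers.
df["Region_Code"] = pd.factorize(df["Region_Code"])[0]

print(df[["Gender", "Age", "Region_Code"]].head())
```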

Auto Classification

Now the data is ready for modeling. It is split into 90% training and 10% testing sets by the Partition node.

the settings tab on the partition node which shows the partition field as “partition”, the partitions field with “train and test” selected, the training partition size field as “90” with label “training” and value “1_training”. The testing partition size field as “10” with label “testing” and value “2_testing”. The validation partition size field is greyed out. The total size is shown as 100%
Partition Node settings

An Auto Classifier node is then applied. It runs a number of classification algorithms and presents you with the ones that perform best.

I limited the experiment to these three algorithms: logistic regression, XGB linear, and random forest. The result below shows that the random forest classifier yields the best accuracy and AUC (Area Under Curve) values.

The output table: random forest shows a max profit of 3211.1 (occurring at 9%), a lift of 3.318, an overall accuracy of 94.773%, and an AUC of 0.98; logistic regression shows a max profit of 0, a lift of 2.697, an overall accuracy of 87.793%, and an AUC of 0.862; XGBoost linear shows a max profit of -50 (occurring at 0%), a lift of 2.694, an overall accuracy of 87.793%, and an AUC of 0.861. All three models use 7 fields and have build times of less than 1 second.
Auto Classifier output
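
If you wanted to reproduce a similar comparison in code, here is a minimal sketch of what the Partition and Auto Classifier nodes do here: a 90/10 split followed by a small bake-off on accuracy and AUC. GradientBoostingClassifier stands in for XGB linear, and the quick factorize-based encoding is just for this sketch; column names are assumptions based on the dataset.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train_sample.csv")
y = df["Response"]
X = df.drop(columns=["id", "Response"], errors="ignore")
# Quick numeric encoding of string columns for this sketch.
X = X.apply(lambda col: pd.factorize(col)[0] if col.dtype == "object" else col)

# 90% training / 10% testing, mirroring the Partition node settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_test, model.predict(X_test)):.3f}, "
          f"AUC={roc_auc_score(y_test, proba):.3f}")
```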

If all models are selected, ensemble learning is enabled. In this case, I chose confidence-weighted voting, and if the vote was tied, I chose to break the tie with a random selection.

With confidence-weighted voting, the votes are weighted by the confidence value of each prediction, and a model’s confidence in one outcome counts as one minus that value for the opposite outcome. For example, suppose that for a specific record, Model 1 predicts no with a confidence of 0.8, Model 2 predicts yes with a confidence of 0.4, and Model 3 predicts yes with a confidence of 0.2. The average confidence for no is then (0.8 + 0.6 + 0.8) / 3 ≈ 0.73, and the average confidence for yes is (0.2 + 0.4 + 0.2) / 3 ≈ 0.27, so the final prediction for this record is no.

“ensemble method for flag targets” — “confidence-weighted voting” is shown with the choice of “if voting is tied, select value using”. The choices are “random selection”, “highest confidence”, and “raw propensity”. “random selection” is chosen.
Settings for Ensemble Methods
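
Here is a small illustration of that calculation in code; it follows the worked example above rather than Modeler’s internal implementation, and the function name is my own.

```python
def confidence_weighted_vote(predictions):
    """predictions: list of (label, confidence) pairs, labels 'yes'/'no'."""
    # Convert every vote into a confidence for "yes"; a "no" vote with
    # confidence c contributes 1 - c to the "yes" side.
    yes_scores = [conf if label == "yes" else 1.0 - conf
                  for label, conf in predictions]
    avg_yes = sum(yes_scores) / len(yes_scores)
    return ("yes", avg_yes) if avg_yes >= 0.5 else ("no", 1.0 - avg_yes)

votes = [("no", 0.8), ("yes", 0.4), ("yes", 0.2)]
print(confidence_weighted_vote(votes))  # -> ('no', 0.733...)
```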

The following shows the coincidence/confusion matrix. The model is not producing many 1s (only 5 cases are correctly predicted as 1, as seen in the lower right of the coincidence matrix), which in this case means existing customers who might be interested in another offering. This is not ideal because fewer 1s means fewer customers to cross-sell to.

The “Response x $XF-Response” table: records with an actual response of 0 are predicted as 0 in 850 cases and as 1 in 10 cases; records with an actual response of 1 are predicted as 0 in 124 cases and as 1 in only 5 cases. The cells contain the cross-tabulation of the fields (including missing values); chi-square = 5.528, df = 1, probability = 0.019.
Confusion Matrix based on cut-off point of 0.5

One solution is to decrease the cut-off point. The cut-off point for binary classification is the dividing point at which we decide whether a prediction is positive or not: all predictions with a probability above this threshold are classified as positive (1) and the rest as negative (0). By default, the cut-off point for binary classification in Modeler Flow is 0.5.

Decreasing the cut-off point will increase the number of positive cases (1s), even though it will increase the number of false positives as well. In this case, false positives are customers who might not accept a cross-sell offer. Since our goal is to reach as many prospects as possible, this trade-off is acceptable.

I used the Analysis node to perform error analysis, and it shows that decreasing the cut-off point from 0.5 to 0.315 increases accuracy by two times. Below is the confusion matrix after the adjustment. The number of correctly predicted positive cases (customers who are likely to accept a cross-sell offer) increases from 5 to 75, which makes more business sense because we can now reach more customers.

The “View Output: Response x Adj_pred” table: records with an actual response of 0 are predicted as 0 in 705 cases and as 1 in 155 cases; records with an actual response of 1 are predicted as 0 in 54 cases and as 1 in 75 cases. The cells contain the cross-tabulation of the fields (including missing values); chi-square = 101.148, df = 1, probability = 0.
Confusion Matrix after the adjustment
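
To experiment with cut-off points outside of Modeler Flow, a minimal sketch might look like the following; it assumes a fitted binary classifier with predict_proba plus the held-out X_test and y_test from the comparison sketch above (illustrative names, not Modeler Flow objects).

```python
from sklearn.metrics import confusion_matrix

# `model` is whichever fitted classifier you want to adjust,
# e.g. the random forest trained last in the comparison sketch.
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for cutoff in (0.5, 0.315):
    adjusted_pred = (proba >= cutoff).astype(int)
    print(f"cut-off {cutoff}:")
    print(confusion_matrix(y_test, adjusted_pred))
```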

Comparison with AutoAI

The following chart shows how different aspects of the model building process are supported by Modeler and AutoAI.

AutoML in Modeler vs. AutoAI

I recommend Modeler Flow if you want more control over how your models are built without going too deep into the Python code generated by AutoAI. You will still get a level of automation that expedites the model-building process through automatic data preparation and automatic model selection, among other things.

Since Modeler Flow and AutoAI are both available in Cloud Pak for Data, you can get creative with combining the two solutions. You can use Modeler Flow for exploratory data analysis, use the Feature Selection node to narrow down the scope of your features, and provide AutoAI with only the important ones. You can then examine whether more extensive feature engineering boosts model performance significantly. This technique can be extremely helpful when you have a large number of input variables.

Conclusion

In this article, I walked through how to perform guided ML using Modeler Flow and compared it with the fully automated approach. It is always a good idea to choose the approach that works best for what you want to achieve, and don’t be afraid to experiment with a combination of different tools, as it might help you get things done more efficiently!

Both Modeler Flow and AutoAI are available in Cloud Pak for Data as part of Watson Studio, and I encourage you to try them by signing up here!
