IBM’s AutoAI at work: two real-world applications

Introducing Automated Machine Learning and AutoAI

Álvaro Corrales Cano
9 min read · May 1, 2020

Data became ubiquitous some years ago, and it is now evident that being a data-driven organization is not an option anymore. Whether it is to optimize supply chains and production processes or to launch more powerful customer retention and marketing campaigns, data is meant to become the fuel for decision making — and it already is in many companies. However, it has also become very clear that the amount of data and the demand for data-driven strategies have grown faster than Data Science teams.

The concept of Automated Machine Learning is born out of the need to tackle bottlenecks that result from the scarcity of data skills in many companies. It refers to the automation of many manual processes in the lifecycle of a machine learning model, from data wrangling to production. In other words, Automated Machine Learning is AI for AI. Fully integrated within Cloud Pak for Data and IBM Watson™ Studio, AutoAI is IBM’s automated machine learning tool. AutoAI does in minutes what would typically take whole teams of data scientists days to months, including data preparation, model development, feature engineering, and hyperparameter optimization.

Figure 1 — The role of AutoAI in a typical Data Science project

What we do here

Complementing IBM’s published material on AutoAI (great stuff here and here), in this post we showcase how to use this tool to supplement and better exploit the data scientist’s experience and field knowledge. For the purposes of our analysis, we will be working with the datasets provided in two Kaggle competitions. These relate to industries and real-world problems where AutoAI is most relevant: fraud detection in card payments (link) and identification of vulnerability to malware attacks (link). The use of datasets from Kaggle competitions has two main benefits:

  • First, it provides a fair and objective method to measure the performance of our solutions, as we do not control the evaluation of the results on the Kaggle test set. In other words, Kaggle acts as an independent referee. On top of this, the Kaggle leaderboard lets us measure the quality of our submissions.
  • Second, we can benchmark the tool against real-world datasets that are actually used in their respective industries to identify problems and devise solutions, rather than against ready-made, demo-style datasets.

In both cases, our results suggest that combining field knowledge with the AutoAI model performs better than either approach (manual or pure AutoAI) on its own.

Our methodology is fairly straightforward and consists of the following steps:

  • First, we follow a “standard” data science approach. We apply several combinations of feature engineering techniques with different degrees of complexity: from basic NaN imputation and encoding to domain- and dataset-specific transformations, as well as several machine learning models. When we are happy with our work (most likely after many iterations), we generate and submit our predictions to Kaggle.
  • Second, we load the Kaggle training set into AutoAI and let it do its magic. AutoAI then comes up with four pipelines, consisting of different combinations of feature transformations and hyperparameters for the model that best fits the data. We can choose whichever pipeline we prefer (AutoAI gives us metrics to compare them) and predict the targets of Kaggle’s test set, whose real values we can’t see. There are several ways of doing this: we can save the pipeline as a model on IBM Cloud and feed it the test set, or we can generate a Jupyter notebook and produce the predictions via code; we use the second option (a rough sketch of this step appears after Figure 2 below). After this, we are ready to submit our predictions to Kaggle.
  • Third, we mix the two previous steps: we take the feature engineering and transformations from the first step and apply the machine learning model from the AutoAI pipeline. To make sure that the model does not overfit the data, we can apply early stopping when we train it. After training the model on this data, we are again ready to generate and submit our predictions. While the optimization problem that yielded the pipeline did not include these transformations (it used others instead), in both cases this led to better results on Kaggle.

Figure 2 — AutoAI-generated pipelines
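
As a rough sketch of that notebook route, assuming the chosen pipeline has been saved to disk: the file name and the `id`/`target` column names below are placeholders, not what AutoAI or a given competition actually uses.

```python
import joblib
import pandas as pd

# Load the scikit-learn pipeline exported from the chosen AutoAI candidate
# (file and column names here are placeholders)
pipeline = joblib.load("autoai_best_pipeline.pkl")

# Score the Kaggle test set and write a submission file in the expected format
test_df = pd.read_csv("kaggle_test.csv")
probs = pipeline.predict_proba(test_df.drop(columns=["id"]))[:, 1]

submission = pd.DataFrame({"id": test_df["id"], "target": probs})
submission.to_csv("submission.csv", index=False)
```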

IEEE-CIS Fraud Detection

The IEEE-CIS Fraud Detection competition was organized by the IEEE Computational Intelligence Society in partnership with Vesta, a payment services company. It was open during the second half of 2019, and the aim was to identify fraud in credit card transactions. The problem was therefore framed as a binary classification problem (whether or not a transaction was fraudulent). The training set has 433 columns, including the target, and just over 0.59M rows, where each observation represents a single transaction. The test set has a similar size (though it obviously doesn’t include the target variable). The dataset is highly imbalanced, with only 3% of entries labeled as fraud.

We apply three sets of feature engineering techniques, each adding a further layer of complexity to the previous one. This yields three datasets with increasingly rich feature engineering.

  • The first set is very basic: it drops columns with too many null values, encodes string variables, and fills null values with an integer (-999).
  • The second one encompasses the first set and additionally drops redundant columns, extracts time features from a timestamp variable (day of the week, day of the month, etc.), and separates the decimal and integer parts of the transaction amount. A rough pandas sketch of these first two sets is shown right after this list.
  • The third one builds on top of the second set: it concatenates some columns, adds columns counting the frequency of values of other features, and creates pseudo-identifiers at the individual (person) level rather than the transaction level. This set is largely taken from a notebook posted by one of the winners of the competition (here).
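
As a rough illustration, the first two sets of transformations could look like the snippet below. The column names (`TransactionDT`, `TransactionAmt`) come from the competition data, but the 90% null threshold and the helper function itself are illustrative choices rather than our exact code.

```python
import pandas as pd

def basic_feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    """Sets 1 and 2: drop sparse columns, encode strings, fill NaNs,
    and derive simple time and amount features."""
    df = df.copy()

    # Set 1: drop columns that are mostly null (threshold is illustrative)
    sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.9]
    df = df.drop(columns=sparse_cols)

    # Set 1: label-encode string columns, then fill remaining NaNs with -999
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    df = df.fillna(-999)

    # Set 2: time features from the TransactionDT offset (given in seconds)
    df["day_of_week"] = (df["TransactionDT"] // (3600 * 24)) % 7
    df["hour_of_day"] = (df["TransactionDT"] // 3600) % 24

    # Set 2: split the transaction amount into integer and decimal parts
    df["amt_int"] = df["TransactionAmt"].astype(int)
    df["amt_dec"] = (df["TransactionAmt"] - df["amt_int"]).round(3)

    return df
```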

Given the imbalance of the target variable, we apply random oversampling to all three datasets.
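
A minimal sketch of that step with imbalanced-learn’s RandomOverSampler, assuming `X` and `y` hold the engineered features and the fraud label of one of the three datasets:

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class (fraud) rows until both classes are balanced
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Roughly 97% / 3% before resampling, 50% / 50% after
print(y.value_counts(normalize=True))
print(y_resampled.value_counts(normalize=True))
```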

Once we have our three datasets, we proceed to train several classification algorithms: a Logistic Regression, a Support Vector Machine, a Random Forest, an XGBoost and a LightGBM. We then predict the target values on the Kaggle test set and are ready to submit to Kaggle.
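
A condensed sketch of this training loop is shown below; the hyperparameters are illustrative rather than the exact ones we tuned, and `X_resampled`, `y_resampled` come from the oversampling step above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=300, n_jobs=-1),
    "xgboost": XGBClassifier(n_estimators=500, learning_rate=0.05),
    "lightgbm": LGBMClassifier(n_estimators=500, learning_rate=0.05),
}

for name, model in models.items():
    # ROC-AUC is the competition metric, so we also use it for model selection
    scores = cross_val_score(model, X_resampled, y_resampled,
                             scoring="roc_auc", cv=3, n_jobs=-1)
    print(f"{name}: mean ROC-AUC = {scores.mean():.4f}")
```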

In the case of AutoAI, we just need to upload the original dataset to our project in Watson Studio or Cloud Pak for Data, start a new AutoAI experiment with it, and select the target variable. When the experiment finishes, we can use the criteria and metrics that AutoAI has computed to select the best of the four generated pipelines and predict the target variable on the Kaggle test set.

Figure 3 — Model evaluation of AutoAI’s best pipeline

Finally, we take the LightGBM model generated by AutoAI and manually train it on the three datasets that resulted from our manual feature engineering process, applying early stopping during training.
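
A minimal sketch of this combined step, assuming `X` and `y` hold one of the manually engineered (and oversampled) training sets and `X_test` the equally transformed Kaggle test set; the LightGBM hyperparameters below are placeholders, not the exact ones AutoAI selected.

```python
from lightgbm import LGBMClassifier, early_stopping
from sklearn.model_selection import train_test_split

# Hold out part of the training data to monitor validation AUC
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LGBMClassifier(n_estimators=5000, learning_rate=0.03, num_leaves=256)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric="auc",
    # Stop adding trees once validation AUC has not improved for 100 rounds
    callbacks=[early_stopping(stopping_rounds=100)],
)

# Predicted probability of fraud for the Kaggle test set
test_preds = model.predict_proba(X_test)[:, 1]
```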

Figure 4 below shows the ROC-AUC of the models generated using the three methods described above. The ROC-AUC is the public score that the submission gets in Kaggle, which may differ from the official leaderboard score [1]. For reference, the winning team got an official score of 0.945884 and a public score of 0.961812.

Figure 4 — IEEE-CIS Fraud Detection competition. Dispersion of ROC-AUC for different combinations of FE and ML models

As we can see, the manual process can yield very dispersed and diverse results (at least as many as there are data scientists!). It is the source of some really poor models (ROC-AUC of 0.5) and some really good ones (0.929512). AutoAI’s best pipeline already performs quite well, with a score of 0.900256, but it is still far from the top manual algorithm in a competition where every decimal point matters (the scores of the top three performers fell within the same tenth of a percentage point).

As mentioned above, we achieve the best results when we combine the feature engineering based on domain and dataset knowledge with the model generated by AutoAI. In fact, the AutoAI model applied to the three datasets from the manual feature engineering process outscores the AutoAI submission and all manual models except the top one, with a public score of 0.935207. While higher scores were achieved in the competition, these results already point in one direction: when used in conjunction with domain knowledge, AutoAI provides a much better starting point than building a model from scratch. Furthermore, the AutoAI submission matches (and even outscores) some of the top manual algorithms, which means that even analysts with little to no experience in Data Science can get high-quality results by pressing a button.

Microsoft Malware Prediction

To confirm (or reject) the intuitions from our analysis, we apply the same methodology to a similar classification problem from another Kaggle competition in a different industry: the Microsoft Malware Prediction competition. The challenge took place in the first half of 2019 and was organized by Microsoft. The goal was to predict whether one of its products would be hit by any kind of malware. The dataset provided consisted of 83 columns (including the target) and almost 9M rows, each representing a device that had at some point been hit by some kind of malware (target = 1) or not (target = 0). The test set was slightly smaller, with about 7.8M rows. Unlike in the Fraud Detection competition, the target variable was balanced.

In our manual process, we follow a very similar approach to feature engineering to the one used in the Fraud Detection competition, though in this case we limit the number of engineered datasets to two instead of three. The machine learning models that we test are the same.

In the case of the AutoAI experiment, the engine returns a pipeline with a LightGBM algorithm. We use the whole pipeline, as well as the combination of its model and our feature-engineered datasets, to produce our “AutoAI” and combined sets of predictions. The ROC-AUC (the competition’s scoring metric) of all submissions can be seen in Figure 5 below. For reference, the winning algorithm got an official leaderboard score of 0.67585.

Figure 5 — Microsoft Malware Prediction competition. Dispersion of ROC-AUC for different combinations of FE and ML models

Again, the results of the manual process are dispersed: it yields anything from naive models that assign a probability of 0.5 to every unseen observation to strong models such as an XGBoost classifier with a score of 0.68429. AutoAI at least matches the median manual model (already a fairly complex one), with a score of 0.66181. Once more, the combination of manual feature engineering and the AutoAI model with early stopping forms a powerful tandem, reaching a ROC-AUC of 0.67537. The conclusions are therefore similar to those from the Fraud Detection competition: AutoAI provides a great starting point for analysts with little to no experience in Data Science, and an even more powerful tool for data scientists with domain-specific knowledge and skills.

Conclusion

In this post, we have benchmarked AutoAI, IBM’s best-in-class solution in the emerging space of Automated Machine Learning, against the standard “manual” approach of Data Science teams on two real-world classification problems. In both cases, AutoAI quickly proves to be a powerful tool for an array of profiles with diverse skills:

  • Analysts with little to no experience in Data Science can benefit from high-quality models that are ready to go into production with just one click of a button.
  • Established teams of data scientists can leverage their domain knowledge and expertise and iterate over models starting from AutoAI’s optimized results. This is where AutoAI delivers its best results.

And remember! AutoAI comes as standard with IBM Cloud Pak for Data to be used and scaled across hybrid multi-cloud environments. AutoAI is also available on IBM Cloud through IBM Watson™ Studio.

[1] Kaggle assigns two scores (ROC-AUC in this case) to every submission: a public one and a private one, each computed on a different subsample of the overall test set. These are also different from the subsample over which the final score in the official leaderboard is calculated. At the time of this analysis, both the IEEE-CIS Fraud Detection and the Microsoft Malware Prediction competitions were closed, so our own submissions do not receive official leaderboard scores.

IBM Garage is built for moving faster, working smarter, and innovating in a way that lets you disrupt disruption.

Learn more at www.ibm.com/garage
