Watson AutoAI Series: Part 1 - Introduction
Data science has existed as a field of computing for less than a decade, originating from an amalgam of computer science, mathematics and statistics, and industry domain knowledge. It requires a thorough understanding of data and data pipelines, the use of statistical algorithms, the ability to formulate problems as mathematical optimization models, and the skill to design solutions so that, when deployed, they are scalable, reliable, and operational enough to meet enterprise requirements.
To achieve scalable solutions in data science, certain parts of the data science workflow need to be automated. Let’s take a quick look at the workflow and the operations involved in each step:
Data fusion, data preparation, data generation, and feature transformation together make up what is known as feature engineering. In the fusion stage, the required tables are selected and merged to form the final dataset used for analysis. In the preparation stage, data quality issues are resolved through steps such as missing-value imputation, normalization, and outlier treatment. The processed data is then aggregated across multiple data tables (if any), and categorical variables are label-encoded and augmented for feature transformation. The features are then transformed using functions such as sine or square, or reduced with dimensionality reduction techniques. Once the data has been prepared and molded into a form a machine learning model can consume, it moves to the model selection stage.
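To make the transformation step concrete, here is a minimal scikit-learn sketch: numeric features are expanded with simple functions such as square and sine, then standardized and compressed with PCA. The specific functions and pipeline shape are illustrative assumptions, not AutoAI’s actual search space.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X, _ = load_iris(return_X_y=True)

# append squared and sine copies of every numeric column
expand = FunctionTransformer(lambda x: np.hstack([x, x ** 2, np.sin(x)]))

# standardize the expanded features, then reduce them with PCA
pipeline = make_pipeline(expand, StandardScaler(), PCA(n_components=5))
X_transformed = pipeline.fit_transform(X)
print(X_transformed.shape)  # (150, 5): 4 original columns became 12, then 5
```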
Various surveys have concluded that data scientists spend most of their time on the feature engineering pipeline, which requires a solid understanding of the many techniques that can be applied. This is where IBM’s AutoAI can provide value, by automating complex tasks such as data pre-processing, feature engineering, model selection, and hyperparameter optimization with a single click.
Leading data scientists recognize that feature engineering is challenging: coming up with features is difficult and time-consuming, and it requires expert knowledge. Applied machine learning is, at its core, feature engineering.
AutoAI processing is also termed “rapid prototyping,” as it involves sophisticated algorithms that optimize the process of feature engineering, model selection, and hyperparameter tuning. Let’s dig a little deeper into the different stages of AutoAI.
Data Preprocessing: The first stage in AutoAI is data pre-processing. Real-world datasets frequently have quality issues, so AutoAI imputes missing values using different strategies based on the data type and descriptive statistics of each feature. AutoAI can detect categorical variables and differentiate them from numerical features, and it applies normalization and scaling operations comparable to what a data scientist would perform on the dataset.
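The sketch below shows what such type-aware preprocessing looks like in plain scikit-learn: numeric columns are median-imputed and scaled, while categorical columns are mode-imputed and one-hot encoded. It mirrors the idea described above, not AutoAI’s internal implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [34, None, 52, 29],           # numeric with a missing value
    "income": [48000, 61000, None, 39000],
    "segment": ["a", "b", None, "a"],    # categorical with a missing value
})

numeric = ["age", "income"]
categorical = ["segment"]

preprocess = ColumnTransformer([
    # impute numeric columns with the median, then scale them
    ("num", make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler()), numeric),
    # impute categorical columns with the most frequent value, then encode
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
])
X = preprocess.fit_transform(df)
print(X.shape)
```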
Automated Model Selection: The second stage in AutoAI is automated model selection. The algorithm behind it is the Data Allocation using Upper Bounds (DAUB) strategy from the research paper “Selecting Near-Optimal Learners via Incremental Data Allocation” (https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12524/11835).
In this approach, small subsets of the training data are allocated sequentially to different estimators. The aim is to identify, in a time-optimized fashion, the estimator that will produce near-optimal accuracy when trained on the entire dataset.
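As a rough illustration of this incremental allocation idea, the sketch below trains several candidate estimators on growing slices of the training data and drops candidates that fall clearly behind the leader. The pruning rule here is a crude stand-in for the projected upper bounds used in the DAUB paper, and the candidate set and thresholds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
best_seen = {name: 0.0 for name in candidates}
alive = set(candidates)

# allocate increasingly large slices of the training data to the survivors
for n in (250, 500, 1000, 2000, len(X_train)):
    for name in list(alive):
        estimator = candidates[name].fit(X_train[:n], y_train[:n])
        best_seen[name] = max(best_seen[name], estimator.score(X_val, y_val))
    # crude stand-in for the paper's projected upper bounds: drop any
    # candidate whose best validation score trails the leader by > 0.05
    leader = max(alive, key=best_seen.get)
    alive = {m for m in alive if best_seen[m] >= best_seen[leader] - 0.05}

print("selected estimator:", leader)
```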
Automated Feature Engineering: The third stage in AutoAI is automated feature engineering. The algorithm used for automated feature engineering is based on work from IBM Research, published in:
1. “Cognito: Automated Feature Engineering for Supervised Learning” (https://ieeexplore.ieee.org/abstract/document/7836821)
2. “Feature Engineering for Predictive Modeling using Reinforcement Learning” (https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16564/16719)
The algorithm explores various feature transformations in a hierarchical and non-exhaustive manner, progressively maximizing model accuracy through an exploration-exploitation strategy. In effect, it applies a “trial and error” approach to feature engineering.
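The toy example below captures the trial-and-error flavor: each candidate transformation is kept only if it improves cross-validated accuracy. It is a greedy simplification for illustration, not the hierarchical search that Cognito actually performs, and the transform set is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

def score(features):
    """Cross-validated accuracy of the model on a candidate feature set."""
    return cross_val_score(model, features, y, cv=3).mean()

best_X, best_score = X, score(X)
transforms = {
    "square": np.square,
    "log1p": lambda x: np.log1p(np.abs(x)),
    "sqrt": lambda x: np.sqrt(np.abs(x)),
}

# greedily try each transform; keep the expanded feature set only if it helps
for name, fn in transforms.items():
    candidate = np.hstack([best_X, fn(best_X)])
    s = score(candidate)
    if s > best_score:
        best_X, best_score = candidate, s
        print(f"kept {name}: accuracy {s:.4f}")
    else:
        print(f"dropped {name}: accuracy {s:.4f}")
```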
Hyperparameter Optimization: The final stage in AutoAI is hyperparameter optimization. In this step, the hyperparameters of the models chosen in the previous stage are tuned for best performance. Like grid search methods, the algorithm explores ranges of parameter values, but it does so using an optimizer called RBFOpt, an open-source library for black-box optimization designed for problems where each function evaluation (here, a model training run) is expensive.
More details on RBFOpt can be found here: http://www.optimization-online.org/DB_HTML/2014/09/4538.html.
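For a feel of how RBFOpt drives such a search, here is a hedged sketch that tunes two random forest hyperparameters with the open-source rbfopt package (which additionally requires a MINLP solver such as Bonmin to be installed). The parameter ranges and evaluation budget are arbitrary choices for illustration; this is not AutoAI’s internal code.

```python
import numpy as np
import rbfopt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(params):
    # rbfopt hands over a point as an array of floats; both of our
    # hyperparameters are integer-typed, so round them before use
    n_estimators, max_depth = int(params[0]), int(params[1])
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    # rbfopt minimizes, so return the negative cross-validated accuracy
    return -cross_val_score(model, X, y, cv=3).mean()

black_box = rbfopt.RbfoptUserBlackBox(
    dimension=2,
    var_lower=np.array([10, 2]),    # illustrative lower bounds
    var_upper=np.array([200, 20]),  # illustrative upper bounds
    var_type=np.array(['I', 'I']),  # both variables are integers
    obj_funct=objective,
)
settings = rbfopt.RbfoptSettings(max_evaluations=25)  # small budget for demo
best_val, best_x, *_ = rbfopt.RbfoptAlgorithm(settings, black_box).optimize()
print(f"best accuracy: {-best_val:.3f} at {best_x}")
```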
In summary, AutoAI is a one-click solution that takes a dataset as input and outputs an optimized pipeline spanning feature engineering, model selection, and hyperparameter optimization. It can serve as a go-to application for data scientists seeking a holistic view of feature exploration and model building.
Stay tuned for the next blog, which explores a use case implemented using AutoAI.