Automated feature engineering for relational data with AutoAI

Thanh Lam Hoang
IBM Data Science in Practice
6 min read · May 6, 2021

Feature engineering is one of the most important and tedious tasks in data science, especially when the input data is provided in a relational format with multiple tables. For that type of data, data engineers and scientists spend a lot of time writing and debugging SQL queries to join and aggregate the data. Typically, it takes weeks or even months to complete this task due to the many options for data joining and aggregation. In this article, we will learn how to perform feature engineering automatically with IBM AutoAI in Cloud Pak for Data. With the help of AutoAI, this task can be done in minutes, saving you weeks while achieving top performance results.

Use-case and dataset

Our example use-case concerns sale prediction for every product across retailers from the “Great Outdoors” (GO) company. The GO data includes five tables with the entity relation diagram shown in the following figure:

Entity relation diagram (ERD) of the GO dataset. The main table in orange has the prediction column QUANTITY. Other tables are called contextual tables connecting to the main table with foreign keys.
  • Go: the main table, with the prediction target column QUANTITY and a DATE column that indicates the cutoff time when the prediction should be made.
  • Daily sales: the GO company has many retailers selling its outdoor products; this table is a time series of sale records, where the DATE and QUANTITY columns indicate the sale date and sale quantity for each product in a retail store.
  • Products: keeps product information such as product types and brands.
  • Retailers: keeps retailer information such as retailer names and countries.
  • Methods: keeps order methods such as Via Telephone or Online.

These tables are connected by foreign keys. A connection can be one-to-one, i.e. every record in the left table has a foreign-key value that matches at most one record in the right table, or one-to-many, i.e. one record in the left table matches multiple records in the right table. For example, the connection between the Go table and the Products table via the Product number key is a one-to-one relation, while the connection between the Go table and the Daily sales table via the Retailer code and Product number keys is a one-to-many relation.
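AutoAI handles these joins internally, but the two cardinalities are easy to see in a manual sketch. The following is a minimal pandas illustration with made-up data whose column names mirror the GO schema (the table contents here are purely hypothetical):

```python
import pandas as pd

# Toy stand-ins for the GO tables (column names follow the article; values are made up).
go = pd.DataFrame({
    "Retailer code": [1, 1, 2],
    "Product number": [10, 11, 10],
    "QUANTITY": [5, 3, 7],
})
products = pd.DataFrame({
    "Product number": [10, 11],
    "Product type": ["garden", "hiking"],
})
daily_sales = pd.DataFrame({
    "Retailer code": [1, 1, 1, 2],
    "Product number": [10, 10, 11, 10],
    "DATE": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-01"]),
    "QUANTITY": [2, 3, 1, 4],
})

# One-to-one direction: each Go row matches at most one Products row,
# so the join does not change the number of rows.
one_to_one = go.merge(products, on="Product number", validate="many_to_one")
assert len(one_to_one) == len(go)

# One-to-many: a Go row can match several Daily sales rows, so rows multiply.
one_to_many = go.merge(daily_sales, on=["Retailer code", "Product number"])
assert len(one_to_many) >= len(go)
```

The `validate="many_to_one"` argument makes pandas raise an error if the right table unexpectedly contains duplicate keys, which is a cheap way to check the assumed cardinality.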

Feature engineering for relational data

In the GO use case, data scientists collect contextual information for every pair of product and retailer before building a predictive model. This process, called feature engineering, consists of two basic steps:

  • Table join: join the main table with contextual tables
  • Data aggregation: turn the joined table into features.

For example, by joining the Go table with the Products table (via the Product number foreign key) and the Retailers table (via the Retailer code key), we can bring static information such as Product type (from the Products table) or Retailer country (from the Retailers table) into the Go table. This yields useful features for predicting sales: a product type such as garden versus hiking, or geographical information such as country, is a useful predictor of sales. In this simple case, no aggregation is needed because the relations between these tables are one-to-one.

On the other hand, a more interesting source of information comes from the Daily sales table with dynamic information such as daily sales and unit price. Joining the Go table with the Daily sales table brings in the sales and unit-price history for each product and retailer. We need to transform the historical data with aggregation functions to obtain the features. The main question at this stage is which aggregations will result in useful features. Sometimes, the answer is based on domain expertise, but in many cases, data exploration is required to gain insight into the patterns hidden in the data.
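The aggregation step can be sketched manually as a group-by over the sales history. This is only an illustration with toy data and hypothetical feature names, not AutoAI's internal implementation; each aggregation function produces one candidate feature per (retailer, product) pair:

```python
import pandas as pd

# Toy sales history (values made up); one row per sale record.
daily_sales = pd.DataFrame({
    "Retailer code": [1, 1, 1, 2, 2],
    "Product number": [10, 10, 10, 10, 10],
    "DATE": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]),
    "QUANTITY": [2, 3, 4, 5, 6],
    "UNIT PRICE": [9.5, 9.5, 8.0, 9.0, 9.0],
})

# Collapse the one-to-many history into one feature row per (retailer, product).
features = (
    daily_sales
    .sort_values("DATE")
    .groupby(["Retailer code", "Product number"])
    .agg(
        quantity_mean=("QUANTITY", "mean"),
        quantity_max=("QUANTITY", "max"),
        quantity_last=("QUANTITY", "last"),   # most recent value (a lag-1 style feature)
        price_mean=("UNIT PRICE", "mean"),
    )
    .reset_index()
)
print(features)
```

Choosing which of these aggregations to keep is exactly the question the article raises; AutoAI searches over such candidates automatically instead of leaving the choice to trial and error.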

Automated feature engineering with AutoAI

Please check the full video demo to learn about the basic configuration steps for running an AutoAI experiment with the GO dataset. AutoAI provides an intuitive and efficient way for data scientists to perform feature engineering on relational data. Declaring the entity relation diagram is done by dragging and dropping tables onto a canvas and adding connections between them. The UI is designed to let people try different configurations efficiently. It also comes with convenient features such as automatic key suggestion and support for multi-column keys and multiple connections between tables.

A screenshot of ERD declaration for automated feature engineering in AutoAI.

AutoAI automatically performs the table joins and chooses appropriate aggregation functions, which can be explored in the UI while the experiment runs. Below is a list of time series aggregations applied to the result of joining the Go table with the Daily sales table.

Once the join has been done, the applied transformations are displayed for the given join between two tables, as in this screenshot of AutoAI.

Another important feature in AutoAI is the declaration of timestamps. For example, for each product in a retailer shop, we may need to predict the sale quantity on a daily or weekly basis. To do so, we can explicitly provide a column in the main table, called the cutoff timestamp, indicating the time of prediction. Providing this information not only helps AutoAI extract calendar-based features, but also avoids data leakage, which happens when the extracted features include information about the target. Likewise, if contextual tables have timestamps, such as the DATE column in the Daily sales table, we should explicitly provide that information to AutoAI, because it exploits the temporal order of the data to generate useful features from regular/irregular time series or event sequences.
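The leakage-avoiding role of the cutoff timestamp can be made concrete with a small manual sketch. This is a toy pandas illustration (made-up data, illustrative column names), not AutoAI's actual mechanism: the sales history is filtered to rows strictly before each cutoff before any aggregation is computed.

```python
import pandas as pd

# Main table: one prediction row with a cutoff DATE (values made up).
go = pd.DataFrame({
    "Retailer code": [1],
    "Product number": [10],
    "DATE": pd.to_datetime(["2021-01-03"]),   # cutoff: predict sales from this date on
})
daily_sales = pd.DataFrame({
    "Retailer code": [1, 1, 1],
    "Product number": [10, 10, 10],
    "DATE": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "QUANTITY": [2, 3, 99],   # the 99 on the cutoff day must not leak into the features
})

# Join, then keep only history strictly before the cutoff timestamp.
joined = go.merge(daily_sales, on=["Retailer code", "Product number"],
                  suffixes=("_cutoff", "_sale"))
history = joined[joined["DATE_sale"] < joined["DATE_cutoff"]]

# Aggregating only the pre-cutoff rows keeps the target out of the features.
feature = history.groupby(["Retailer code", "Product number"])["QUANTITY"].mean()
print(feature)   # mean of 2 and 3; not contaminated by the 99 at the cutoff
```

Without the date filter, the mean would include the quantity from the prediction day itself, giving the model information it could never have at prediction time.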

Declaration of timestamps helps AutoAI avoid data leakage and generate useful features from time series or event-sequence data.

Feature exploration and insights

When the experiment succeeds, the best machine learning pipelines are provided on the pipeline leaderboard.

Leaderboard of the top machine learning pipelines found by AutoAI, ranked by RMSE.

We can choose any pipeline and explore the feature importance plot to gain some insights into the main factors used by the model to make the prediction. In the example below, we learn that the most important feature for sale prediction is the lag-1 value (the most recent value) of the Quantity time series in the Daily sales table. This makes sense because there is a strong autocorrelation in the data: the sales quantity of the next day is highly correlated with the sales quantity of the previous day. AutoAI creates this feature automatically by joining the Go table with the Daily sales table using the Retailer code and Product number keys and aggregating the joined data with the transformation named recent_0 to get the lag-1 value, as shown in the figure below.

The feature importance plot shows the most important features according to the impurity score, which measures the contribution of each individual feature to the model prediction. The best feature is generated by considering the lag-1 value of the sale quantity time series.

Another interesting feature is created by joining the Go table with the Daily sales table using the Retailer code and Product number keys, and aggregating the joined data with the mean transformation, which conveys information about the moving average of the sales in the historical data.
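The two features discussed here, the lag-1 value and the running mean of the history, can be reproduced by hand for intuition. This is a toy pandas sketch with made-up data and hypothetical feature names (the recent_0 and mean transformations are AutoAI's; the code below is only an analogous manual computation):

```python
import pandas as pd

# Toy quantity time series for one (retailer, product) pair; values made up.
daily_sales = pd.DataFrame({
    "Retailer code": [1, 1, 1, 1],
    "Product number": [10, 10, 10, 10],
    "DATE": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "QUANTITY": [2, 4, 6, 8],
})

grouped = daily_sales.sort_values("DATE").groupby(["Retailer code", "Product number"])

# Lag-1 feature (analogous to recent_0): the most recent quantity before each row.
daily_sales["quantity_lag1"] = grouped["QUANTITY"].shift(1)

# Running-mean feature (analogous to mean): average of all quantities seen so far,
# excluding the current row so the feature never peeks at the target.
daily_sales["quantity_running_mean"] = (
    grouped["QUANTITY"].transform(lambda s: s.shift(1).expanding().mean())
)
print(daily_sales[["DATE", "QUANTITY", "quantity_lag1", "quantity_running_mean"]])
```

Note that both features are shifted by one step, which is the same leakage-avoidance discipline the cutoff timestamp enforces.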

The second best feature, the moving average of the sale quantity time series, is also generated automatically by AutoAI.

Deploy and batch scoring: when we are happy with a pipeline, we can deploy it into production for batch scoring, or iterate through the process a few more times to improve the results based on the insights gained so far. With AutoAI, data scientists are now free from tedious tasks, and the entire process from raw data to production is reduced from days or even weeks to minutes, even with very complex data.

What’s next?

You’re encouraged to try another interesting customer experience analysis with call center data after following the tutorial on our documentation page and the demo video.

Additional resources

For technical details about the algorithms and comparative studies, please refer to the following publications:

  1. Hoang Thanh Lam, Johann-Michael Thiebaut, Mathieu Sinn, Bei Chen, Tiep Mai and Oznur Alkan. One button machine for automating feature engineering in relational databases. arXiv, 2017.
  2. Hoang Thanh Lam, Beat Buesser, Hong Min, Tran Ngoc Minh, Martin Wistuba, Udayan Khurana, Gregory Bramble, Theodoros Salonidis, Dakuo Wang and Horst Samulowitz. Automated Data Science for Relational Data. Demo paper at IEEE ICDE 2021.
  3. Automated feature engineering for relational data with AutoAI cheatsheets https://developer.ibm.com.
