Unlocking your data’s potential with IBM Watson Studio’s AutoAI feature engineering on relational data

Yair Schiff
IBM Data Science in Practice
May 10, 2021
Screenshot of the user interface for setting up table joins in AutoAI

If you have ever spoken to a data scientist or machine learning practitioner (or are one yourself), you have probably had a conversation that sounded eerily similar to this:

“So, what do you do?”

“I’m a data scientist.”

“That sounds fun… What does that mean?”

“I apply rigorous mathematical and statistical analyses to make predictions learned from past data on new observations!”

“Cool! So, what’s a typical day like?”

“Well… I spend most of my time searching for, cleaning, and processing data files…”

“Oh…”

At IBM Watson® Machine Learning and IBM Research, we wanted to help you change that script and get back to the real work, which is why we are proud to announce the general availability release of our new enhancement for AutoAI: feature engineering on relational data. As part of our continuing effort to lower the barriers to adoption of state-of-the-art machine learning tools across organizations, this release builds on the success of our award-winning AutoAI offering and allows users to combine and extract new features from multiple data files, all with the click of a button. AutoAI is already one of Watson Studio’s premier offerings, automating the arduous work of data ingestion and preprocessing, feature engineering, model building, validation, and code generation. Relational data feature engineering is a new enhancement to this product that helps take your analyses to new heights.

Long gone are the days of siloed data, where disparate parts of an organization each maintained their own information, with no need to share or combine it for analysis. In today’s business settings, the full benefit of data science technologies cannot be realized without the ability to combine, process, and extract new information from data housed across an organization. AutoAI’s new feature engineering on relational data directly addresses this use case and allows data scientists to get back to what they really want to be doing: building and analyzing high-accuracy models.

Seeing is believing

If all of this sounds too good to be true, allow us to take a moment to walk through an example that highlights the power of AutoAI’s feature engineering on relational data. Even for those readers who are familiar with AutoAI, we highly recommend reading on to see all the added benefits of this new feature.

Let’s start with some sample data. Imagine you’re a data scientist working at a top outdoor equipment retail chain, “Great Outdoors” (GO), and you want to better predict sell-through data. Naturally, you want to rely not just on the tables that contain sales quantities and dates, but you also want to better understand which product features, retail locations, and channels most affect your data. This information lives in different locations across your organization, so what are you to do? Well, to really get into the role, read more details about this sample use case, find the sample datasets in the AutoAI experiment gallery, and let’s see where this data journey takes us:

Screen recording of adding a new AutoAI experiment to a project and selecting the Go Sales sample from the Gallery samples.
Adding the Go sales sample data as a new AutoAI experiment

We begin by adding the data assets to our AutoAI experiment. Once multiple files have been added, relational data feature engineering kicks into gear and users are prompted to walk through a friendly user interface that enables seamless data joining.

Screen recording of adding multiple data assets to an AutoAI experiment and setting up the table joins.
Adding multiple data assets to an AutoAI experiment and setting up table joins

Taking the place of complex relational database joins and preprocessing, this friendly user interface allows you to easily combine data, with helpful features like suggested join keys.
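
To make the idea of suggested join keys concrete, here is a minimal sketch in plain Python. It is not AutoAI's implementation: the `sales` and `products` tables, their column names, and the overlap heuristic are all assumptions made for illustration. The sketch ranks columns that appear in both tables by how many values they share, then joins on the best candidate.

```python
# Illustrative sketch only: the tables and column names are hypothetical,
# and this heuristic is an assumption, not how AutoAI picks join keys.

def suggest_join_keys(left, right):
    """Rank columns present in both tables by value overlap (0.0 to 1.0)."""
    shared = set(left[0]) & set(right[0])
    scored = []
    for col in shared:
        left_vals = {row[col] for row in left}
        right_vals = {row[col] for row in right}
        # Fraction of distinct left-hand values that also occur on the right.
        scored.append((col, len(left_vals & right_vals) / len(left_vals)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def inner_join(left, right, key):
    """Combine rows from both tables that share the same value for `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[key], [])
    ]


sales = [
    {"product_id": 1, "retailer_id": 10, "quantity": 5},
    {"product_id": 2, "retailer_id": 11, "quantity": 3},
    {"product_id": 1, "retailer_id": 11, "quantity": 8},
]
products = [
    {"product_id": 1, "product_type": "Tents"},
    {"product_id": 2, "product_type": "Lanterns"},
    {"product_id": 3, "product_type": "Backpacks"},
]

best_key, overlap = suggest_join_keys(sales, products)[0]
joined = inner_join(sales, products, best_key)
```

Here the only shared column is `product_id`, so it is suggested as the key, and each sales row picks up the matching product attributes.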

Once our data is ready, we can further customize the experiment. For example, by navigating to the experiment settings, we see an option to designate certain columns as timestamps, and the ability to set a sliding window, which controls how far into the past or future to look when joining two time-dependent datasets.

Screen recording of enabling timestamp thresholds in experiment settings.
Screen recording of enabling the sliding time window in experiment settings.
Enabling timestamp threshold (left) and sliding window (right) in experiment settings
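
For readers who want an intuition for what a sliding window does during a time-aware join, the following is a rough sketch, assuming a hypothetical `orders` table with `product_id`, `date`, and `quantity` columns. It only looks backward from a cutoff date, whereas the setting described above can also look forward, and it says nothing about how AutoAI implements the window internally.

```python
from datetime import date, timedelta

# Illustrative sketch only: `orders`, its columns, and the 7-day window
# are hypothetical values chosen for this example.

def rows_in_window(events, key, key_value, cutoff, window_days, ts_col="date"):
    """Return events for one key that fall in [cutoff - window_days, cutoff)."""
    start = cutoff - timedelta(days=window_days)
    return [
        event for event in events
        if event[key] == key_value and start <= event[ts_col] < cutoff
    ]


orders = [
    {"product_id": 1, "date": date(2021, 5, 1), "quantity": 4},
    {"product_id": 1, "date": date(2021, 5, 6), "quantity": 2},
    {"product_id": 1, "date": date(2021, 5, 9), "quantity": 3},
    {"product_id": 2, "date": date(2021, 5, 8), "quantity": 7},
]

# Orders for product 1 in the 7 days before May 10, 2021.
recent = rows_in_window(orders, "product_id", 1, date(2021, 5, 10), window_days=7)
recent_quantity = sum(row["quantity"] for row in recent)
```

Excluding the cutoff date itself keeps information from the moment being predicted out of the joined features, which is one common reason to bound a time-dependent join.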

With our settings complete, we hand the experiment over to AutoAI to sprinkle some machine learning magic on top of it.

The first step in the process after the data has been ingested is data joining and “join feature extraction”, which means that AutoAI is not just combining our data assets according to the flow we designated earlier, but it is also creating new features for us as it does so. By analyzing correlations with our prediction variable, looking for redundancies, and searching through a space of possible new features, AutoAI produces a brand new aggregate dataset on which the machine learning algorithms will run.

Screen recording of AutoAI performing table joins and relational data feature engineering.
AutoAI table joining and feature engineering on relational data
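
The general idea behind creating features during a join can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not AutoAI's algorithm: the `retailers` and `orders` tables, the column names, the choice of count, sum, and mean as aggregates, and the use of Pearson correlation for ranking are all hypothetical choices made for the example.

```python
from collections import defaultdict
from math import sqrt

# Illustrative sketch only: tables, columns, aggregates, and the ranking
# criterion are assumptions, not a description of AutoAI internals.

def aggregate_features(child, key, value_col):
    """Summarize child rows per key as count, sum, and mean of one column."""
    groups = defaultdict(list)
    for row in child:
        groups[row[key]].append(row[value_col])
    return {
        k: {
            f"{value_col}_count": len(vals),
            f"{value_col}_sum": sum(vals),
            f"{value_col}_mean": sum(vals) / len(vals),
        }
        for k, vals in groups.items()
    }


def attach(main, features, key):
    """Add the aggregated columns to each main row (0 when no child rows)."""
    names = sorted({name for feats in features.values() for name in feats})
    return [
        {**row, **{n: features.get(row[key], {}).get(n, 0) for n in names}}
        for row in main
    ]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_by_correlation(table, target, candidates):
    """Order candidate features by absolute correlation with the target."""
    ys = [row[target] for row in table]
    scores = [(c, pearson([row[c] for row in table], ys)) for c in candidates]
    return sorted(scores, key=lambda item: -abs(item[1]))


retailers = [
    {"retailer_id": 1, "target": 10.0},
    {"retailer_id": 2, "target": 20.0},
    {"retailer_id": 3, "target": 30.0},
]
orders = [
    {"retailer_id": 1, "quantity": 2}, {"retailer_id": 1, "quantity": 4},
    {"retailer_id": 2, "quantity": 5}, {"retailer_id": 2, "quantity": 7},
    {"retailer_id": 3, "quantity": 9}, {"retailer_id": 3, "quantity": 9},
]

features = aggregate_features(orders, "retailer_id", "quantity")
table = attach(retailers, features, "retailer_id")
ranked = rank_by_correlation(
    table, "target", ["quantity_count", "quantity_sum", "quantity_mean"]
)
```

In this toy data every retailer has the same number of orders, so the count feature carries no signal and ranks last, while the sum and mean track the target equally well. A pair like that is the sort of redundancy a feature search would want to detect and prune.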

Once relational data feature engineering is complete, the core AutoAI process takes over with a cutting-edge pipeline generation process that involves optimal data allocation to different algorithms, additional feature engineering, and hyperparameter optimization.

Screen recording of an overview of the full AutoAI pipeline generation process.
AutoAI pipeline generation process

And just like that, with a few clicks through an intuitive user interface, we’ve built a highly accurate model on several datasets.

As part of Watson Studio and Cloud Pak for Data™, the generated AutoAI pipelines have access to the full suite of data science lifecycle solutions offered on these platforms. For example, once the desired pipeline is selected, it can be saved and deployed for scoring.

Screen recording of saving a pipeline from an AutoAI experiment as a model.
Screen recording of promotion of a saved model to a deployment space.
Saving an AutoAI pipeline as a model (left) and promoting it to a deployment space (right)

By automating the difficult steps that often serve as a barrier to machine learning adoption, AutoAI’s feature engineering on relational data further establishes Watson Studio as the premier destination for data scientists and organizations looking to apply the power of AI to their data.

Get started with AutoAI and relational data feature engineering today by visiting Watson Studio.

Happy modeling!
