Facing the Roadblocks of Machine Learning with AutoAI

Published in IBM DCPE Group · 7 min read · Jun 24, 2020

Over my years of working with clients, it has become increasingly common to hear “AI”, “innovate”, and “understand our data” in the same discussion, often in the same sentence! The reason is worth spelling out: the amount of data that organizations collect is constantly growing, and the more it grows, the more unwieldy it becomes. Humans are fantastic at understanding patterns, but we’ve never been tasked with grasping this much information in such a short amount of time.

The natural conclusion many reach is to use AI and machine learning to make sense of it. From predictive maintenance and IoT to customer segmentation and fraud detection, there are many areas in which to apply AI and tremendous value to gain.

The roadblock?

At the heart of AI and machine learning is data science. Put simply, building a useful machine learning model means working through the steps of the “data science/AI lifecycle”.

You may see other versions of this lifecycle, such as CRISP-DM. Any version of this lifecycle will essentially contain the following steps: business & data understanding, data preparation, model (or multiple models/ensemble) training, validation & evaluation, and deployment. This lifecycle is not necessarily linear, either. For example, when data is prepared and cleansed, there is a possibility that the use case doesn’t align with the data that’s available — so the use case must be tweaked.

After formulating the use case and gathering the appropriate data, the challenge appears:

What tools do we use to make sense of the data and to build models?
What if I don’t have experience with leveraging languages such as Python and R?
How do I make sure I’m doing all the data prep I need?
How do I know I’m using the appropriate algorithm for my data?
How do I deploy and integrate this model into my current application or business process?

These are all valid and important questions, and they understandably make organizations hesitant to begin using AI and machine learning. The understanding of the data and the business is there, so do we really need a full data science team that knows Python from the get-go just to start seeing any benefit?

Not necessarily. Let’s discuss IBM AutoAI.

AutoAI: Leveraging best practices with automation

AutoAI bridges the gap, end-to-end, between data selection and model deployment. Once you have the historical data used to build a model, simply feed it into AutoAI and tell it what you want to predict. From there, AutoAI does the heavy lifting.

What does AutoAI automatically do? Again, it’s intended to automate the lifecycle:

Data preparation (missing value imputation, feature encoding, feature scaling, etc.)
Model selection (testing and ranking candidate algorithms)
Feature engineering (feature combinations to best represent the problem)
Hyperparameter optimization (refining the best performing model pipelines, converging to a better solution)
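To make the four steps above concrete, here is a toy sketch in plain Python of what each one means. It is not AutoAI's implementation: the sensor rows are made up, the "models" are simple one-feature threshold rules, and every name (`ROWS`, `impute_mean`, `standardize`, `best_threshold`, `rank_candidates`) is hypothetical. It also scores candidates on the same rows it tunes them on, whereas a real tool would use held-out data or cross-validation.

```python
from statistics import mean, pstdev

# Made-up sensor rows: (temperature, light, occupancy). None = missing reading.
ROWS = [
    (21.0, 0.0, 0), (21.2, 5.0, 0), (None, 0.0, 0), (21.1, 10.0, 0),
    (23.0, 430.0, 1), (23.4, None, 1), (22.9, 450.0, 1), (23.2, 465.0, 1),
    (21.3, 420.0, 0),  # empty room with the lights left on
    (23.1, 440.0, 1),
]


def impute_mean(column):
    """Data preparation: replace missing values with the column mean."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Data preparation: scale a column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_threshold(feature, labels, grid):
    """Hyperparameter optimization: grid-search the cut-off of a threshold rule."""
    scored = [(accuracy([int(v > t) for v in feature], labels), t) for t in grid]
    return max(scored)  # (best accuracy, threshold that achieved it)


def rank_candidates(rows):
    temperature = standardize(impute_mean([r[0] for r in rows]))
    light = standardize(impute_mean([r[1] for r in rows]))
    labels = [r[2] for r in rows]

    # Feature engineering: combine two inputs into one derived feature.
    combined = [t + l for t, l in zip(temperature, light)]

    candidates = {
        "temperature": temperature,
        "light": light,
        "temperature+light": combined,
    }
    grid = [x / 10 for x in range(-20, 21)]

    # Model selection: tune each candidate, then sort into a leaderboard.
    leaderboard = []
    for name, feature in candidates.items():
        acc, threshold = best_threshold(feature, labels, grid)
        leaderboard.append((name, acc, threshold))
    return sorted(leaderboard, key=lambda entry: entry[1], reverse=True)


for name, acc, threshold in rank_candidates(ROWS):
    print(f"{name:20s} accuracy={acc:.2f} threshold={threshold:+.1f}")
```

Because one empty room in this made-up data has its lights on, the light-only rule ends up at the bottom of the leaderboard, which is the kind of comparison an automated tool runs across far more candidates.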

Sound familiar? These steps are crucial to the AI lifecycle (shown in Figure 1). If you and your team are able to tackle varying use cases and build models using an automated platform, while knowing it adheres to best practices of data science, then imagine the possibilities! And with our Watson Machine Learning component, any of these models can be deployed and put into production very quickly.

What would it mean for your organization to automate tasks that typically take days to weeks, even with an already-established data science team? What use cases are you thinking of tackling? What data springs to mind as a good candidate for model-building? With AutoAI, these questions can be answered and validated efficiently (minutes vs. months).

An example: Detecting occupancy

Suppose we are an organization in the HVAC industry that is looking to leverage data coming from various sensors to better cool or heat office rooms. Running air conditioning or heating unnecessarily, especially when nobody is in the room in the first place, drives up energy consumption and the related bills. How can an organization tackle this?

A first step would perhaps be to understand and predict the likelihood of occupancy in a room using sensor data. This can drive how often an HVAC unit should be running. We may already have some historical sensor data where we know whether or not there was occupancy in the room. For this example, I’ll be pulling from the UCI Machine Learning Repository¹, specifically the Occupancy Detection Data Set².

We now have this data — what do we do with it? Leveraging AutoAI, we’ll be able to build a model and deploy it, all in an automated fashion.

Taking a look at the historical data we’ll use here (Figure 3), we can see the various sensor data inputs (such as temperature, humidity), as well as occupancy (a value of “1” meaning people were in the room, and “0” meaning it was empty). Data gathering is the extent of our human work in this example, as AutoAI will take it from here!

Figure 4: AutoAI experiment configuration

Configuring the AutoAI experiment is very straightforward. Figure 4 highlights the steps — we select our dataset, choose what we want a machine learning model to predict (occupancy in this case), and AutoAI will automatically select the appropriate settings and metric to focus on. You can of course configure these settings if you wish.

The fun and insights really begin here! The relationship map in Figure 5 provides an overview of the many model pipelines that are automatically built, which consist of any feature transformations to the data, the top performing algorithms, and any sort of feature engineering/hyperparameter optimization needed. AutoAI then ranks the pipelines in an easily consumable format. In Figure 5, it appears that Pipeline 7 ranks the highest on the leaderboard. Let’s take a closer look.

Figure 6: Model information for AutoAI pipelines

There are a few pieces of information to take in from the model evaluation in Figure 6, but at a high level, this is a pretty accurate model. We can also view information such as confusion matrices, precision-recall curves, model details, the specific feature transformations, and feature importance. These details are vital to business analysts and data scientists alike!
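For readers less familiar with these evaluation views, the following minimal sketch shows how a confusion matrix and the accuracy, precision, and recall figures derived from it are computed for a binary occupancy model. The labels are made up for illustration and the function names (`confusion_counts`, `summarize`) are hypothetical; none of this comes from the AutoAI experiment itself.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = occupied)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the confusion counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }


# Made-up ground truth and predictions for eight sensor readings.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(counts)
print(counts)
print(metrics)
```

In the HVAC setting, a false negative (an occupied room predicted empty) means discomfort, while a false positive means wasted energy, so precision and recall are worth reading separately rather than relying on accuracy alone.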

Suppose we feel comfortable in leveraging this model pipeline for a future application. What can we do with it? There are two interesting ways to proceed.

Figure 7: Saving a model or converting to a notebook

We can either save this pipeline as a model, which we can then deploy, or save it as a notebook to obtain the source code. Data scientists may find the second option quite interesting — while AutoAI is speeding up the process to model creation, it is built on open source. This is revealed through this option!

For this example, we’ll go for the first option. In the platform, we can then easily save and deploy this model as an online web service to which we can make an API call.

The above figure showcases a deployed model. Given these particular inputs (which may be coming from sensors in the room every few minutes), the room is predicted to be occupied with a probability of 99%.
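As a rough illustration of what calling such a web service involves, the sketch below builds a JSON scoring request and reads a prediction back out of a response. The `input_data` / `fields` / `values` request layout and the `predictions` response layout are assumptions modeled on the shape Watson Machine Learning online deployments typically use, so check the API reference for your own deployment. The column names, the sensor values, and the sample response are all made up, and no network call is made here.

```python
import json

# Assumed input column names; use the ones from your own training data.
FIELDS = ["Temperature", "Humidity", "Light", "CO2"]


def build_payload(readings):
    """Turn a list of sensor-reading dicts into a fields/values scoring payload."""
    values = [[reading[name] for name in FIELDS] for reading in readings]
    return {"input_data": [{"fields": FIELDS, "values": values}]}


def parse_predictions(response):
    """Extract (predicted label, highest class probability) for each scored row."""
    block = response["predictions"][0]
    label_idx = block["fields"].index("prediction")
    prob_idx = block["fields"].index("probability")
    return [(row[label_idx], max(row[prob_idx])) for row in block["values"]]


# One made-up reading, as might arrive from room sensors every few minutes.
reading = {"Temperature": 23.1, "Humidity": 27.2, "Light": 430.0, "CO2": 720.0}
payload = build_payload([reading])
body = json.dumps(payload)  # the JSON body that would be POSTed to the endpoint

# Hypothetical response, hard-coded so the example runs offline.
sample_response = {
    "predictions": [
        {"fields": ["prediction", "probability"], "values": [[1, [0.01, 0.99]]]}
    ]
}
results = parse_predictions(sample_response)
print(results)
```

In a real integration the `body` string would be sent to the deployment's scoring URL with an authorization header, and the application would act on the returned label and probability.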

In a matter of minutes, we were able to take historical data, run it through AutoAI, and deploy a model for future use! We can certainly build on this with more data points and other use cases.

In conclusion…

AI and machine learning can seem intimidating, and there are typically roadblocks around skill sets and best practices in data science to navigate. With IBM AutoAI, you are able to automate and adhere to the AI lifecycle of data preparation, model selection & development, and deployment, all while keeping it explainable to both business users and data scientists.

Let me know if there are use cases you can think of testing with AutoAI! I’m always interested to hear about how others are using it.

Want to start?

AutoAI is within our Watson Studio platform. Start on IBM Cloud for free.

Contact: Ahmed Abdellatif

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[2] Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28–39.