AutoAI for Data Scientists: From Beginner to Expert

Data science is a required practice for organizations accelerating their journeys to AI. Businesses are keen on hiring the right talent, acquiring the right tools, and evolving the discipline. When it comes to data science projects, there are two major problems:

1) There are not enough data scientists.

2) It takes too much time for any data scientist to get to a usable, tuned model.

Solving the shortage of data scientists requires investing time and training in our employees. We can’t expect these people to spend a year learning before they become productive. We need to reach a stage where people know enough to start contributing immediately while continuing to improve their skills.

As for the second problem, the time it takes to reach a usable, tuned model, we need tools that make our data scientists more productive. Some relatively mundane tasks could be automated, leaving the more challenging and interesting parts to the data scientist.

Enter AutoAI. It recently won the AIconics best innovation in intelligence automation award. Let’s talk about how it addresses our problems.

AIconic Award for AutoAI

Currently, AutoAI addresses classification and prediction (regression) problems. These types of problems are at the core of many data science initiatives. If you are an experienced data scientist, you know how to solve them. With AutoAI in Watson Studio, you can quickly see a leaderboard of the candidate pipelines, which helps accelerate model selection. If you are learning data science, you can see how these functions are used.

AutoAI processing and leaderboard

At the highest level, creating a model involves taking some data, passing it through a machine learning algorithm, and getting a resulting model. Well, it’s not always that simple.

Let’s say you have your data as a comma-delimited file (.csv). To start with, all the attributes are character strings. We need to identify all the fields that are numeric and convert them into integer, decimal or floating-point numbers. You also have to consider dealing with missing values and normalization.
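To make these preparation steps concrete, here is a minimal sketch using only the Python standard library. It is not how AutoAI works internally. The sample data, the column names, and the choices of mean imputation and min-max scaling are all assumptions made for illustration.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data; the column names are illustrative only.
RAW_CSV = """age,income,gender
34,52000.5,F
,61000,M
45,,F
29,48000,M
"""

def to_number(text):
    """Return an int or float for numeric text, None for empty text, else the string."""
    text = text.strip()
    if text == "":
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text

def is_numeric_column(values):
    """A column is numeric when every non-missing value parsed as a number."""
    present = [v for v in values if v is not None]
    return bool(present) and all(isinstance(v, (int, float)) for v in present)

def prepare(csv_text):
    """Parse the CSV, then impute and scale each numeric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = {name: [to_number(r[name]) for r in rows] for name in rows[0]}
    for name, values in columns.items():
        if not is_numeric_column(values):
            continue  # character columns are left untouched here
        present = [v for v in values if v is not None]
        fill = mean(present)  # assumption: replace missing values with the mean
        filled = [fill if v is None else v for v in values]
        lo, hi = min(filled), max(filled)
        span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
        columns[name] = [(v - lo) / span for v in filled]  # min-max scaling to [0, 1]
    return columns

prepared = prepare(RAW_CSV)
print(prepared["age"])
print(prepared["income"])
```

Other imputation and scaling strategies are equally valid; which one works best depends on the data and the algorithm.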

The character fields also have to be converted to numeric values. Typically, we are talking about categorization. For example, gender, type of payment, and so on.
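One common way to turn a character field into numbers is one-hot encoding, sketched below in plain Python. The payment values are made up for illustration, and this is only one of several possible encodings, not necessarily the one AutoAI applies.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator vector, one position per category."""
    categories = sorted(set(values))  # stable, alphabetical category order
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical "type of payment" column.
payments = ["card", "cash", "card", "transfer"]
categories, encoded = one_hot(payments)

print(categories)
for original, row in zip(payments, encoded):
    print(original, row)
```

Each row contains exactly one 1, so the encoding carries no artificial ordering between categories.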

We must admit that this is not the most exciting part of creating a model. Automating it makes expert data scientists more efficient and helps more junior data scientists avoid mistakes in pre-processing the data, even while they are still learning what needs to be done.

See for yourselves: You can try the AutoAI tutorial on the IBM Cloud for free.

Next comes choosing an algorithm. Do we use a decision tree? An ensemble? There are so many to choose from. Which one is the best for the type of data and problem we have?

Curated models available through AutoAI

We also have to contend with feature engineering and hyper-parameter tuning. Which new features should you create? Based on what? Selecting the right mix takes experience. Hyper-parameter tuning can be tricky as well. You could end up with a model that works great on training data but not so well on new data. You could also end up with a less than optimal model.
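The gap between training and new-data performance can be illustrated with a small sketch. It tunes a single hyper-parameter, the number of neighbors k in a toy nearest-neighbor regressor, against a held-out validation set. The synthetic data, the candidate values, and the regressor itself are assumptions chosen for illustration; AutoAI's own search is more sophisticated.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the average y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, data, k):
    """Mean squared error of the k-NN regressor fitted on `train`, scored on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Synthetic data: a straight line plus noise, with a fixed seed for repeatability.
rng = random.Random(0)
points = [(x / 10, 2 * (x / 10) + rng.gauss(0, 1)) for x in range(60)]
rng.shuffle(points)
train, valid = points[:40], points[40:]

candidates = [1, 3, 5, 9]
scores = {
    k: {"train": mse(train, train, k), "valid": mse(train, valid, k)}
    for k in candidates
}
best_k = min(candidates, key=lambda k: scores[k]["valid"])

for k in candidates:
    print(k, scores[k])
print("selected k:", best_k)
```

With k = 1 the model memorizes the training points, so its training error is zero, yet that says nothing about how it behaves on the validation points. Choosing k by validation error is what guards against picking a model that only looks good on data it has already seen.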

AutoAI addresses all those issues and allows you to make an educated decision on which model performs best. Your decision is assisted by evaluation measures such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and others on both the training and testing data (including cross-validation). You can even see the details of how feature engineering was done and the feature importance. This is especially valuable for beginners who are starting to learn data science. Expert data scientists can use it to validate or adjust some of their assumptions.
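For readers new to these measures, the two named above can be computed in a few lines. This is a minimal sketch of the standard definitions with made-up values, not output from AutoAI.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: like MAE, but large errors weigh more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Made-up target values and model predictions.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are much larger than the rest.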

Model evaluation

Once you decide on the model to use, you can save and deploy it into an IBM Watson Machine Learning service so people can score their data through a simple REST API.
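As a rough idea of what scoring through a REST API looks like, the sketch below builds (but does not send) an HTTP POST request with the standard library. The URL, the token, the field names, and the exact payload layout are placeholders and assumptions; the real endpoint and request format come from your deployment's documentation.

```python
import json
import urllib.request

def build_scoring_request(url, token, fields, rows):
    """Build a JSON POST request carrying the rows to score.

    The payload layout (a list of field names plus a list of value rows)
    is an assumption for illustration; check the service documentation.
    """
    payload = {"input_data": [{"fields": fields, "values": rows}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# Placeholder values only: none of these identify a real deployment.
request = build_scoring_request(
    url="https://example.com/deployments/DEPLOYMENT_ID/predictions",
    token="YOUR_ACCESS_TOKEN",
    fields=["age", "income", "gender"],
    rows=[[34, 52000.5, "F"]],
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint above is not real.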

Saving an AutoAI model

A Perfect Blend of Open-source and IBM Technology

So, is this a proprietary solution? Not at all!

Instead of saving the model to an IBM Watson Machine Learning service, you can save it as a notebook. This way, you can generate the model yourself and decide where to save and deploy it. Since it is a notebook, you can modify it for any reason, whether that is adding a transformation or making it fit datasets with additional attributes. And of course, you can use the notebook with an open-source tool or a Watson Studio-based one.

Generated notebook

One side benefit of generating a notebook could be for education and training. It is always instructive to see how things are done, and beginner data scientists may see some transformations they did not think about for this or other projects. This becomes learning by example. IBM is committed to leading and empowering the open-source community and data science is, of course, no exception!

We stated that two important problems we want to solve are to make the beginner data scientist productive as soon as possible and remove some burdens from the experienced data scientist so they can be more productive. With AutoAI Experiments, we remove the burden of having to deal with all the details of preparing the data. This way, a beginner data scientist does not need to know all the intricacies of data preparation right away and the experienced data scientist does not need to spend her time on mundane tasks so she can focus on higher-value tasks.

Since AutoAI can select the most appropriate model for classification or regression, automate feature engineering and hyper-parameter tuning, and provide measurements of model quality, data scientists can focus on evaluating and selecting the model instead of the mechanics of creating one.

Overall, AutoAI democratizes data science and AI: data preparation, model development and selection, execution, and deployment. This addresses the shortage of data scientists and gets to a solution faster. By accelerating the data science lifecycle with AutoAI, businesses can focus more on high value-added work and innovative solutions. This is why we are focused on sharing the best practices and playbook in AI. I predict that the future of work in data science will be more exciting and dynamic.

Ready to learn more about AutoAI?

Check out this website, where we built an AutoAI playlist of videos, product tours, and hands-on labs. Or, join us at our live 3-part Virtual Data Science Camp Fall Edition starting on October 31, 2019. You can view the Summer Edition of this popular 3-part series here. If you are interested in other IBM Watson Studio-related webinars, please read the following blog.

Jacques Roy is a worldwide Digital Technical Engagement lead on Watson Studio and Watson Machine Learning at IBM, where he helped build a community of data scientist followers who are brushing up their skills at all levels. He loves to talk about data science, use cases, and best-practice tips.

Please reach out to Jacques for any questions or comments!