Is AutoML Useful for Professional Data Scientists?

George Vyshnya
SBC Group Blog
Apr 22, 2021 · 13 min read
Cover photo by Liza Summer

Introduction

As interest in AutoML technologies has grown across the industry over the last few years, there has been intense discussion about how useful they are in the daily routines of professional Data Scientists. Opinions are polarized. Some researchers report that AutoML improves the productivity of Data Scientists, whereas other professionals state quite sharply, "There is no silver bullet for solving machine learning/deep learning with off-the-shelf algorithms. Instead, I recommend you invest in yourself as a deep learning practitioner and engineer".

The drama of this discussion is heightened if you remember that AutoML technologies were never intended for professional Data Scientists in the first place. Instead, AutoML was originally invented to lower the barriers for non-ML experts, letting them adopt good-enough ML technologies in their operations without having to hire expensive, skilled Data Scientists and/or ML Engineers.

With this context in mind, let us review the practical cases where certain AutoML tools can help automate the daily routines of Data Science / ML Engineering professionals. Practical case studies, applied to the public datasets of the Kaggle Tabular Playground competitions of Jan–Mar 2021, will illustrate the points made.

Classification of Modern AutoML solutions

“Balance” (painting by Alina Ciuciu, 2021)

Since a huge variety of AutoML tools (both freeware and commercial) exists on the market nowadays, the question of whether AutoML is useful for ML/Data Science professionals should instead read, "What kind of automation helps Data Science and ML professionals complete their daily routines?"

A practical classification groups AutoML tools by the specific tasks they automate. From that standpoint, we can talk about:

  • Automated EDA tools aka Rapid EDA tools (e. g. AutoViz, Sweetviz, Pandas Profiling etc.)
  • Automated data augmentation tools (e.g. imgaug, albumentations)
  • Automated feature engineering/selection tools (e.g. tpot, featurewiz, boruta_py)
  • Automated model selection tools (e.g. auto-sklearn, xcessiv)
  • Automated model architecture search tools (e.g. darts, enas)
  • Automated hyperparameter tuning tools (e.g. hyperopt, ray.tune, Vizier)
  • Tools to automate the full ML pipelines (e.g. Google AutoML, H2O Driverless AI, autokeras, AutoViML etc.)

In the sections below, I am going to share my own experience with some of these AutoML tools. As I go, I will offer my opinions on what can be really helpful for professionals in the field, as opposed to less ML-savvy users.

Rapid EDA Use Cases

“The elective affinities” (painting by Alina Ciuciu, 2020)

Rapid EDA is one of the low-hanging fruits where AutoML tools can help professional Data Scientists in their daily routines. Obviously, a good-enough Rapid EDA tool should meet a few basic criteria:

  • Generate a good number of insightful charts, letting a human professional delve into the data immediately
  • Operate with high performance, genuinely saving professionals' time on EDA while delivering helpful visualizations out of the box
  • Be easy to use and maintain

I have already completed several solid case studies proving the usefulness of various freeware Rapid EDA tools.

Across the experiments I have run, I stuck with AutoViz over time. I found it met all three basic criteria above. It has been helpful to me on several real projects as well as in Kaggle competitions.

As you can see, invoking AutoViz is as simple as writing a few lines of code.

AutoViz has a few parameters to tune before you run it on your dataset. However, there is a comprehensive guide on how to set it up for various datasets/problems to achieve the best exploratory insights.

Note: to see AutoViz in action end-to-end, you can refer to the source code in the following notebooks

Automated Feature Selection: Featurewiz

“Love thoughts and a pure soul” (painting by Alina Ciuciu, 2021)

Automated feature selection and feature importance detection tools are extremely helpful when you need to explain your ML model or address the curse of dimensionality in your data (when a dataset has too many features, the model loses focus and delivers less accurate predictions).

Over my years in the industry, I have stuck with an amazing freeware feature selection product: featurewiz. As opposed to the available feature selection automation alternatives (Boruta, TPOT, etc.), it turned out to deliver more accurate feature importance suggestions, applicable to ensembles of weak learners (GBDT and RF models) as well as some of the less complicated ML algorithms provided by scikit-learn.

At the same time, its simplicity and power made it a more efficient tool than the analytical feature importance detection algorithms (see one of my earlier blog posts for more details on those).

An additional bonus of featurewiz is its relatively new capability to automate feature engineering routines. Not only will it save you time developing your feature engineering pipelines, but it will also ensure fast execution of your data preprocessing and feature engineering flows.

Below are step-by-step descriptions of two featurewiz-backed feature selection experiments on the dataset of the Mar 2021 Tabular Playground competition. The first is a fairly basic feature importance detection (without any heavy feature engineering or data preprocessing applied to the raw dataset). The second demonstrates how featurewiz can assist in situations where you must implement complex data preprocessing and feature engineering flows.

Mar 2021 TPC: Basic Feature Importance Experiment

In this experiment, we are going to detect the feature importance for raw feature variables only.

First, we are going to read the competition datasets into the memory.

As the next step, we will pass the datasets with the raw features only to featurewiz to detect feature importance.

It took less than 3 minutes to run on my local machine. featurewiz was quite instrumental in detecting the important features that quickly.

As a side note, featurewiz lets you quickly assess the impact of various categorical variable encoding techniques on the resulting feature importance.

Note: You can trace this experiment end-to-end by reviewing the source code in the respective repo.

Advanced Experiment: Additional Feature Engineering based on AutoViz insights and feature importance

“Ice moon” (painting by Alina Ciuciu, 2021)

In a series of feature importance experiments (with AutoViz engaged as described above), we obtained the following insights regarding feature engineering and data preprocessing for the raw dataset of the Mar 2021 Tabular Playground competition on Kaggle:

1/ The following continuous variables proved useful in the feature importance experiment and are left 'as is'

  • cont1, cont3, cont5, cont6, and cont8

2/ The following continuous variables are binned

  • cont0 (into 4 bins)
  • cont1 (into 5 bins)
  • cont3 (into 2 bins)
  • cont4 (into 2 bins)
  • cont6 (into 3 bins)
  • cont8 (into 3 bins)
  • cont10 (into 10 bins)

3/ New group-by features are added

  • group-by cont3 by cat2, cont1 by cat4, and cont3 by cat4

4/ New categorical interaction variables (feature crosses) are added

  • cat4 x cat18
  • cat13 x cat4
  • cat13 x cat2

5/ New continuous interaction variables are created and then binned

  • cont3 x cont7
  • cont3 x cont8
  • cont3 x cont9
  • cont3 x cont10
  • cont4 x cont5
  • cont4 x cont6
  • cont4 x cont9
  • cont4 x cont10

6/ Boolean categorical variables are kept as is

  • cat0
  • cat1
  • cat12
  • cat13
  • cat14
  • cat15
  • cat16

7/ cont5, cont8, and cont7 are log-transformed

As we are going to see, the above feature engineering pipeline can easily be facilitated with featurewiz. Its recent versions (namely version 0.0.33 at the time of writing) provide powerful functions that help you automate all of the above preprocessing and feature engineering steps with just a few lines of code. Let's see it in action.

We can add the new interaction features as follows

Binning the continuous variables can be achieved as follows

Categorical feature crosses can be added as new features in the following manner

The groupby aggregate features can be added as follows

Last but not least, the easy-to-apply log transform (with all the little tricks like "plus one" on non-positive values, etc.) can also easily be facilitated with featurewiz in just a few lines of code.
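Since the exact names and signatures of the featurewiz helper functions vary between versions, here is a plain-pandas sketch of the five kinds of transformations listed above; the column names follow the Mar 2021 TPC dataset described in the text, and the specific choices of columns are illustrative.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # interaction continuous variables (step 5)
    out["cont3_x_cont7"] = out["cont3"] * out["cont7"]
    # binning a continuous variable into quantile bins (step 2)
    out["cont0_bin"] = pd.qcut(out["cont0"], q=4, labels=False, duplicates="drop")
    # categorical feature cross (step 4)
    out["cat4_x_cat18"] = out["cat4"].astype(str) + "_" + out["cat18"].astype(str)
    # group-by aggregate feature (step 3)
    out["cont3_mean_by_cat2"] = out.groupby("cat2")["cont3"].transform("mean")
    # log transform with a shift to keep the argument positive (step 7)
    out["cont5_log"] = np.log1p(out["cont5"] - out["cont5"].min())
    return out
```

The featurewiz helpers wrap exactly these kinds of operations, so you can trade a hand-written pipeline like this for a few one-line calls.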

As we can see, featurewiz is quite instrumental in automating the common routine steps of data preprocessing and feature engineering, along with its core mission of detecting the important features for ML modelling down the road.

After the above feature engineering and data preprocessing, we are ready to launch feature importance detection with featurewiz, similar to what was demonstrated in the basic experiment above.

It took less than 9 minutes on my local computer to run this pipeline and feature selection experiment end-to-end. I should say, that is an amazingly short time, given the complexity of the data transformations and the set of new features we obtained.

Note: You can trace this experiment end-to-end by reviewing the source code in the respective repo

Automated Hyperparameter Tuning Tools

“Right wind” (painting by Alina Ciuciu, 2020)

Automated hyperparameter tuning tools designate one more area where the daily routines of professional Data Scientists can usefully be automated. They help shrink the time spent on tedious model parameter tuning while being more effective than classical grid search algorithms.

This is especially helpful when you tune GBDT models (lightgbm, xgboost, catboost, etc.), which have a huge number of essential hyperparameters to tweak. Therefore, doing it manually or via a classic grid search may not be the best way to spend your time.

I found it equally helpful to use hyperopt and optuna in this capacity.

Tools like hyperopt or optuna can be successfully leveraged to speed up parameter tuning of both individual models and ensembles of different learners.

Below is a case study demonstrating how you can combine the power of ensemble learning with hyperopt to tune the hyperparameters of each of the three models in the ensemble.

Note: You can trace this experiment end-to-end by reviewing the source code in the respective repo

First, let’s review the targets of this ML experiment. They are as follows:

  • Build the ensemble classification prediction model for the problem in Kaggle Tabular Playground Series — Mar 2021 contest, using three GBDT models (lightgbm, xgboost, and catboost) as the ensemble members
  • Automate the hyperparameter tuning of every model in the ensemble using hyperopt
  • Use AUC (Area under the ROC curve) as a model performance metric to optimize the model parameters for

From the software development standpoint, we will implement the ensemble as a custom Python class, with the key methods of a scikit-learn Classifier predictor (fit, predict, predict_proba) implemented. We are also going to enable a weighted-voting prediction option, in case the ML modelling justifies it.

This will allow us to pass an instance of the ensemble modeler class wherever a scikit-learn Classifier object can be passed (including the search function for hyperopt).

We will rely on the scikit-learn Classifier interfaces provided by the maintainers of the lightgbm, xgboost, and catboost libraries (rather than their native interfaces), for unification and simplicity.
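A condensed sketch of such an ensemble wrapper is shown below; the class and parameter names are assumptions, not the author's actual code. Any member exposing the scikit-learn Classifier contract (fit / predict / predict_proba) works, so LGBMClassifier, XGBClassifier, and CatBoostClassifier drop in directly as the three members.

```python
import numpy as np

class VotingGBDTEnsemble:
    """Weighted soft-voting ensemble over scikit-learn-style classifiers."""

    def __init__(self, estimators, weights=None):
        self.estimators = estimators
        # equal voting weights unless the modelling justifies otherwise
        self.weights = (np.ones(len(estimators)) if weights is None
                        else np.asarray(weights, dtype=float))

    def fit(self, X, y):
        for est in self.estimators:
            est.fit(X, y)
        return self

    def predict_proba(self, X):
        # weighted average of each member's class probabilities
        probas = [w * est.predict_proba(X)
                  for w, est in zip(self.weights, self.estimators)]
        return np.sum(probas, axis=0) / self.weights.sum()

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
```

Because the class exposes fit / predict / predict_proba, it can be scored by any scikit-learn utility and plugged into a hyperopt objective unchanged.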

Our next step is building the hyperopt search function, as per the code fragment below.

There are several highlights of the search function implemented:

  • Since we optimize the model for the best AUC value (that is, to ensure the maximal AUC across hyperopt's search rounds), we are forced to invent an artificial metric, negative AUC, because hyperopt's search works to minimize the specified loss function
  • The parameter space passed to hyperopt's search function should specify the relevant attributes for all three models used in our ensemble (that is, lightgbm, xgboost, and catboost)

The appropriate parameter space is facilitated via a special naming convention we adopt, supported and enforced by the constructor of our custom ensemble class (see above). As a result, we have to prepare the dictionary with the hyperopt search space attributes in a special way, as displayed below.

Once hyperopt detects the optimal set of hyperparameters for each model in the ensemble, we do the usual model prediction as per the scikit-learn contract.

On the Dark Side of AutoML

So far so good? Does it sound like AutoML is going to be paramount for professional Data Scientists and ML Engineers? Unfortunately, not everything is so bright. We can see certain drawbacks once we touch two more types of AutoML products:

  • Automated Model Architecture Search Tools
  • Tools to automate the full ML pipelines

I am going to review these in the next blog post of this series. The next post will also cover the eternal AutoML myth: the one claiming that AI and AutoML will replace highly skilled (and often highly paid) ML Engineers and Data Science professionals in the medium or long run.

AutoML Impact on ML and Data Science Industry

“I’ll bring you to Venice” (painting by Alina Ciuciu, 2021)

Despite the original intention of AutoML to serve the needs of non-Machine-Learning professionals, it will affect (and is already affecting) the experts in the field as well.

Some gurus on the pessimistic side predict that more automation in the field will shift the focus away from programming skills in favour of competence in statistics and research methods. Conversely, the alternative viewpoint holds that AutoML will never beat a skilled ML professional with expertise in the appropriate technologies and domain knowledge of the data he or she works with.

I tend to favour the latter point of view. I therefore use AutoML instruments for what they are (auxiliary tools), with no expectation of a magic wand that solves any ML/DL problem for me. I also make a continual effort to fill my own toolbox with additional knowledge.

As shown in the sections above, the use cases where you can leverage AutoML tools include:

  • Utilizing Rapid EDA tools (AutoViz or similar) to speed up getting the insightful visualization of the basic statistics and variable interactions/associations in your datasets
  • Automating Feature Engineering / Feature Selection with featurewiz
  • Automating the ML Model Hyperparameter Tuning with tools like hyperopt or optuna

Obviously, junior Data Scientists / ML Engineers still must learn what is under the hood of the methods automated by the AutoML tools above. In particular, this means doing everything mentioned above manually a couple of times, using the standard tools offered by pandas, scikit-learn, and other lower-level Python libraries for ML and Data Science. That way, you will actually comprehend what is going on behind the AutoML magic, and control which tools you use to tackle a specific problem.

After that, you can fearlessly leverage AutoML capabilities to save some time on routine operations for the sake of tackling more complex or unstructured problem solving.

However, regardless of the involvement of model and pipeline automation tools beyond model training, the ML field is going through an explainability crisis now. That is where the real opportunity for ML professionals/Data Scientists lies.

If you can establish yourself as a data professional who understands the datasets you work with (data cleaning, identifying data leaks, etc.) and creates models that are explainable (and/or statistically sound), you will find yourself on the right track. Creating explainable ML models, however, embraces more than just using the 'explainable' ML algorithms alone. Nowadays it is also about using analytical feature importance methods or tools like SHAP that can bring explainability to any modern model trained with 'non-explainable' yet powerful algorithms (like modern neural networks, GBDT variations, etc.).

References

If you would like to delve deeper into AutoML technologies, you are welcome to continue your research with the resources below.

You can refer to the public datasets of the respective Kaggle Tabular Playground competitions of Jan–Mar 2021 via the links below.

You can find my repositories with the code for the various experiments on the datasets of these competitions in the GitHub repos below.

Cover photo by Liza Summer (https://www.pexels.com/photo/woman-writing-at-table-with-laptop-and-ring-lamp-6348142/)

As the section title images, the paintings by Alina Ciuciu (https://artedialina.com/) have been used.


George Vyshnya
SBC Group Blog

Seasoned Data Scientist / Software Developer with blended experience in software development, IT, DevOps, PM and C-level roles. CTO at http://sbc-group.pl