Does AutoML work better than my manually developed model?

SRI VENKATA SATYA AKHIL MALLADI
6 min read · Dec 18, 2022


Automated machine learning (AutoML) automates the end-to-end process of applying machine learning to a problem, including feature selection, model selection, and hyperparameter optimization.

Does this mean AutoML will save us hours of feature selection, hyperparameter tuning, and searching for the best model?

To understand this better, I will take a dataset and analyze it both ways (with AutoML and with my own model) to see which one works better and why.

For this example, I’m taking the following data set.

It is a binary classification dataset, where the goal is to predict whether a client will subscribe to a term deposit. It has 21 columns: y is the output column to be predicted, and the remaining 20 are input variables (possible features), 10 of them categorical and 10 numerical.
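As a quick sanity check, the dataset can be loaded and inspected along these lines (a minimal sketch; the file name and the ';' separator are assumptions based on common CSV versions of this dataset):

import pandas as pd

df = pd.read_csv("bank-additional-full.csv", sep=";")  # file name is an assumption
print(df.shape)                  # expect 21 columns
print(df["y"].value_counts())    # class balance of the target
print(df.dtypes.value_counts())  # categorical vs. numerical split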

ML Data Cleaning and Feature Selection

First, I cleaned the data: removed unwanted columns, handled missing values, dropped duplicates, and managed unwanted outliers where present.
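A minimal sketch of that cleaning pass (the 1.5×IQR outlier rule here is an illustrative assumption, not necessarily the exact rule I used):

df = df.drop_duplicates().dropna()  # drop duplicates and rows with missing values
for col in df.select_dtypes(include="number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # cap outliers at the IQR fences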

Next, I transformed the categorical variables with LabelEncoder. I tried dummy variables as well, but they created too many columns and made the model a bit complex, so I found label encoding to be the better choice in this example.
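The encoding step then looks roughly like this (a sketch, assuming df from above; pd.get_dummies(df) would be the one-hot alternative that blew up the column count):

from sklearn.preprocessing import LabelEncoder

for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])  # map each category to an integer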

Feature selection: I chose two methods to analyze the variables present in the dataset. They are:

1. Pearson Correlation

2. Select K best

With the help of the above methods, and by examining the histograms and distributions of each variable, I dropped a few variables and chose the rest as my features.
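Both checks take only a few lines with pandas and scikit-learn (a sketch, assuming the encoded df from above; k=10 is an illustrative choice):

from sklearn.feature_selection import SelectKBest, f_classif

X, y = df.drop(columns="y"), df["y"]
print(X.corrwith(y).abs().sort_values(ascending=False))       # 1. Pearson correlation with the target
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)  # 2. top-k features by ANOVA F-score
print(X.columns[selector.get_support()].tolist())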

Modeling and Interpretability

Now I have to choose the right model for my dataset. For this step, I wrote a function that runs the modeling steps for me on the following models and returns the accuracy, mean squared error, R² score, and classification report for each model (a sketch of such a helper follows the list).

1. Logistic Regression

2. Random Forest

3. Decision Tree

4. KNN

5. SVM

6. LightGBM

7. XGBoost

8. Gradient Boosting
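A minimal sketch of that helper, shown here with a few of the eight models (the 80/20 split and random_state are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

def evaluate(models):
    # Fit each model and print the metrics used for the comparison.
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        print(name, accuracy_score(y_test, y_pred),
              mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))
        print(classification_report(y_test, y_pred))

evaluate({"LogisticRegression": LogisticRegression(max_iter=1000),
          "RandomForest": RandomForestClassifier(random_state=100),
          "GradientBoosting": GradientBoostingClassifier(random_state=100),
          "LightGBM": LGBMClassifier(random_state=100)})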

I found that LightGBM is the best of all, with high accuracy, precision, and F1-score, followed closely by Gradient Boosting and XGBoost.

But I will mainly consider the confusion matrix for evaluating the models.

Error (No) / false positive rate: 487/3645 = 0.1336

Error (Yes) / false negative rate: 0.62
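Both rates come straight out of the confusion matrix; a sketch of the computation (y_test and y_pred as in the helper above):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Error(No): ", fp / (fp + tn))   # false positives among actual "no"
print("Error(Yes):", fn / (fn + tp))   # false negatives among actual "yes"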

Hyperparameter tuning:

After tuning the hyperparameters as below, we can see a significant improvement in the Confusion Matrix.

LGBMClassifier(max_depth=20, min_child_samples=40, n_estimators=200, random_state=100, verbose=0)

Error (No) / false positive rate: 120/3645 = 0.0329

Error (Yes) / false negative rate: 227/473 ≈ 0.48

AutoML

Now I will use the initially cleaned data (without missing values and outliers), but I won't remove any variables/columns. While AutoML can do feature selection itself, it doesn't always clean the data: in general, AutoML focuses on automating the selection and training of machine learning models rather than on preprocessing the data.

Here I will be using the H2O AutoML framework. H2O is an open-source, distributed, in-memory machine learning platform developed by H2O.ai. It supports both R and Python, and it covers the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more. H2O includes an Automated Machine Learning module and uses its own algorithms to create pipelines, applying an exhaustive search over feature engineering methods and model hyperparameters to optimize them.
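A minimal sketch of running H2O AutoML on this dataset (max_models, the split ratio, and the file name are assumptions, not the exact settings of the run below):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
hf = h2o.import_file("bank-additional-full.csv")  # file name is an assumption
hf["y"] = hf["y"].asfactor()                      # treat the target as categorical
train, test = hf.split_frame(ratios=[0.8], seed=100)
aml = H2OAutoML(max_models=20, seed=100)
aml.train(x=[c for c in hf.columns if c != "y"], y="y", training_frame=train)
print(aml.leaderboard.head())                     # ranked candidate models
print(aml.leader.model_performance(test).confusion_matrix())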

AutoML chose the model below for this dataset.

H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: GBM_1_AutoML_6_20221217_214255

This model created 70 trees; the evaluation metrics are shown below.

Evaluation

From the confusion matrices above, we can observe that:

while the false positive rate of AutoML (0.0655) is higher than that of my LightGBM model (0.0329), the false negative rate is worse for the LightGBM model (0.48) than for AutoML (0.27).

Considering the false negatives, AutoML performed better than the model I developed manually. This could be for two reasons:

  1. Not tuning the right hyperparameters for my model. The hyperparameters I set did work well, but I could have chosen better ones (and while I also used Gradient Boosting in my manual comparison, I didn't set any parameters for it). With the right set of parameters, my manual model might well have beaten AutoML's.
  2. Better feature selection. AutoML selected a set of features that work exceptionally well together.

With the right hyperparameters, the false negatives of my model could have been lower as well.

SHAP Analysis

For even better understanding and interpretability of the above models, let's run SHAP analysis on them. We first have to install and import shap.
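For the manual model, Tree SHAP takes a few lines (a sketch; lgbm_model stands for the tuned LightGBM classifier and X_test for the held-out features from earlier):

import shap

explainer = shap.TreeExplainer(lgbm_model)   # Tree SHAP for the LightGBM model
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)       # global feature-impact summary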

SHAP analysis of the manual model:

Important features according to LightGBM

Duration, emp.var.rate, euribor3m, and campaign are the most impactful features.

From the above, we can see that duration is highly significant, and emp.var.rate and euribor3m are significant as well. A low campaign value can also have a significant impact.

SHAP analysis of the AutoML model:

Important features according to AutoML.

Duration, nr.employed, euribor3m, job, and month are the most impactful features.
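The view above can be reproduced with H2O's built-in SHAP summary for tree-based models (a sketch, assuming aml and test from the AutoML sketch earlier):

aml.leader.shap_summary_plot(test)   # works for tree models such as the GBM leader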

SHAP differences:

I dropped month and cons.conf.idx because neither Pearson correlation nor SelectKBest showed them as important, yet according to AutoML they are impactful.

Conclusion

While AutoML performed better in this example at reducing the false negative error, this need not generally be the case. However, it is important to note that AutoML is not a substitute for a thorough understanding of machine learning principles and techniques. Rather, it can simplify the process for us, for example by suggesting which model to choose or which features are the most important.

Code:

References:

