Understanding the MLJAR AutoML framework

Experimenting with MLJAR on a dummy dataset

Published in

Data Science in your pocket

11 min read · Apr 2, 2022

AutoML is coming up fast. As I had never tried any AutoML framework, I decided to give it a shot using MLJAR (as recommended in Kaggle discussions). After all the experimentation on my end, I have just one thing to say if you haven’t tried any AutoML framework yet:

You are missing out on a big piece!! This is so effortless.

It's time to see what MLJAR has to offer by working through a binary classification problem. Before moving ahead, here are a few key features one should know about MLJAR:

Works only with Tabular data
Automatic understanding of your data
Has 4 modes: Explain, Compete, Perform & Optuna, to be picked depending on user requirements & the resources available. We will discuss & experiment with these modes later in the post
Tries multiple models & creates detailed reports around each model tested to choose from.

To know all the cool features MLJAR has to offer, do read the docs here

Sample data

Here, our target variable is income, which is a binary class: ≤50k or >50k. The dataset has ~32k entries with 14 features & is split in an 80:20 fashion for training & testing. The features are a mix of numeric & categorical (both ordinal & nominal) columns.

Note: Any algorithm & step chosen by AutoML in any of the modes discussed below is completely based on the dataset considered. These can change when a different dataset comes in.

Let’s import the required libraries, load this data & split our dataset for training & testing purposes.

import time

import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score as acs_score
from supervised.automl import AutoML

df = pd.read_csv('dummy.csv')
# strip stray whitespace from the column names
df.columns = [x.strip() for x in df.columns]
# separate the target column from the features
label = df.pop('income')
x_train, x_test, y_train, y_test = tts(df, label, test_size=0.2)

Moving on to the crux, i.e. the 4 different AutoML modes

Explain

Now, it's time to create an AutoML object with mode=’Explain’. As the name suggests, this mode is used to get:

a quick first-hand feel for the dataset.

automl = AutoML(mode='Explain')
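Creating the object doesn't train anything yet; fitting & scoring still have to be run. Below is a minimal sketch of that step, assuming the x_train/x_test split from earlier. The helper name train_and_score is my own & not part of MLJAR, and the imports sit inside the function so that the block loads even where mljar-supervised isn't installed.

```python
import time


def train_and_score(x_train, y_train, x_test, y_test, mode="Explain"):
    """Fit MLJAR AutoML in the given mode; return (test accuracy, seconds taken)."""
    # Imported lazily: both are heavy third-party dependencies.
    from sklearn.metrics import accuracy_score
    from supervised.automl import AutoML

    automl = AutoML(mode=mode)

    start = time.time()
    automl.fit(x_train, y_train)  # the training logs are printed during this call
    elapsed = time.time() - start

    predictions = automl.predict(x_test)
    return accuracy_score(y_test, predictions), elapsed
```

Calling train_and_score(x_train, y_train, x_test, y_test) returns the test accuracy along with the training time, the two numbers compared across modes in this post.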

So what does it do first up? The logs give away some ideas

Decided whether the problem is Regression or Classification
Created a directory AutoML_1 (the suffix _1 counts the AutoML runs so far; if we train AutoML again on a new dataset, the suffix becomes _2 & so on)
Chose logloss as the binary classification metric for evaluation/comparison between different models
Selected models to test out

['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']

What is Baseline?

It's the simplest possible prediction that can be made. This can be done in many ways like
Randomly assigning classes
Assigning majority class to all samples
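To make the Baseline idea & the logloss metric concrete, here is a small pure-Python sketch of a majority-class baseline. The toy labels are made up for illustration, and this is not how MLJAR implements its Baseline internally; it only shows the concept.

```python
import math
from collections import Counter


def majority_baseline(y_train):
    """Return the majority class and the class frequencies seen in training."""
    counts = Counter(y_train)
    priors = {label: n / len(y_train) for label, n in counts.items()}
    majority = counts.most_common(1)[0][0]
    return majority, priors


def log_loss(y_true, prob_positive, positive_label, eps=1e-15):
    """Binary log loss, given the predicted probability of the positive class."""
    total = 0.0
    for label, p in zip(y_true, prob_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if label == positive_label else -math.log(1 - p)
    return total / len(y_true)


# Toy labels (hypothetical, not the real dataset)
y_train = ["<=50K"] * 6 + [">50K"] * 2
y_test = ["<=50K", "<=50K", ">50K", "<=50K"]

majority, priors = majority_baseline(y_train)

# Hard predictions: every sample gets the majority class
baseline_preds = [majority] * len(y_test)
accuracy = sum(p == t for p, t in zip(baseline_preds, y_test)) / len(y_test)

# Probabilistic predictions: every sample gets the training prior of ">50K"
baseline_logloss = log_loss(y_test, [priors[">50K"]] * len(y_test), ">50K")
```

Any real model has to beat these two numbers to be worth keeping, which is why a baseline is trained first.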

Next up, it follows a 3 step procedure

simple_algorithms

As the name suggests, it picks the simplest of the algorithms to have a go at the dataset: Baseline & Decision Tree. The logs suggest only 2 models, one of each kind, are trained

default_algorithms

Going a bit more complex, it picks up the other algorithms mentioned above with default parameters. The logs suggest only 3 models, one each from XGBoost, Neural Network & Random Forest, are trained

ensemble

In this step, all previously trained models are weighted & used to create a single ensemble for the final prediction

AutoML reported that the XGBoost model from the default_algorithms step is the best (last line in the logs). On testing, accuracy on the test data is 86%!!

And this is without touching the data at all, in a little over 3 minutes.

Let’s see what is in the directory this training created. It is a big directory with numerous files

I will try summarizing what each of these files is about in a one-liner

Train & validation sets created internally as NumPy arrays (.npy files)
A leaderboard CSV for comparing the final results of all models tested.
data_info.json with basic info on the data like number of samples, different columns, type of target (numeric or categorical)
progress.json comprising the different preprocessing steps & hyperparameters used for each model’s training
params.json appears to be more of a technical report of the entire process at a coarse level. It comprises info like the different models tried, the validation technique followed, the final model chosen, the mode chosen, etc.
A very sassy readme summarizing the entire experiment alongside feature importance & a correlation heatmap.
A folder for every model trained/baseline used, including the ensemble model built in step 3. Each comprises files around multiple statistics & metrics. This set of metrics/stats can be different for different models. Discussing each one of them is a bit out of scope for now

Compete

If you are on Kaggle or are curious about even 0.01% improvements, this can be of great use to you. The idea is simple:

get the best model

Though, the training time is very high!!

This time the logs are a bit longer, hence we will handle them in parts. Also, a few steps remain common, i.e. understanding it to be a classification problem and choosing an apt metric (logloss again). Things that change are

Models tried are more

['Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']

Different steps followed are

['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']

The validation strategy also becomes a bit complex

Let’s understand the different steps followed this time, skipping the few already discussed.

Adjust validation: Chooses an apt validation strategy. Tested over one decision tree for the given problem

Simple algorithms (modeling): Same as in ‘Explain’ mode. 3 decision trees tested
Default algorithms (modeling): Same as in ‘Explain’ mode. All ML algorithms mentioned above were tested (one model each) except decision trees.
not_so_random (hyperparameter related): Randomly chosen hyperparameter sets are tested. 61 different hyperparameter-model combinations were tested in our case
mix_encoding (preprocessing step): This encoding uses label encoding for categorical features with more than 25 categories, and one-hot binary encoding for categoricals with fewer than 25 categories. It is also tested over just one model
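A rough pure-Python sketch of the mix_encoding rule described above. The function name mix_encode, the toy rows and the handling of a feature with exactly 25 categories (one-hot here) are my assumptions, not MLJAR's actual code.

```python
def mix_encode(rows, column, threshold=25):
    """Label-encode `column` if it has more than `threshold` categories,
    otherwise one-hot encode it. `rows` is a list of dicts."""
    categories = sorted({row[column] for row in rows})
    use_label_encoding = len(categories) > threshold
    label_of = {category: index for index, category in enumerate(categories)}

    encoded_rows = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        if use_label_encoding:
            # one integer column for high-cardinality features
            new_row[column] = label_of[row[column]]
        else:
            # one 0/1 column per category for low-cardinality features
            for category in categories:
                new_row[f"{column}={category}"] = int(row[column] == category)
        encoded_rows.append(new_row)
    return encoded_rows


# Toy rows; threshold=2 is used only so that both branches show up
rows = [
    {"sex": "Male", "country": "India"},
    {"sex": "Female", "country": "Mexico"},
    {"sex": "Male", "country": "Cuba"},
]
encoded = mix_encode(mix_encode(rows, "sex", threshold=2), "country", threshold=2)
```

Here sex (2 categories) gets one-hot columns while country (3 categories, above the toy threshold) becomes a single integer column.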

golden features (feature engineering): Generating new features by applying basic operations like feature_1+feature_2, feature_1-feature_2, etc. About 10 such features were created by MLJAR in our case
KMeans features (feature engineering): This is again very interesting as it adds features related to the clustering of the data. This may include
*Distance of a sample from all cluster centroids
*Sample’s cluster label
insert_random_feature (feature engineering): This step adds a random feature (drawn from a uniform distribution between 0 & 1) to the dataset.
feature_selection: To drop any irrelevant feature, any existing feature with feature importance lower than that of the random feature added in the above step is dropped
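The golden features idea from the list above can be sketched in a few lines. This only generates the candidate columns using the + and - operations mentioned; how MLJAR scores the candidates & picks the final ones is not covered here, and the column names are hypothetical.

```python
from itertools import combinations


def golden_feature_candidates(columns):
    """Build pairwise sum and difference features from numeric columns.
    `columns` maps a column name to its list of values."""
    candidates = {}
    for a, b in combinations(sorted(columns), 2):
        candidates[f"{a}_plus_{b}"] = [x + y for x, y in zip(columns[a], columns[b])]
        candidates[f"{a}_minus_{b}"] = [x - y for x, y in zip(columns[a], columns[b])]
    return candidates


# Toy numeric columns (hypothetical values)
columns = {
    "capital_gain": [0, 5000, 0, 12000],
    "capital_loss": [0, 0, 1500, 0],
    "hours": [40, 50, 38, 60],
}
candidates = golden_feature_candidates(columns)
```

With 3 columns there are 3 pairs & hence 6 candidates; a real run would keep only the few that help the model the most.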

Note: Both Kmeans features & insert_random_feature were skipped in our case due to time restrictions (default). As no random feature was added, feature_selection is also skipped
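Even though these steps were skipped in our run, the random-feature trick is easy to sketch. In the toy version below, absolute correlation with the target stands in for feature importance. That is my simplification; MLJAR uses the importance reported by a trained model. All names & data here are hypothetical.

```python
import math
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)


def select_features(columns, target, seed=42):
    """Keep only features that score higher than a uniform random feature."""
    rng = random.Random(seed)
    random_feature = [rng.random() for _ in target]  # uniform in [0, 1)
    threshold = abs_correlation(random_feature, target)
    importances = {name: abs_correlation(values, target) for name, values in columns.items()}
    kept = [name for name, score in importances.items() if score > threshold]
    return kept, threshold


# Toy data: one perfectly informative column and one useless constant column
target = [0, 1] * 10
columns = {
    "informative": [2.0 * t + 1.0 for t in target],
    "constant": [5.0] * 20,
}
kept, threshold = select_features(columns, target)
```

The constant column can never beat the random feature, so it is dropped, while the informative one survives.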

hill-climbing steps (hyperparameter related): These are mainly about tuning hyperparameters. Tuning is done in multiple steps, hence hill_climbing_1 & hill_climbing_2. For both the steps, 28 models were tested. For a more detailed discussion on hill climbing: https://www.javatpoint.com/hill-climbing-algorithm-in-ai
boost_on_errors (modeling): Similar to the boosting technique, where the next model is improved using past errors. This was also skipped due to the default time limit
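For intuition, here is a generic hill-climbing sketch over two hyperparameters. The scoring function toy_logloss is an invented surface standing in for "train a model & measure validation logloss", and the neighbour rule is my own choice, so treat it as an illustration of the search idea rather than MLJAR's exact procedure.

```python
def hill_climb(score, start, neighbours, max_steps=50):
    """Move to the best-scoring neighbour until no neighbour improves (lower is better)."""
    current, current_score = start, score(start)
    for _ in range(max_steps):
        best_candidate, best_score = None, current_score
        for candidate in neighbours(current):
            candidate_score = score(candidate)
            if candidate_score < best_score:
                best_candidate, best_score = candidate, candidate_score
        if best_candidate is None:  # local optimum reached
            break
        current, current_score = best_candidate, best_score
    return current, current_score


def toy_logloss(params):
    """Hypothetical validation logloss, lowest at max_depth=6, learning_rate=0.1."""
    return 0.27 + 0.01 * (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2


def neighbours(params):
    """Small perturbations: depth +/- 1, learning rate halved or doubled."""
    for delta in (-1, 1):
        depth = params["max_depth"] + delta
        if depth >= 1:
            yield {**params, "max_depth": depth}
    for factor in (0.5, 2.0):
        yield {**params, "learning_rate": params["learning_rate"] * factor}


best_params, best_score = hill_climb(
    toy_logloss, {"max_depth": 3, "learning_rate": 0.4}, neighbours
)
```

Starting from a poor setting, the search walks step by step to the best point on this toy surface & stops once every neighbour is worse.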

So, we are now left with stack, ensemble & stack ensemble. These are some crucial concepts that require special attention.

Ensemble: ensemble learning revolves around using multiple models rather than a single model to get some prediction. The ensemble can work in multiple ways. The way it is done in MLJAR is shown below

Start with the empty ensemble (no model).
Add to the ensemble the pretrained model from previous steps that maximizes the ensemble’s performance (improves overall prediction metrics).
Repeat the above step for a fixed number of iterations or until all the models have been used.
Of all the ensembles built along the way, return the one with the best performance metric.

What does this mean? Assume you have an ensemble consisting of [A, B] & you are left with C, D for testing, where A, B, C, and D are all pretrained models. Now, C will be added to the ensemble if the averaged-out results of [A, B, C] are better than those of the existing ensemble [A, B]. Likewise for D
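The greedy procedure above can be sketched as follows. The validation predictions for A, B & C are made up, each model is used at most once (my reading of the steps above), and plain averaging is assumed for combining members.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Binary log loss for 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(y_true)


def average(prediction_lists):
    """Element-wise mean of several prediction lists."""
    return [sum(column) / len(column) for column in zip(*prediction_lists)]


def greedy_ensemble(predictions, y_true):
    """Add one model per step (the one giving the lowest logloss); return the best ensemble seen."""
    remaining = dict(predictions)
    members, best_members, best_score = [], [], float("inf")
    while remaining:
        scores = {
            name: log_loss(y_true, average([predictions[m] for m in members] + [preds]))
            for name, preds in remaining.items()
        }
        winner = min(scores, key=scores.get)
        members.append(winner)
        del remaining[winner]
        if scores[winner] < best_score:
            best_members, best_score = list(members), scores[winner]
    return best_members, best_score


# Hypothetical validation predictions from three pretrained models
y_valid = [1, 0, 1, 0]
predictions = {
    "A": [0.9, 0.2, 0.6, 0.4],
    "B": [0.6, 0.4, 0.9, 0.1],
    "C": [0.5, 0.5, 0.5, 0.5],
}
best_members, best_score = greedy_ensemble(predictions, y_valid)
```

B is picked first, adding A improves the logloss, and adding the uninformative C makes it worse, so the returned ensemble is [B, A].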

Stacking: It is a type of ensemble modeling that has 2 levels of models

Level-0 (base models) : Models fitted on training data independently
Level-1 (meta model): A model trained to learn how to combine level-0 models (learning weights for each level-0 model) for the best predictions. This meta learner takes as input the out-of-fold predictions of the base models & learns to get to the target label. Hence: Feature data → Output_base → Output_meta, which becomes our predicted target label

So, we have models to learn how to combine different models !!
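A deliberately tiny sketch of the two levels: the "meta model" here is just a single blend weight chosen by grid search on the base models' out-of-fold predictions (explained in the next paragraph). Real meta learners are full models, so take this as a simplified stand-in with made-up numbers.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Binary log loss for 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(y_true)


def blend(weight, preds_a, preds_b):
    """Level-1 prediction: weighted mix of two level-0 outputs."""
    return [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]


def fit_meta_weight(y_true, oof_a, oof_b, steps=10):
    """'Train' the meta model: pick the blend weight with the lowest logloss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: log_loss(y_true, blend(w, oof_a, oof_b)))


# Hypothetical out-of-fold predictions of two level-0 models
y_train = [1, 0, 1, 0]
oof_a = [0.9, 0.2, 0.6, 0.4]
oof_b = [0.6, 0.4, 0.9, 0.1]

weight = fit_meta_weight(y_train, oof_a, oof_b)
meta_loss = log_loss(y_train, blend(weight, oof_a, oof_b))
```

Since weights 0 & 1 are among the candidates, the blended logloss can never be worse than that of the better base model on this data.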

By the way, what is out-of-fold prediction?

Assuming you know k-fold cross-validation, an out-of-fold prediction is the prediction a model makes on the holdout set of each of the k folds during training. So, we get one out-of-fold prediction for each sample in the training data. How? In k-fold cross-validation, we divide the data into k sets & in each iteration, we pick k-1 sets for training & the remaining set, called the holdout set, for testing. Hence, every sample lands in the holdout set exactly once.
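Here is a small pure-Python sketch of collecting out-of-fold predictions. The "model" is a toy nearest-class-mean classifier on a single feature, and the data is invented; only the fold bookkeeping matters here.

```python
def fit_centroids(xs, ys):
    """Toy model: remember the mean feature value of each class."""
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    return sum(positives) / len(positives), sum(negatives) / len(negatives)


def predict_centroids(model, xs):
    """Predict the class whose mean is closer."""
    positive_mean, negative_mean = model
    return [1 if abs(x - positive_mean) < abs(x - negative_mean) else 0 for x in xs]


def out_of_fold_predictions(xs, ys, k=3):
    """Each sample is predicted by the model that did NOT see it during training."""
    oof = [None] * len(xs)
    for fold in range(k):
        holdout = [i for i in range(len(xs)) if i % k == fold]
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        fold_preds = predict_centroids(model, [xs[i] for i in holdout])
        for i, prediction in zip(holdout, fold_preds):
            oof[i] = prediction
    return oof


# Toy data: weekly hours vs. a 0/1 income label (hypothetical)
hours = [20, 60, 25, 55, 30, 65]
income = [0, 1, 0, 1, 0, 1]
oof = out_of_fold_predictions(hours, income, k=3)
```

After the loop, every position in oof is filled, and none of those predictions came from a model that saw the sample during training.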

Stack ensemble: As conveyed to me by the author of MLJAR (I am serious!!), it's an ensemble built from models trained on the original data and models trained on stacked data (original + OOF predictions). I actually had a convo with pplonski (author, MLJAR) over this

difference between stack, ensemble & stack ensemble steps in MLJAR

stackoverflow.com

As we are done with understanding Compete, let’s observe its performance

The total time taken is ~1 hour, with logloss = 0.276 & accuracy on the validation set = 88%. So, an improvement of ~2% over Explain mode!!

The directory created by Compete mode has the same features as we had in Explain mode i.e. a folder for every model tested & similar internal structure as in models trained in Explain mode.

continuing with remaining modes…

Perform

As the documentation boasts, it's best suited when you need an

urgent deployment & decent results (& not the best)

So, it's more like a midway solution between Explain (fast but ordinary results) & Compete (Great results but dead slow). As we have already discussed at length the logs for the previous 2 modes, we shall observe only the changes we might observe in Perform mode

The models tested are

['Random Forest', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network']

The steps followed are

['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']

As we can see, the models tested are definitely more than in Explain mode but fewer than in Compete mode. Similarly, the steps followed are more than in Explain mode but fewer than in Compete mode.

Below are the logs for Perform mode (if anyone is interested)

One surprising thing to note from the logs (image 4)

The time taken to train in Perform mode is roughly the same as in Compete mode (actually a few seconds more!!)
The final metric (logloss = 0.278) is worse than in Compete mode by a mere ~0.002, with accuracy = 87.7%

Hence, at least for this dataset, both Perform & Compete modes perform similarly without any significant difference. Though, as the dataset becomes more complex, we might see significant differences.

Optuna

Available as an independent library as well, this optimization framework has become more powerful than ever with integration with autoML

Algorithms tested

['Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network']

Steps involved

['simple_algorithms', 'default_algorithms', 'ensemble', 'stack', 'ensemble_stacked']

Now, the slight change in Optuna mode is that the hyperparameters of every model trained at any step are tuned for a duration set by the optuna_time_budget parameter

automl4 = AutoML(mode="Optuna", optuna_time_budget=120)

Here, optuna_time_budget is the time in seconds for which each model at every step is tuned. A larger optuna_time_budget generally gives better results.

Note: optuna_time_budget=120 is just to complete the experiment. When actually training, do use a value higher than 120 for better results.

As the logs include output from both AutoML & Optuna, I will be sharing only excerpts from them for an understanding:

A few points to note:

Optuna mode took ~1.5 hrs for optuna_time_budget=120. This will increase if the optuna_time_budget parameter is increased
logloss = 0.275, the lowest amongst all the modes, but accuracy is around 86.6% (~1.5% lower than Compete)

Comparing modes

Though the logloss & accuracy are approximately the same for all the modes, it's the time taken that makes Explain mode look so good compared to the others. Still, there are cases where we fight for even a 0.01% improvement in metrics, & there modes like Compete or Perform become a necessity. Also, we mustn’t forget that, given some more time, Optuna can do wonders & we may see the metrics shoot up.

Final words

In all, MLJAR is a super easy & powerful framework for AutoML. The performance speaks for itself given the fact that we didn’t analyze a single thing in the dataset, yet achieved ~86% accuracy or better in every mode. If a few steps appear unnecessary to you in a given mode, they can be customized easily, making it flexible to user needs. One must definitely give it a shot.
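As a hedged illustration of that customization, the sketch below switches off a few Compete steps & narrows the algorithm list through AutoML's constructor arguments. The specific values are arbitrary choices of mine, the helper build_automl is hypothetical, and the import is kept inside it so the block loads without mljar-supervised installed.

```python
# Keyword arguments for a trimmed-down Compete run (values are illustrative only)
custom_compete = {
    "mode": "Compete",
    "total_time_limit": 1800,  # seconds allowed for the whole run
    "algorithms": ["Xgboost", "LightGBM", "CatBoost"],
    "golden_features": False,  # skip the golden_features step
    "kmeans_features": False,  # skip the kmeans_features step
    "boost_on_errors": False,  # skip the boost_on_errors step
}


def build_automl(**overrides):
    """Create an AutoML object from the settings above, with optional overrides."""
    from supervised.automl import AutoML  # imported lazily; third-party dependency

    return AutoML(**{**custom_compete, **overrides})
```

For example, build_automl(total_time_limit=600) would keep the same trimmed steps with a shorter time limit.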