Understanding the MLJAR AutoML framework
Experimenting with MLJAR on a dummy dataset
AutoML is coming up fast. As I had never tried any AutoML framework, I decided to give it a shot using MLJAR (as recommended in Kaggle discussions). After all the experimentation on my end, I have just one thing to say if you haven’t tried any AutoML framework yet:
You are missing out on a big piece!! This is so effortless.
It's time to see what MLJAR has to offer by working through a binary classification problem. Before moving ahead, here are a few key features that one should know about MLJAR beforehand:
Works only with Tabular data
Automatic understanding of your data
Has 4 modes: Explain, Compete, Perform & Optuna, to be chosen depending on user requirements & resources available. We will discuss & experiment with these modes later in the post
Tries multiple models & creates detailed reports around each model tested to choose from.
To know all the cool features MLJAR has to offer, do read the docs here
Sample data
Here, our target variable is income, which is a binary class: ≤50k or >50k. The dataset has ~32k entries & 14 features, & is split 80:20 for training & validation. The features include numeric columns alongside categorical ones (both ordinal & nominal).
Note: Any algorithm & step chosen by AutoML in any of the modes discussed below is based entirely on the dataset considered. These can change when a different dataset comes in.
Let’s import important libraries, load this data & split our dataset for training & testing purposes.
import pandas as pd
import time
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score as acs_score
from supervised.automl import AutoML

df = pd.read_csv('dummy.csv')
df.columns = [x.strip() for x in df.columns]  # strip stray whitespace from column names
label = df.pop('income')
x_train,x_test,y_train,y_test = tts(df,label,test_size=0.2)
Moving on to the crux, i.e. the 4 different AutoML modes
Explain
Now, it's time to create an AutoML object with mode='Explain'. As the name suggests, this mode is used to get:
a quick first-hand feel for the dataset.
automl = AutoML(mode='Explain')
automl.fit(x_train, y_train)  # training prints the logs discussed below
So what does it do first up? The logs give away some ideas:
- Decided whether the problem is Regression or Classification
- Created a directory AutoML_1 (the suffix _1 counts the AutoML objects trained so far; if we train AutoML again on a new dataset, the suffix would be _2 & so on)
- Chose logloss as a binary classification metric for evaluation/comparison between different models
- Selected models to test out
['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
What is Baseline?
It's the simplest possible prediction that can be made. This can be done in many ways, like:
Randomly assigning classes
Assigning majority class to all samples
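To make the second option concrete, here is a minimal sketch of a majority-class baseline in plain Python. It only illustrates the idea and is not MLJAR's own Baseline implementation; the function name and the toy labels are made up for this example.

```python
from collections import Counter

def majority_class_baseline(y_train):
    """Return a predictor that always outputs the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return lambda samples: [majority_label for _ in samples]

# Toy labels, made up for illustration (not the dataset used in this post)
y_toy_train = ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"]
y_toy_test = ["<=50K", ">50K", "<=50K", "<=50K"]

predict = majority_class_baseline(y_toy_train)
baseline_preds = predict(y_toy_test)

# Share of test samples that happen to belong to the majority class
baseline_accuracy = sum(p == t for p, t in zip(baseline_preds, y_toy_test)) / len(y_toy_test)
print(baseline_accuracy)  # 0.75
```

Any real model should beat this number, otherwise it has learned nothing useful from the features.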
Next up, it follows a 3 step procedure
simple_algorithms
As the name suggests, it picks the simplest algorithms to have a go at the dataset: Baseline & Decision Tree. The logs show that only 2 models, one of each kind, are trained
default_algorithms
Going a bit more complex, it picks the other algorithms mentioned above with default parameters. The logs show that only 3 models, one each from XGBoost, Neural Network & Random Forest, are trained
ensemble
In this step, all previously trained models are weighted & used to create a single ensemble for the final prediction
AutoML reported (last line in the logs) that the XGBoost model tested in the default_algorithms step is the best. On testing, accuracy on the test data is 86%!!
This is without touching the data at all, & in a little over 3 minutes.
Let’s see what is in the directory this training created. It is a big directory with numerous files
I will try summarizing what these files are all about in a one-liner each
- Train & validation set created internally as NumPy array (.npy files)
- A leaderboard CSV for comparing the final results of all models tested.
- data_info.json with basic info on data like number of samples, different columns, type of target (numeric or categorical)
- progress.json comprising different preprocessing & hyperparameters used for each model’s training
- params.json appears to be more of a technical report of the entire process at a coarse level. It comprises info like the different models tried, the validation technique followed, the final model chosen, the mode chosen, etc.
- A very sassy readme summarizing the entire experiment alongside feature_importance & correlation heatmap.
- A folder each for every model trained/baseline used including the ensemble model used in step 3. This comprises files around multiple statistics & metrics. This set of metrics/stats can be different for different models. Discussing each one of them is a bit out of scope for now
Compete
If you are on Kaggle or are curious about even 0.01% improvements, this can be of great use to you. The ideology is simple:
get the best model
Though, the training time is very high!!
This time the logs are a bit longer, so we will handle them in parts. Also, a few steps remain common, i.e. identifying it as a classification problem & choosing an apt metric (logloss again). Things that change are
- Models tried are more
['Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
- Different steps followed are
['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
- The validation strategy also becomes a bit complex
Let’s understand the different steps followed this time, skipping the ones already discussed. As the logs are pretty big, we will go through them in parts
Adjust validation: Chooses an apt validation strategy for the given problem. It was tested with one decision tree
Simple algorithms (modeling): Same as in ‘Explain’ mode. 3 decision trees were tested
Default algorithms (modeling): Same as in ‘Explain’ mode. All ML algorithms mentioned above were tested (one model each) except decision trees.
not_so_random (hyperparameter related): Randomly chosen hyperparameter values are tested. 61 different hyperparameter-model combinations were tested in our case
mix_encoding (preprocessing step): This encoding uses label encoding for categorical features with more than 25 categories, & one-hot binary encoding for the remaining categorical features. It is also tested over just one model
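Here is a rough sketch of that rule in plain Python, just to show the switch between the two encodings. The function name mix_encode and the sample values are hypothetical; only the 25-category threshold comes from the description above, and MLJAR's real preprocessing is certainly more involved.

```python
def mix_encode(column, max_onehot_categories=25):
    """Label-encode high-cardinality columns, one-hot encode the rest."""
    categories = sorted(set(column))
    index = {cat: i for i, cat in enumerate(categories)}
    if len(categories) > max_onehot_categories:
        # Label encoding: a single integer column
        return [[index[value]] for value in column]
    # One-hot encoding: one binary column per category
    return [[1 if index[value] == i else 0 for i in range(len(categories))]
            for value in column]

# Made-up columns: one with 2 categories, one with 30
small = ["Private", "State-gov", "Private"]
large = [f"country_{i}" for i in range(30)]

encoded_small = mix_encode(small)
encoded_large = mix_encode(large)

print(encoded_small)     # [[1, 0], [0, 1], [1, 0]]
print(encoded_large[0])  # a single integer code per sample
```

The point of mixing is that one-hot encoding a column with dozens of categories would blow up the number of features, while label encoding keeps it to one column.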
golden features (feature engineering): Generates new features by applying basic operations to pairs of existing features, like feature_1+feature_2, feature_1-feature_2, etc. About 10 such features were created by MLJAR in our case
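A small sketch of how such candidate features could be generated, assuming only sums & differences of column pairs. The column names & values are made up, & how MLJAR scores the candidates and picks the ones it keeps is not shown here.

```python
from itertools import combinations

def candidate_features(rows, names):
    """Append the sum and difference of every pair of numeric columns."""
    pairs = list(combinations(names, 2))
    new_names = list(names)
    for a, b in pairs:
        new_names += [f"{a}+{b}", f"{a}-{b}"]

    new_rows = []
    for row in rows:
        values = dict(zip(names, row))
        extra = []
        for a, b in pairs:
            extra += [values[a] + values[b], values[a] - values[b]]
        new_rows.append(list(row) + extra)
    return new_rows, new_names

# Hypothetical numeric columns, for illustration only
names = ["age", "hours_per_week", "education_num"]
rows = [[39, 40, 13], [50, 13, 13]]

golden_rows, golden_names = candidate_features(rows, names)
print(golden_names)
print(golden_rows[0])  # [39, 40, 13, 79, -1, 52, 26, 53, 27]
```

With 3 columns we already get 6 extra candidates, so on a real dataset some filtering step is needed to keep only the handful that actually help.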
KMeans features (feature engineering): This is again very interesting as it adds features related to the clustering of the data. These may include
- Distance of a sample from all cluster centroids
- The sample’s cluster label
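Both kinds of features are easy to picture with a tiny sketch. I am assuming here that the centroids come from an already fitted k-means model (the values below are made up), & the helper name kmeans_features is hypothetical.

```python
import math

def kmeans_features(sample, centroids):
    """Return the distance to every centroid plus the index of the nearest one."""
    distances = [math.dist(sample, c) for c in centroids]
    label = distances.index(min(distances))
    return distances + [label]

# Centroids assumed to come from a fitted k-means model (values made up)
centroids = [(0.0, 0.0), (3.0, 4.0)]
sample = (3.0, 0.0)

features = kmeans_features(sample, centroids)
print(features)  # [3.0, 4.0, 0]
```

So a sample with 2 original features gains 3 new ones here: two distances & one cluster label.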
insert_random_feature (feature engineering): This step adds a random feature (drawn from a uniform distribution between 0 & 1) to the dataset.
features_selection (feature engineering): To drop irrelevant features, any existing feature with feature importance lower than that of the random feature added in the above step is dropped
Note: Both KMeans features & insert_random_feature were skipped in our case due to the (default) time restrictions. As no random feature was added, features_selection was also skipped
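Even though these steps were skipped in my run, the selection rule itself is simple enough to sketch. The importance scores below are made up, & select_features is a hypothetical helper, not an MLJAR function.

```python
def select_features(importances, random_feature="random_feature"):
    """Keep features at least as important as the injected random feature."""
    threshold = importances[random_feature]
    return [name for name, score in importances.items()
            if name != random_feature and score >= threshold]

# Made-up importance scores, including the injected random feature
importances = {
    "age": 0.30,
    "education": 0.22,
    "fnlwgt": 0.01,
    "random_feature": 0.03,
    "capital_gain": 0.25,
}

kept = select_features(importances)
print(kept)  # ['age', 'education', 'capital_gain']
```

The logic is that a feature scoring below pure noise is probably not carrying any signal, so fnlwgt gets dropped in this toy case.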
hill-climbing steps (hyperparameter related): These are mainly about tuning hyperparameters. Tuning is done in multiple steps, hence hill_climbing_1 & hill_climbing_2. For both the steps, 28 models were tested. For a more detailed discussion on hill climbing: https://www.javatpoint.com/hill-climbing-algorithm-in-ai
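For intuition, here is a generic hill-climbing sketch over a single integer hyperparameter. This is the textbook algorithm, not MLJAR's exact procedure, & the toy_score function is a made-up stand-in for a validation score.

```python
def hill_climb(score, start, step=1, max_iters=50):
    """Move to the better neighbour until no neighbour improves the score."""
    current = start
    for _ in range(max_iters):
        neighbours = [current - step, current + step]
        best = max(neighbours, key=score)
        if score(best) <= score(current):
            break  # local optimum reached
        current = best
    return current

# Hypothetical validation score as a function of tree depth, peaking at depth 6
def toy_score(max_depth):
    return -(max_depth - 6) ** 2

best_depth = hill_climb(toy_score, start=2)
print(best_depth)  # 6
```

In practice every call to the score function means training a model, which is why these steps eat up a good share of the time budget.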
boost_on_errors (modeling): Similar to the boosting technique, where the next model is improved using the errors of past models. This was also skipped due to the default time limit
So, we are now left with stack, ensemble & stack ensemble. These are some crucial concepts that require special attention.
Ensemble: ensemble learning revolves around using multiple models rather than a single model to get some prediction. The ensemble can work in multiple ways. The way it is done in MLJAR is shown below
Start with the empty ensemble (no model).
Add to the ensemble the pretrained model from previous steps that maximizes the ensemble’s performance (improves overall prediction metrics).
Repeat the above step for a fixed number of iterations or until all the models have been used.
Return the ensemble with the best performance metric among all the ensembles built along the way.
What does this mean? Assume you have an ensemble consisting of [A, B] & you are left with C, D for testing, where A, B, C & D are all pretrained models. Now, C will be added to the ensemble if the averaged results of [A, B, C] are better than those of the existing ensemble [A, B]. Likewise for D
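The procedure above can be sketched in a few lines of plain Python. This is my own simplified reading of the steps (plain averaging of predicted probabilities, each model used at most once, log loss as the metric), not MLJAR's actual code, & the predictions for models A, B & C are made up.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean binary log loss; lower is better."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def ensemble_loss(members, model_preds, y_true):
    """Log loss of the plain average of the members' predictions."""
    averaged = [sum(col) / len(col) for col in zip(*(model_preds[m] for m in members))]
    return log_loss(y_true, averaged)

def greedy_ensemble(model_preds, y_true):
    selected, unused = [], sorted(model_preds)   # start with the empty ensemble
    best_members, best_loss = [], float("inf")
    while unused:
        # Add the unused model that gives the lowest loss for the new ensemble
        candidate = min(unused, key=lambda m: ensemble_loss(selected + [m], model_preds, y_true))
        selected.append(candidate)
        unused.remove(candidate)
        loss = ensemble_loss(selected, model_preds, y_true)
        if loss < best_loss:                     # remember the best ensemble seen so far
            best_members, best_loss = list(selected), loss
    return best_members, best_loss

# Made-up validation labels & predicted probabilities of three pretrained models
y_true = [1, 0, 1, 0]
model_preds = {
    "A": [0.9, 0.2, 0.6, 0.4],
    "B": [0.6, 0.4, 0.9, 0.1],
    "C": [0.5, 0.5, 0.5, 0.5],
}

best_members, best_loss = greedy_ensemble(model_preds, y_true)
print(best_members)  # ['B', 'A']
```

Here B alone is the best single model, adding A improves the average slightly, & adding the uninformative C makes things worse, so the returned ensemble is [B, A].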
Stacking: It is a type of ensemble modeling that has 2 levels of models
Level-0 (base models) : Models fitted on training data independently
Level-1 (meta model): A model trained to learn how to combine level-0 models (learning weights for each level-0 model) for best predictions. This meta learner takes as input the out-of-fold predictions of the base models & learns to map them to the target label. Hence: Feature data → Output_base → Output_meta, which becomes our predicted target label
So, we have models to learn how to combine different models !!
By the way, what is out-of-fold prediction?
Assuming you know k-fold cross-validation, an out-of-fold prediction is the prediction made by the model on each of the k holdout sets during training. So, we will have at least one out-of-fold prediction for each sample in the training data (how? In k-fold cross-validation, we divide the data into k sets, & in each iteration we pick k-1 sets for training & the remaining kth set, called the holdout set, for testing. Hence every sample lands in the holdout set once)
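A tiny sketch makes this clearer. I am assuming a simple interleaved fold assignment (sample i goes to fold i % k) & a toy "model" that just predicts the mean training target; both are simplifications picked for illustration, & real libraries usually shuffle the folds.

```python
def out_of_fold_predictions(X, y, k, fit, predict):
    """Return one prediction per sample, made by a model that never saw that sample."""
    n = len(X)
    oof = [None] * n
    for fold in range(k):
        holdout = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in holdout])
        for i, pred in zip(holdout, preds):
            oof[i] = pred
    return oof

# Toy "model": always predicts the mean target seen during training
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, X_new):
    return [model for _ in X_new]

# Made-up data: 6 samples, binary target
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 1, 1]

oof = out_of_fold_predictions(X, y, k=3, fit=fit_mean, predict=predict_mean)
print(oof)  # [0.75, 0.5, 0.75, 0.75, 0.5, 0.75]
```

These out-of-fold predictions are what the level-1 meta model gets as input, which keeps it from learning on predictions the base models made for samples they were trained on.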
Stack ensemble: As conveyed to me by the author of MLJAR (I am serious!!), it's an ensemble built from models trained on the original data & models trained on the stacked data (original + OOF predictions). I actually had a convo with pplonski (author, MLJAR) over this
As we are done with understanding Compete, let’s observe its performance
The total time taken is ~1 hour, with logloss = 0.276 & accuracy on the validation set = 88%. So, an improvement of ~2% over Explain mode!!
The directory created by Compete mode has the same features as we had in Explain mode i.e. a folder for every model tested & similar internal structure as in models trained in Explain mode.
Continuing with the remaining modes…
Perform
As the documentation boasts, it's best suited when you need
an urgent deployment & decent results (& not the best)
So, it's more like a midway solution between Explain (fast but ordinary results) & Compete (great results but dead slow). As we have already discussed the logs for the previous 2 modes at length, we shall look only at what changes in Perform mode
- The models tested are
['Random Forest', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network']
- The steps followed are
['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
As we can see, more models are tested than in Explain mode but fewer than in Compete mode. Similarly, more steps are followed than in Explain mode but fewer than in Compete mode.
Below are the logs for Perform mode (if anyone is interested)
One surprising thing to note from the logs (image 4)
The time taken to train in Perform mode is roughly the same as in Compete mode (actually a few seconds more!!)
The final metric (logloss) is worse than in Compete mode by a mere ~0.002 (0.278), with accuracy = 87.7%
Hence, at least for this dataset, Perform & Compete modes perform similarly, without any significant difference. Though, as datasets become more complex, we might see significant differences.
Optuna
Available as an independent library as well, this optimization framework has become more powerful than ever with integration with autoML
- Algorithms tested
['Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network']
- Steps involved
['simple_algorithms', 'default_algorithms', 'ensemble', 'stack', 'ensemble_stacked']
Now, the slight change in Optuna mode is that the hyperparameters of every model trained at any step are tuned, for a duration set by the optuna_time_budget parameter
automl4 = AutoML(mode="Optuna", optuna_time_budget=120)
Here, optuna_time_budget is the time in seconds for which each model at every step is optimized. The results should generally improve as optuna_time_budget increases.
Note: optuna_time_budget=120 is just to complete the experiment. When actually training, do use a value higher than 120 for better results.
As the logs include both AutoML & Optuna output, I will share excerpts from them for an understanding:
A few points to note:
- Optuna mode took ~1.5 hrs for optuna_time_budget=120. This will increase if the optuna_time_budget hyperparameter is increased
- logloss = 0.275, the lowest amongst all the modes, but accuracy comes out around 86.6% (~1.5% lower than Compete)
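A lower logloss with a lower accuracy may look odd, but the two metrics measure different things: logloss rewards confident correct probabilities, while accuracy only checks which side of the threshold a prediction falls on. The toy numbers below are made up purely to show that this can happen; they are not taken from my runs.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean binary log loss; lower is better."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, probs, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, probs)) / len(y_true)

y_true = [1, 1, 0, 0]
probs_p = [0.55, 0.55, 0.45, 0.45]  # model P: all correct, but barely confident
probs_q = [0.95, 0.95, 0.05, 0.55]  # model Q: three confident hits, one near miss

for name, probs in [("P", probs_p), ("Q", probs_q)]:
    print(name, round(log_loss(y_true, probs), 3), accuracy(y_true, probs))
```

Model Q ends up with the lower logloss even though model P has the higher accuracy, which is the same pattern as Optuna vs Compete above.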
Comparing modes
Though the logloss & accuracy are approximately the same for all the modes, it's the time taken that makes Explain mode look so good compared to the others. Still, there are cases where we may fight for even a 0.01% improvement in metrics, & there modes like Compete or Perform would be a necessity. Also, we mustn’t forget that, given some more time, Optuna can do wonders & we may see a jump in metrics.
Final words
In all, MLJAR is a super easy & powerful framework for AutoML. The performance speaks for itself given that we didn’t analyze a single thing in the dataset, yet achieved at least ~86% accuracy in every mode. If a few steps appear unnecessary to you in a given mode, they can be customized easily, making it flexible to user needs. One must definitely give it a shot.