Lazy Predict All Models in One Go

Sina Nazeri
Published in The Power of AI
7 min read · Mar 22, 2023

Learn to use LazyPredict, a semi-automated ML library for machine-learning tasks.

As machine learning becomes more prominent in various fields — ranging from healthcare to retail — being able to utilize it is a valuable asset. Conveniently, a new subfield of machine learning has emerged: automated machine learning. Designed for non-experts, it automatically runs a variety of models under a certain category and compares their performance.

➜ Pro tip: If you'd like to learn faster and run (or download) this project as a Jupyter Notebook for free, visit CognitiveClass.ai.

In this guided project, you will learn to use LazyPredict, a semi-automated ML library for machine-learning tasks. Specifically, you will use it to make predictions in two real-life scenarios: 1) whether a flight will be delayed and 2) the chances of admission to a university.

Objectives

After completing this lab you will be able to:

  • Understand what AutoML is
  • Apply LazyPredict to classification and regression problems
  • Evaluate models’ performance from LazyPredict

Installing and Importing Required Libraries

!pip install lazypredict
!pip install scikit-learn==0.23.1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import lazypredict
from lazypredict.Supervised import LazyClassifier
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split

Introducing AutoML

What is AutoML?

Automated Machine Learning (AutoML) is a set of tools that help non-ML experts run a variety of models for a given task. This means you can compare many models without first going through the cumbersome process of learning the syntax for each scikit-learn or PyTorch model! How awesome is that?

Some examples of AutoML include:

  • Auto-sklearn: based on scikit-learn, compares different machine learning models available in the library.
  • Auto-PyTorch: based on PyTorch, compares different neural architectures and tunes hyperparameters as well.
  • AutoWEKA: selects the best machine learning model taking into account hyperparameter tuning results.

You can read more about them here.

Shortcomings of AutoML

Despite being able to automatically train multiple models for a task, AutoML does come with a couple of drawbacks:

  • Significantly longer training time
  • Only selects models that do well in the validation set. It doesn’t consider other metrics that are also important to evaluating model performance, such as the amount of time needed for prediction

LazyPredict is an AutoML tool similar to "auto-sklearn". Depending on whether the task is classification or regression, you can define a LazyClassifier() or LazyRegressor() object, which will fit ~20-30 available models on your training data. This simplifies the usual process of model comparison, where you'd need to define all of those model objects and train them separately. Let's jump right into it!
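To make the comparison concrete, here is a minimal sketch of the manual workflow that LazyPredict spares you. It is not LazyPredict's own code: the three candidate models, the synthetic data from make_classification, and names such as manual_results are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the flight dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)

# Every candidate has to be defined by hand...
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

# ...then fitted and scored one by one
rows = []
for name, model in candidates.items():
    model.fit(Xtr, ytr)
    y_hat = model.predict(Xte)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(yte, y_hat),
        "F1 Score": f1_score(yte, y_hat, average="weighted"),
    })

manual_results = (
    pd.DataFrame(rows).set_index("Model").sort_values("Accuracy", ascending=False)
)
print(manual_results)
```

With only three models this is manageable; with a few dozen, the dictionary and the loop are exactly the chunk of code you no longer have to write.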

Lazy Predict: Flight Delay

To illustrate how LazyPredict works, let's try it on a simple binary classification task: determining whether a flight will be delayed, given information about the flight.

Read the data

As usual, use pandas to read the data:

flight_delay = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IND-GPXX0JD1EN/Airlines.csv")
flight_delay.head()

Check the variable type for each column:

flight_delay.info()

Notice that the string columns all have the type “object”. To make them better suited for classification models, let’s convert them to type “category”:

for col in ['Airline', 'AirportFrom', 'AirportTo']:
    flight_delay[col] = flight_delay.loc[:, col].astype('category')
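If you're curious what the conversion actually changes, here is a small sketch on a toy frame. The values are made up; only the column names mirror the flight data. A category column stores each distinct label once and keeps an integer code per row.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the flight data
toy = pd.DataFrame({
    "Airline": ["AA", "DL", "AA", "UA"],
    "AirportFrom": ["SFO", "ATL", "JFK", "SFO"],
})
print(toy.dtypes)  # string columns before conversion

for col in ["Airline", "AirportFrom"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)  # both columns are now "category"

# Distinct labels (sorted) and the integer code assigned to each row
airline_labels = list(toy["Airline"].cat.categories)
airline_codes = toy["Airline"].cat.codes.tolist()
print(airline_labels)  # ['AA', 'DL', 'UA']
print(airline_codes)   # [0, 1, 0, 2]
```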

Lastly, check if there are any missing values to take care of:

flight_delay.isna().sum()

Lucky for us, this dataframe doesn’t have any NAs!

Train-test split

For any machine learning model, the data should be split into training and test sets so that the model’s performance can be measured on unseen data, which is also how overfitting to the training data gets detected.

print("Number of observations: ", flight_delay.shape[0])
print("Number of columns (including id and Delay): ", flight_delay.shape[1])

First, let’s separate out the features (X) from the target variable (y), which is a binary variable indicating if the flight was delayed:

X = flight_delay[:8000].drop(['id', 'Delay'], axis=1)
y = np.array(flight_delay['Delay'][:8000])

Note: Due to memory limitations in SN labs, you won’t be able to finish running LazyPredict on the original dataset. For the purpose of this project, I have taken a subset (the first 8000 observations). However, if you’re running the notebook on your local machine with more memory available, feel free to use the entire dataset!

Next, split both the features and the target into 75% training and 25% test. The split is stratified so that each target class makes up the same proportion of both sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=0)
print("Train set size: ", X_train.shape)
print("Test set size: ", X_test.shape)
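To see what stratify=y buys you, here is a quick sketch on made-up, imbalanced labels (80 zeros and 20 ones). The toy arrays are assumptions for illustration; the call to train_test_split is the same as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 zeros and 20 ones (made-up data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=0
)

# With stratification, the share of class 1 stays the same in both sets
print("Share of class 1 overall: ", y_toy.mean())
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

Without stratify, a random split of a small or imbalanced dataset can leave one set with noticeably fewer positives than the other.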

Fit the model

Now, time for some magic. Since our task is classification, we use LazyClassifier to experiment with all the classification models. You define and fit it much like you would a model in scikit-learn, except that fit() takes both the training and the test sets.

Note: This could take a while to run (10+ minutes), which makes sense given we’re fitting so many models!

# Setting predictions=True also returns the test set predictions as a dataframe
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None, predictions=True)
models, preds = clf.fit(X_train, X_test, y_train, y_test)

You can access test set predictions at preds. Print models, which outputs a table outlining each model's name and performance in common classification metrics using the test set: accuracy, balanced accuracy, ROC-AUC, and F1 score.

print(models)
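If some of these metrics are new to you, the sketch below computes them with scikit-learn on six hand-made labels. Two things here are assumptions on my part rather than something the table tells you: that the F1 column is the weighted average over classes, and that ROC-AUC is computed from hard 0/1 predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)

# Hand-made labels: four negatives, two positives
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)               # 4 of 6 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
auc = roc_auc_score(y_true, y_pred)                # from hard labels (assumption)
f1_w = f1_score(y_true, y_pred, average="weighted")

print(f"Accuracy:          {acc:.3f}")      # 0.667
print(f"Balanced accuracy: {bal_acc:.3f}")  # 0.625
print(f"ROC-AUC:           {auc:.3f}")      # 0.625
print(f"Weighted F1:       {f1_w:.3f}")     # 0.667
```

Notice that balanced accuracy is lower than plain accuracy here: the model does worse on the rarer class, and balanced accuracy gives both classes equal weight.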

Which model do you think is the best? This depends on which metric is important for your task. If accuracy is crucial, then AdaBoost and LabelSpreading are the best, whereas in terms of F1 score, Random Forest performs slightly better. Whichever you choose to base your decisions on, just make sure that you have a valid reason for it.
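Since models is a pandas dataframe, ranking it by the metric you care about is one line. The sketch below uses a stand-in table so it runs on its own: the model names and scores are placeholders I invented, not results from the flight data, and the column names are assumed to match the printed table.

```python
import pandas as pd

# Placeholder scores invented for illustration only; NOT results from the flight data
toy_models = pd.DataFrame(
    {
        "Accuracy": [0.70, 0.68, 0.66],
        "F1 Score": [0.60, 0.67, 0.65],
        "Time Taken": [0.40, 1.20, 0.05],
    },
    index=pd.Index(["ModelA", "ModelB", "ModelC"], name="Model"),
)

by_accuracy = toy_models.sort_values("Accuracy", ascending=False)
by_f1 = toy_models.sort_values("F1 Score", ascending=False)
print("Best by accuracy:", by_accuracy.index[0])
print("Best by F1 score:", by_f1.index[0])

# Shortlist: models within 0.03 of the best F1 score, fastest first
close_to_best = toy_models["F1 Score"] >= toy_models["F1 Score"].max() - 0.03
shortlist = toy_models[close_to_best].sort_values("Time Taken")
print(shortlist)
```

On the real table, swap toy_models for models. The shortlist step is one way to bring run time into the decision, not just the validation score.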

So…what now?

Now with some candidates for the best-performing model, you can use hyperparameter tuning to further improve results and/or predict new observations with them. Since LazyPredict didn't incorporate predict() as a separate function, you will have to obtain predictions the sklearn way: define a new model object (e.g. RandomForestClassifier), fit it on your train set and call predict().

Note: auto-sklearn does provide predict() as a separate function; try playing around with that and see which you prefer.
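The refit-and-predict step might look like the sketch below. To keep it self-contained it uses synthetic numeric data from make_classification, which is an assumption on my part; with the flight data you would first need to encode the categorical columns as numbers, since RandomForestClassifier does not accept string categories directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic numeric stand-in data (an assumption, not the flight dataset)
X_num, y_num = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_new, y_fit, y_new = train_test_split(
    X_num, y_num, stratify=y_num, test_size=0.25, random_state=0
)

# Define the chosen model yourself, fit it on the train set...
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_fit, y_fit)

# ...and call predict() on observations the model has not seen
new_preds = rf.predict(X_new)
print("Predictions for the first 10 rows:", new_preds[:10])
print("Accuracy on held-out rows:", accuracy_score(y_new, new_preds))
```

From here, the hyperparameters passed to RandomForestClassifier are what you would tune next.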

Exercises: Regression

Now, let’s try LazyRegressor for a regression task - predicting chances of university admissions. This dataset contains information about each student's academic performance - including test scores, GPA, research experience - as well as the strength of their application material and the ranking of the university they're applying to. We will try to predict the continuous variable Chance of Admit.

More information about the dataset and variables can be found here.

adm = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IND-GPXX0JD1EN/adm_data.csv")
# Check sample values
adm.head()

➜ If you'd like to work through the exercise and try the regression models on university admission prediction, visit CognitiveClass.ai.

Concluding Note

Now you understand why it’s called “Lazy Predict”! You can quickly determine the best-performing model for a specific use case without a chunk of code. Before you go into a model-fitting frenzy and report the best models for every task you’re working on, do note that this library is, ultimately, still a tool. You should always stay mindful of the assumptions and structure behind the model you choose, and what that’d imply about the overall appropriateness of its application to your specific dataset and task.

Current Limitations

Some cases that AutoML cannot handle (yet) include:

  1. Unsupervised machine learning: these processes are often exploratory and don’t have a universal metric comparable among all the techniques.
  2. Feature engineering: many models benefit from new features created with domain knowledge; AutoML cannot replicate this creative process.
  3. Complex data types: images, text, etc.

There’s still a lot of room for progress in this field. Let’s see where it takes us :)

You can follow me on Medium or LinkedIn and stay tuned for more articles on Data Science, Machine Learning, and AI.

If you are interested in my project, here is my IBM skills network profile:

Sina Nazeri
The Power of AI

Data Scientist at IBM with broad ML skills: Classification, Clustering, CV, NLP, Generative AI. Strong academic background & research/work experience.