PyCaret — Prepare your Machine Learning model in minutes

Ankur Salunke · Published in Analytics Vidhya · Sep 15, 2020 · 7 min read
PyCaret — Covering all the basics for rapid prototyping

While working on a machine learning problem, wouldn't it be better if we could quickly compare a few models to decide which one deserves our time and resources? PyCaret is a library that helps us do precisely that. The same can be achieved with scikit-learn, but what sets PyCaret apart is its low-code interface, which speeds up the process. And it is not limited to prototyping; it can also be used to develop and deploy a full-scale machine learning model.

As we see in the figure above, PyCaret provides the entire gamut of machine learning operations. We will run through all these steps in PyCaret using the Titanic dataset for a classification problem. PyCaret can be used to model both supervised and unsupervised problems. Here we cover supervised classification to get a gist of the library.

1. Data Preprocessing

Let us import the necessary libraries first. Next we would import the dataset using pandas and check a sample of the records.

!pip install pycaret

# pandas/numpy for data handling, pycaret.classification for the modelling workflow
import pandas as pd
import numpy as np
from pycaret.classification import *

# load the Titanic training data and inspect a few records
train = pd.read_csv("../titanic/train.csv")
train.head()

Output:

In PyCaret, the setup function registers the dataset and the dependent (target) variable. The preprocessing is also handled through the setup function, so we will keep adding preprocessing parameters to it.

Let's start off with missing values. Survived is our target variable.

clf = setup(data = train, target = 'Survived', numeric_imputation = "mean", categorical_imputation = "constant")

Here we are imputing the numeric variables with their mean and the categorical variables with "not available". These are the default imputation parameters, so these imputations take place even if we don't explicitly pass them to the setup function. The other options available are "median" for numeric and "mode" for categorical features.
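As a quick illustration, the same call with the alternative strategies would look something like this (a minimal sketch reusing the train DataFrame loaded above):

# impute numeric columns with the median and categorical columns with the mode
clf = setup(data = train, target = 'Survived', numeric_imputation = "median", categorical_imputation = "mode")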

One-hot encoding is the next step, as we need to transform the categorical variables into numeric form before feeding them to a machine learning model. We save time here because PyCaret automatically one-hot encodes all the categorical variables when we set up the data (using the setup function).
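If we want to check what the encoding produced, we can inspect the transformed feature matrix; a small sketch, assuming PyCaret 2.x, where get_config('X') returns the transformed training features once setup has run:

# peek at the transformed features to confirm that Sex and Embarked were one-hot encoded
X_transformed = get_config('X')
print([col for col in X_transformed.columns if col.startswith(('Sex', 'Embarked'))])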

There might be ordinality in your categorical variables, which would cause information loss if they were simply one-hot encoded. For that there is the "ordinal_features" parameter. The categories must be listed from lowest to highest when defining the parameter. In our dataset we can use it for the passenger class (Pclass) variable.

# Pclass takes the values 3, 2 and 1; third class is the lowest, first class the highest
clf = setup(data = train, target = 'Survived', numeric_imputation = "mean", categorical_imputation = "constant", ordinal_features = {'Pclass': ['3', '2', '1']})

PyCaret identifies Pclass as a categorical variable even though it is of numeric type, because of the small number of unique values in this field.
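If the automatic type inference ever guesses wrong, setup also accepts explicit overrides; a minimal sketch assuming the same Titanic columns:

# force Pclass to be treated as categorical and Fare as numeric
clf = setup(data = train, target = 'Survived', categorical_features = ['Pclass'], numeric_features = ['Fare'])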

Sometimes we may come across categories in real-world/test data that were not present while training our models. PyCaret gives us a way to handle these unknown categories through the "handle_unknown_categorical" and "unknown_categorical_method" parameters.

clf = setup(data = train, target = 'Survived', numeric_imputation = "mean", categorical_imputation = "constant", ordinal_features = {'Pclass': ['3', '2', '1']}, handle_unknown_categorical = True, unknown_categorical_method = "least_frequent")

The default values for these parameters are True and "least_frequent". We can turn this behaviour off by setting handle_unknown_categorical to False, or change the method to "most_frequent". The unknown category then gets replaced by the most frequent or least frequent category, as per our choice.

Normalisation/scaling is a critical part of preprocessing. Models that rely on Euclidean distance cannot be expected to produce sensible results without scaled data.

PyCaret provides the parameters normalize and normalize_method for scaling. The methods available are "zscore", "minmax", "maxabs" and "robust", and each has a scikit-learn equivalent. "zscore" uses standard scaling and is the default method.

clf = setup(data = train, target = 'Survived', numeric_imputation = "mean", categorical_imputation = "constant", ordinal_features = {'Pclass': ['3', '2', '1']}, handle_unknown_categorical = True, unknown_categorical_method = "least_frequent", normalize = True, normalize_method = "zscore")

Binning of continuous features is also an important part of feature engineering and can refine the model input in some cases. PyCaret provides the bin_numeric_features parameter for this transformation. The "sturges" rule is used to determine the number of bins and K-means is used to assign values to the bins.

clf = setup(data = train, target = 'Survived', numeric_imputation = "mean", categorical_imputation = "constant", ordinal_features = {'Pclass': ['3', '2', '1']}, handle_unknown_categorical = True, unknown_categorical_method = "least_frequent", normalize = True, normalize_method = "zscore", bin_numeric_features = ["Age"])
Data Setup Snapshot

The snapshot above shows the output we get when running the setup.

2. Model Comparison

Normally we compare models after modelling, but PyCaret gives us an option to evaluate multiple algorithms upfront. The scoring is done on a hold-out/test set.

compare = compare_models()

Output

Model Comparison Output

Just this one line of code has given us a comparison of 15 algorithms. They are scored on Accuracy, AUC, Recall, Precision, F1 score, Kappa and MCC. By default the list is sorted by Accuracy.

There are some parameters which can be used along with the compare_models function, as shown below.

top5 = compare_models(n_select = 5) # return the top 5 models
best = compare_models(sort = 'AUC') # sorted by AUC instead of Accuracy
best_specific = compare_models(whitelist = ['lr','knn','dt']) # compare only these three models
best_specific = compare_models(blacklist = ['catboost', 'xgboost']) # compare all models except CatBoost and XGBoost

3. Model Creation

From the comparison we did earlier we can narrow down our list of choices for modelling. Since CatBoost gave us the best accuracy score, let us use it to create our model.

cat_boost = create_model('catboost')
Create model output

We get a 10-fold cross-validation result as the output of model creation. We can specify whether we want cross-validation and the number of folds using the cross_validation and fold parameters. The accuracy averages out to 82.54%, which is what we got during the comparison. Let us see if we can improve on this.
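A hedged sketch of those two parameters (assuming PyCaret 2.x):

cat_boost_5fold = create_model('catboost', fold = 5)  # 5-fold cross-validation instead of the default 10
cat_boost_no_cv = create_model('catboost', cross_validation = False)  # skip cross-validation and train once on the training split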

4. Hyperparameter Tuning

Hyperparameter tuning in PyCaret can also be done in a single line of code. This is done through a random grid search over predefined grids, which can be customized.

tuned_catboost = tune_model(cat_boost)
Hyperparameter tuning output

Since the results have deteriorated, we will continue without tuning. If we want to take the tuning further manually, we can pass a customized grid of hyperparameters to be tried out.

# illustrative custom grid; the exact CatBoost hyperparameters and values are up to you
params = {'depth': [4, 6, 8], 'learning_rate': [0.01, 0.05, 0.1]}
tuned_catboost = tune_model(cat_boost, custom_grid = params)
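tune_model also accepts a few other useful arguments; a small sketch, assuming PyCaret 2.x, where n_iter controls how many random grid candidates are tried and optimize sets the metric to maximize:

tuned_catboost_auc = tune_model(cat_boost, n_iter = 50, optimize = 'AUC')  # wider random search, tuned for AUC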

5. Ensembling

PyCaret also gives us the ability to ensemble models. We have the entire spectrum of ensembling at our disposal: bagging, boosting, blending and stacking.

We apply boosting to our dataset, using AdaBoost with 100 estimators. We can change the method to 'Bagging' instead.

boosted_catboost = ensemble_model(cat_boost, method = 'Boosting', n_estimators = 100)
Ensemble Output

There is no improvement here. We can also try stacking and blending to check for better results.
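A minimal sketch of what that could look like, assuming PyCaret 2.x, where blend_models builds a voting ensemble and stack_models fits a meta-model on top of the base estimators:

lr = create_model('lr')
knn = create_model('knn')
blended = blend_models(estimator_list = [cat_boost, lr, knn])  # voting ensemble of the three models
stacked = stack_models(estimator_list = [lr, knn], meta_model = cat_boost)  # stacking with CatBoost as the meta-model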

6. Deployment

We will split deployment into two parts: 1. finalizing the model to make predictions, 2. deploying the model on AWS.

We will cover only the first part here. The deployment to AWS can be checked out in the PyCaret documentation.

First let's finalize the model and make predictions on a test set. When we finalize the model, it is retrained on the whole training set, i.e. including the test/hold-out split.

catboost_final = finalize_model(cat_boost)
test = pd.read_csv("../titanic/test.csv")  # Titanic test file, assumed to sit alongside train.csv
test_predictions = predict_model(catboost_final, data = test)
test_predictions.head()
Predictions on the test set

The Label column holds the predictions for the test set.
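Although we are not covering AWS deployment here, the finalized pipeline can also be persisted locally and reloaded later; a small sketch, assuming PyCaret 2.x (the file name is arbitrary):

save_model(catboost_final, 'catboost_titanic')  # writes catboost_titanic.pkl, including the preprocessing pipeline
loaded_model = load_model('catboost_titanic')   # restore it later for predictions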

I hope you enjoyed reading through this guide, which can help you with rapid prototyping when you are starting a project. It can help you decide which direction to take, and when time is short this library helps you get very good results quickly.
