PyCaret: Comprehensive Guide and Insights

Akshat A. Mistry
7 min read · Sep 22, 2023


In today’s fast-paced world, automation is key. PyCaret offers a streamlined approach to simplify the end-to-end machine learning workflow, making your life easier.

You might wonder, what’s left for you to do? Well, you have the choice to either write everything from scratch or let PyCaret handle the heavy lifting.

Author’s Note:

It’s worth emphasizing that PyCaret is NOT driven by AI but operates based on your directives or default settings.

Let’s dive in and explore its capabilities!

Intended Audience

This blog offers a comprehensive view of every aspect of the PyCaret module. If you are seeking specific functionality within this module, feel free to navigate directly to that section.

What is it?

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that exponentially speeds up the experiment cycle and makes you more productive.

— PyCaret official documentation

What are its functions?

It encompasses tasks such as data preprocessing, handling missing values, detecting outliers, feature selection, training models, choosing the best models, fine-tuning hyperparameters, employing ensemble modeling techniques, and more.

Installation

# installation (you might need to install some dependencies manually, as required)
pip install pycaret

# full version (PREFERRED); the quotes keep shells like zsh from expanding the brackets
pip install "pycaret[full]"

# for Jupyter Notebook or Google Colab
# (the exclamation mark (!) in notebooks runs shell commands)
!pip install "pycaret[full]"

Modules Supported by PyCaret:

1. Supervised ML:

# pycaret classification
from pycaret.classification import *

# pycaret regression
from pycaret.regression import *

2. Unsupervised ML:

# pycaret clustering
from pycaret.clustering import *

# pycaret anomaly detection
from pycaret.anomaly import *

3. Time Series:

# pycaret time series
from pycaret.time_series import *

Version Checking

# check installed version
import pycaret
pycaret.__version__

# latest as of Sept. 20, 2023 is '3.1.0'

Accessing the Datasets provided by PyCaret:

from pycaret.datasets import get_data

# Displays the list of available datasets
all_datasets = get_data('index')

# Access the required dataset (return type: pandas df)
df = get_data('iris')

Author’s Note:

Pre-curated datasets are valuable for learning and experimentation, but they may not fully represent the complexity and challenges of real-world data. Real-world applications often demand a broader skill set in data collection, cleaning, and feature engineering to derive meaningful insights.

The ‘all_datasets’ index, captured in the Google Colab environment using the interactive table widget and pre-filtered to display only regression datasets.

Setup() Function

Author’s Note:

This tutorial centers on a regression machine learning task using the ‘insurance’ dataset from pycaret.datasets. For other machine learning tasks (unsupervised or time series), the fundamental structure of the setup() method remains consistent; specific attributes may be included or excluded to fit the particular use case.
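For instance, the time-series module keeps the same setup() pattern while adding task-specific arguments. A minimal sketch, assuming the ‘airline’ sample dataset and the module’s forecast-horizon parameter fh:

# a sketch: the same setup() pattern in the time-series module,
# with a module-specific forecast horizon (fh) of 12 periods
from pycaret.time_series import setup as ts_setup
from pycaret.datasets import get_data

airline = get_data('airline')
ts_exp = ts_setup(data = airline, fh = 12, session_id = 6842)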

1. Import the necessary modules

# import the regression module
from pycaret.regression import *

# using pre-curated dataset for this tutorial
from pycaret.datasets import get_data
df = get_data('insurance')

2. Check the Data-types

# check the data type of each feature; correct if required.
# the current dataset contains the correct type for each feature.
df.dtypes
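If a type were inferred incorrectly, you could cast it before setup(). A small illustrative sketch (not actually needed for the ‘insurance’ dataset, whose dtypes are already correct):

# hypothetical correction: cast columns to the intended types
df['children'] = df['children'].astype('int64')
df['sex'] = df['sex'].astype('category')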

3. Obtain the supported ML models.

Retrieve the list of supported machine learning models. Determine which specific models are needed for your application. Record the IDs of the selected models either in a separate list or directly within the compare_models() function in upcoming steps.

Author’s Note:

You have the option to skip this step if you aim for comprehensive coverage by applying all the defined models in your application, with the intention of gauging their relative performance. However, it’s worth noting that this approach may NOT be advisable because different models have distinct preprocessing requirements, and such indiscriminate application could potentially disrupt the learning process.

# to execute the models() statement, we first create a dummy setup.
# without setup(), one can't execute the models() statement!
setup(data = df, verbose = False)

# Get the list of supported models
models()
Identify the IDs of the required models. For this tutorial, let’s apply the following models:

1. Linear Regression (lr)
2. Ridge Regression (ridge)
3. Lasso Regression (lasso)
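To keep things tidy, record the chosen IDs in a separate list and pass it to compare_models() later; a small sketch:

# IDs of the models selected for this tutorial
selected_model_ids = ['lr', 'ridge', 'lasso']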

4. Identify the preprocessing requirements for the selected model(s).

CHECK-LIST
1. Scaling......................required
2. Handling Categorical Data....required
3. Handling Missing Data........required
4. Outlier Detection/Removal....required
5. Transformation...............let's go with not-required !

5. Let’s setup()…

# Uncomment to view the structure/parameters of the setup function
# help(setup)

reg = setup(

    # session_id is equivalent to 'random_state' in scikit-learn
    session_id = 6842,

    # dataset to work with
    data = df,

    # dependent feature (index or feature name); -1 means the last column
    target = -1,

    # training and validation proportion (default 0.7)
    train_size = 0.8,

    # numerical and categorical features
    # - (specify only if the inferred types are incorrect)
    # - do NOT include the target feature
    # - examples include:
    # numeric_features = ['age', 'bmi', 'children', 'smoker'],
    # categorical_features = ['sex', 'region'],

    # missing-value imputation (default 'simple', other: 'iterative')
    imputation_type = 'simple',

    # numerical imputation type
    # - (ignored when ``imputation_type = 'iterative'``)
    # - available options: "drop", "mean" (default), "median", "mode", "knn", int or float
    numeric_imputation = 'knn',

    # categorical imputation type
    # - (ignored when ``imputation_type = 'iterative'``)
    # - available options: "drop", "mode" (default), str
    categorical_imputation = 'mode',

    # ordinal features, lowest (0) to highest (max)
    # - no ordinal features are present in this dataset, but examples include:
    # ordinal_features = {
    #     'Outlet_Size' : ['Small', 'Medium', 'High'],
    #     'Outlet_Location_Type' : ['Tier 3', 'Tier 2', 'Tier 1']
    # },

    # with normalize = True, StandardScaler ('zscore') is applied by default
    # - available options: 'zscore' (default), 'minmax', 'maxabs', 'robust'
    normalize = True,
    normalize_method = 'zscore',

    # low-variance filter (default: None)
    # - threshold = 0 ensures exclusion of columns with a single value.
    # - one can set a custom threshold if required.
    low_variance_threshold = 0,

    # outliers in the training data are removed using an Isolation Forest
    remove_outliers = True,

    # transformation applied using the default 'yeo-johnson'
    # - available options: 'yeo-johnson' (default), 'quantile'
    transformation = False,
    transformation_method = 'yeo-johnson',

    # the feature selection portion is excluded here, but one can refer to: help(setup)

)
Review and correct, if necessary:

# one can access/check all the available configurations of the setup() method.
get_config() # returns a list of available configurations.

# possible use-case of get_config():
get_config('X_train_transformed')['age'].hist()

# use set_config(<name>, <new-value>) to manipulate the configurations.
# Possible writeable variables are:
# ['idx', 'n_jobs_param', 'log_plots_param', 'seed', 'pipeline',
# 'logging_param', 'gpu_param', 'exp_name_log', 'fold_groups_param',
# 'html_param', 'USI', 'target_param', 'memory', 'fold_shuffle_param',
# 'transform_target_param', 'exp_id', 'fold_generator', 'data']
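For example, to update one of the writeable variables listed above and read it back (a small sketch using 'seed' from that list):

# update the experiment seed, then confirm the change
set_config('seed', 123)
print(get_config('seed'))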

6. Create/Compare models

Author’s note:

The `create_model()` method is excluded from this review for a specific reason: in real-life scenarios, it’s uncommon to be absolutely certain about the best model without comparing the performance of different models. If you do happen to know the best model in advance, the official create_model() documentation covers the syntax and implementation; a minimal sketch is shown below.
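A minimal sketch, assuming you already know the model ID (here 'lr'; the fold argument overrides the cross-validation default of 10):

# train a single model with the default 10-fold cross-validation
lr_model = create_model('lr')

# or override the number of folds
lr_model = create_model('lr', fold = 5)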

best_R2_models_top2 = compare_models(
    # list the selected models
    # - omit, if you wish to apply every available model
    include = ['lr', 'ridge', 'lasso'],

    # to exclude any models from the entire set of available models
    # - every model except 'dt' would be applied
    # - omit, if you wish to apply every available model
    # exclude = ['dt'],

    # performance will be sorted based on
    # - MAE, MSE, RMSE, R2 (default), RMSLE, or MAPE
    sort = 'R2',

    # number of top models to return
    n_select = 2
)
The output of compare_models() is a cross-validated scoring grid; references to the top 2 models are stored in the variable as a list.

7. Hyperparameter Tuning

# tune each model in the list
# alternatively use 'tune_model(model)', keeping n_select = 1 (default)
for i in range(len(best_R2_models_top2)):
    best_R2_models_top2[i] = tune_model(best_R2_models_top2[i])
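tune_model() also accepts optional arguments; a small sketch using the standard n_iter and optimize parameters to widen the random search and target a different metric:

# a sketch: 25 random-search iterations (default is 10), optimizing MAE
tuned = tune_model(best_R2_models_top2[0], n_iter = 25, optimize = 'MAE')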

8. Ensemble learning (stacking & blending)

Resource(s) for detailed information: blog & github

# blend top 2 models
# use this to get help: help(blend_models)
blended = blend_models(best_R2_models_top2)

# stack models
# use this to get help: help(stack_models)
stacked = stack_models(best_R2_models_top2)
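Both functions accept optional arguments; a sketch using blend_models()'s weights parameter and stack_models()'s meta_model parameter (when meta_model is omitted, PyCaret fits a Linear Regression meta-model for regression tasks):

# weighted blend: give the first model twice the weight of the second
blended_weighted = blend_models(best_R2_models_top2, weights = [2, 1])

# stacking with an explicit meta-model instead of the default
stacked_ridge = stack_models(best_R2_models_top2, meta_model = create_model('ridge'))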

9. Visualization

dashboard(blended, display_format = 'inline')

# or use evaluate_model to get more details
# evaluate_model(blended)

Both render interactive widgets: dashboard() produces a full explainer dashboard, while evaluate_model() provides an in-notebook evaluation widget.
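If you prefer individual diagnostics over the full dashboard, plot_model() renders one plot at a time; a sketch using two standard regression plot names:

# residuals plot of the blended model
plot_model(blended, plot = 'residuals')

# prediction-error plot
plot_model(blended, plot = 'error')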

10. Finalize the model

# finalize a model (refit on the entire dataset, including the hold-out set)
finalised = finalize_model(blended)

# displaying the variable reveals the full pipeline!
finalised

11. Save/load the pipeline

# save pipeline
save_model(finalised, 'finalised_best_pipeline')

# load pipeline
loaded_pipeline = load_model('finalised_best_pipeline')
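The reloaded pipeline can then score unseen data with predict_model(); a minimal sketch, where new_df is a hypothetical DataFrame sharing the training schema (minus the target):

# 'new_df' is hypothetical: unseen rows with the same columns as the training data
predictions = predict_model(loaded_pipeline, data = new_df)
print(predictions.head())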

Conclusion

PyCaret is a versatile and powerful open-source machine learning library in Python that significantly simplifies the end-to-end machine learning workflow. It automates tasks from data preprocessing to model selection, hyperparameter tuning, and ensemble modeling, offering flexibility and efficiency for both beginners and experienced data scientists. Keep in mind that PyCaret operates based on your directives or default settings rather than AI-driven decision-making. Whether you’re exploring machine learning or looking to streamline your ML projects, PyCaret is a tool worth exploring to enhance your productivity.

Expression of Appreciation

Dear Valued Readers,

I sincerely appreciate your time and commitment to this blog. Your pursuit of knowledge is commendable, and I’m honored to share insights with you. Your time is cherished, and I’m dedicated to delivering enriching content. Thank you for joining this journey.

Best regards,

Akshat A.
