PyCaret: A library that made ML easy

Divyasribhargavi · Published in School of ML · 7 min read · Aug 30, 2020

“We yearn for the beautiful, the unknown, and the mysterious.”
– Issey Miyake


Right from when I started learning machine learning, I found only one library for data modeling, i.e., Scikit-Learn. I agree that the library is great, but something was missing: writing all that lengthy syntax sometimes used to overwhelm me. But then I thought:

what if there was a library that could make my work much simpler and execute the same task with fewer lines of code?

That’s how I was introduced to PyCaret.


Okay! Jokes apart. I found this library to be quite useful because it helps me reduce, say, 20 lines of code to just a few lines of code.

What is PyCaret?

PyCaret is an open-source, low-code machine learning library that reduces the cycle time from hypothesis to insights in an ML experiment. PyCaret is easy and simple to use, and deployment-ready. It automatically creates a pipeline that can be saved as a binary file and used across different environments. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy and many more.

Pycaret has the modules for the following techniques:

Regression

Classification

Clustering

Anomaly detection

Natural language processing

Association rule mining

Click on the links for more information

To see exactly how PyCaret simplifies the data science process, let’s walk through an example. Here, we will use the anomaly detection module to illustrate the simplicity of code in PyCaret.

But before we start, let’s see the installation in different environments

Installing in an Anaconda environment

#create a conda environment
conda create --name yourenvname python=3.6
#activate environment
conda activate yourenvname
#install pycaret
pip install pycaret==2.1
#create notebook kernel connected with the conda environment
python -m ipykernel install --user --name yourenvname --display-name "display-name"

Installing on Google Colab

for installation

!pip install pycaret==2.1

for enabling visualisations

#For Google Colab only
from pycaret.utils import enable_colab
enable_colab()

Installing on Mac OS

Mac users will have to install LightGBM separately using Homebrew, or it can be built using CMake and Apple Clang or gcc. See the instructions below:

  1. Install CMake (3.16 or higher):
     >> brew install cmake
  2. Install OpenMP:
     >> brew install libomp
  3. Run the following commands in the terminal:
git clone --recursive https://github.com/microsoft/LightGBM ; cd LightGBM
mkdir build ; cd build
cmake ..
make -j4

For more details on how to install, visit here.

Let’s begin with our example and compare it with Scikit-Learn

Anomaly Detection

Anomaly detection is nothing but the detection of outliers in a dataset.

These outliers can be indicators of suspicious activity in real-life situations.

Some real-life sectors where anomaly detection is used are banking, finance, manufacturing and healthcare.


Step 1: Reading from the dataset

The dataset can be read in two ways:

i) Reading the dataset using pandas

ii) Loading the data from PyCaret’s repository

Reading the data from pandas

# Importing data using pandas
import pandas as pd
data = pd.read_csv('c:/path_to_data/file.csv')

Reading the data from PyCaret’s repository

# Loading data from pycaret
from pycaret.datasets import get_data
data = get_data('anomaly')

Here, in my case, I used the dataset from the PyCaret repo.

Step 2: Setting up the environment

Before we start modeling, we first need to set up the environment.

The setup involves two steps:

Importing the module

Setting up the environment

Importing the module

First, we need to import the module before we start modeling. We can import the modules in the following way


Syntax:

from pycaret.xx import *
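For this anomaly detection example, the import looks like this (other modules such as pycaret.classification or pycaret.regression are imported the same way):

# import the anomaly detection module
from pycaret.anomaly import *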

Setting up the environment

Setup initialises the environment in PyCaret. The setup function comes with a lot of parameters. When you run setup, it automatically imputes missing values in categorical and numerical features. By default, the numerical features are imputed with the mean and categorical features with a constant.

Syntax:

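A minimal sketch of the call for the anomaly module (session_id is optional; it just fixes the random seed so results are reproducible):

# initialise the anomaly detection experiment on the data loaded earlier
exp_ano = setup(data, session_id = 123)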

Output: setup displays an information grid summarising the experiment.

Some attributes of setup:

Data: the dataframe (array or sparse matrix) passed to setup.

Original data: the shape of the original dataset.

Numeric features: the number of numeric features present in the data. Here, in my case, the dataset has 10 numerical features. If any numeric feature is inferred as categorical, we can change the data type like this:

numeric_features = ['column1']

Categorical features: the number of categorical features present in the data. If any categorical feature is inferred as numeric, we can change it to categorical like this: categorical_features = ['column1'].

Ordinal features: returns True if ordinal features are present. To encode ordinal features, we can pass a dictionary like

ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }

as a parameter. The listing sequence should be from lowest to highest.

Categorical imputation: string, default = 'constant'. Missing categorical values are imputed with a constant by default. Another available option is 'mode', which imputes the most common categorical value. We can impute in this way:

categorical_imputation = 'mode'

Numeric imputation: string, default = 'mean'. Missing numeric values are imputed with the mean by default. The other available options are 'median' and 'mode'.

ignore_features: string, default = None
If any feature should be ignored for modeling, it can be passed to the ignore_features parameter. ID and DateTime columns, when inferred, are automatically set to be ignored for modeling.

normalize: bool, default = False
When set to True, the feature space is transformed using the normalize_method parameter. Generally, linear algorithms perform better with normalized data; however, the results may vary, and it is advised to run multiple experiments to evaluate the benefit of normalization.

high_cardinality_features: string, default = None
When the data contains features with high cardinality, they can be compressed into fewer levels by passing a list of the column names with high cardinality.

I found these features quite useful; they simplify my data science tasks.
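As a rough sketch, a few of the parameters above can be combined in a single setup call (the exact choices depend on your data):

# a minimal sketch combining a few of the parameters discussed above
exp_ano = setup(data,
                normalize = True,               # scale the numeric features
                numeric_imputation = 'median')  # impute missing numeric values with the median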

Step 3: Creating a model

The create function builds a model on the given dataset. We can create a model using the create_model() function.

create_model(model = None, fraction = 0.05, verbose=True, system = True, **kwargs)

To see all the models available in a module, we can use the models() function.

Note: In order to use a particular model, we need to pass the ID shown in the models() table while creating the model.

Here, in my case, I'm using the iforest (Isolation Forest) model for anomaly detection.

Note that we do not need to pass the data here, as it was already initialised during setup.
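A minimal sketch of this step (the 'iforest' ID comes from the models() table):

# list the available anomaly detection models and their IDs
models()

# create an Isolation Forest model; fraction is the expected proportion of outliers
iforest = create_model('iforest', fraction = 0.05)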

Step 4: Assign the model

Once the model is created, each observation in the data is flagged as an outlier or inlier. The assign_model function attaches these flags to the dataset (outlier = 1, inlier = 0).

assign_model(model, transformation = False, verbose = True)
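A minimal usage sketch for the model created above:

# attach the outlier flags and scores to the dataset used in setup
iforest_results = assign_model(iforest)
iforest_results.head()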

Step 5: Plot the model

The PyCaret library has an interactive Plotly graph that gives a 3D view of the data.

plot_model(model, plot = 'tsne', feature = None, save = False, system = True)
plot_model(iforest, save = True)

save: when set to True, this saves the plot as an HTML file.

Plot: by default, the plot is set to 'tsne'. The plot_model function has 2 plots for the anomaly module.


Step 6: Predict the model

This function is used to predict new data using a trained model.

predict_model(model, data, platform=None, authentication=None)
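A minimal usage sketch, scoring the same dataset with the trained model (new unseen data could be passed instead):

# generate outlier labels and scores for the data
predictions = predict_model(iforest, data = data)
predictions.head()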

At last, we can see the Label and Score columns added to the data. The Label is 0 for inliers and 1 for outliers.

And that’s how I was able to predict in just 6 lines of code. This not only speeds up the work but also makes modeling quite simple.

Some useful functions that I found in other modules

AutoML

This function returns the best model out of all the models created in the currently active environment, based on the metric defined in the optimize parameter. Run this code at the end of your script.

Note: This function is only available in the classification and regression modules.
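A minimal sketch of how it can be called at the end of a classification experiment, assuming several models have already been created in the session:

# assuming a classification experiment, e.g. after: from pycaret.classification import *
# return the best model from the current session, ranked by the chosen metric
best_model = automl(optimize = 'Accuracy')
print(best_model)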

Output

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

Compare Models

This function returns the best model based on the metric defined in the sort parameter.

It runs all the available models in the module and outputs the best-fit model.
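A minimal sketch for a classification experiment:

# assuming a classification experiment set up with pycaret.classification
# train and evaluate all models in the library, ranked by the sort metric
best = compare_models(sort = 'Accuracy')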

You can find my Notebook here

If you find this useful, please give a clap
