ML Prototyping Made Easy

End-to-end Low Code ML using PyCaret

One-Stop Solution for different ML Problems!

Shashvat G
Coding in Simple English


Photo by Edward Howell on Unsplash

Have you ever wanted a simple, easy-to-use Python library, bundled in a single package, that can handle an end-to-end machine learning journey with a consistent interface? If you are working on a proof of concept (PoC), PyCaret can be a handy tool, as it provides a one-stop solution for a variety of problems. Be it removing multicollinearity from your dataset or imputing missing values, it has got your back.

PyCaret can significantly reduce the time and effort required to build and train ML models, often to only a couple of function calls. It has built-in functionality for feature engineering, hyperparameter tuning, and more. It can also compare models, and stack or ensemble them, based on your requirements. Furthermore, it can be integrated with MLflow to track and manage ML experiments, and it can deploy models to cloud platforms like Azure or AWS. Usually, we follow a defined machine learning pipeline when creating a model. PyCaret does this heavy lifting for you under the hood.

So, you may ask: what is PyCaret?

PyCaret is an open-source, low-code machine learning library in Python that does the heavy lifting for you, from data preparation to model deployment, in only a few lines of code.

PyCaret can be used for different machine learning tasks, e.g. classification, regression, and anomaly detection. For instance, to create a decision tree model in PyCaret, a single line of code is sufficient once the experiment has been set up. You just need to be careful to pass the correct model ID as a parameter ('dt' for a decision tree).

model = create_model('dt')

This post provides a step-by-step tutorial for solving a classification problem in machine learning. Do check out the tutorials in the reference section for regression and anomaly detection using PyCaret.

Pre-requisites

  • Python 3.x
  • The latest version of PyCaret
  • Some understanding of ML

A quick overview of the problem and dataset:

  • Dataset used: PyCaret in-built dataset
  • Problem definition: Detect fraudulent credit card transactions.
  • Challenges: Imbalanced data
  • Target: Label 1/0 for Fraud/Not Fraud
  • ML problem: Binary classification
  • Dataset size: 24,000 rows, 24 columns

Installation

We can install it with pip, the Python package manager. It is recommended to use a virtual environment to avoid version compatibility issues. Check this out if you face any issues during installation. The command below pins version 2.1, the latest version at the time of writing.

$ pip install pycaret==2.1

Now that we have some background, let's dive straight in. We will follow these steps to build and train a classifier that can detect fraudulent transactions.

  1. Read the Dataset: Similar to scikit-learn, PyCaret ships with some in-built datasets, one of which is used here for demonstration.
  2. Initialize Setup: The setup function infers the data types of the features in the dataset, checks for missing values, and imputes them with the mean for continuous features or a constant value for categorical features.

The fix_imbalance argument can automatically fix an imbalanced dataset by oversampling with SMOTE. We can also optionally pass the variables we want to exclude in the ignore_features argument of the setup function.
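Steps 1 and 2 can be sketched as follows. This is a minimal sketch, not the code from the original notebook: the dataset name `credit`, the target column `default`, the fixed `session_id`, and the helper names `setup_options` and `initialize_experiment` are all assumptions made for illustration.

```python
DATASET_NAME = "credit"  # assumption: ID of the bundled dataset
TARGET = "default"       # assumption: name of the 1/0 label column


def setup_options(target, ignore=None, seed=123):
    """Collect the keyword arguments for PyCaret's setup() (hypothetical helper)."""
    options = {
        "target": target,
        "fix_imbalance": True,  # oversample the minority class with SMOTE
        "session_id": seed,     # fixed seed so results are repeatable
    }
    if ignore:
        options["ignore_features"] = list(ignore)  # columns to exclude
    return options


def initialize_experiment(dataset_name=DATASET_NAME, target=TARGET, ignore=None):
    """Load a bundled dataset and initialize the classification experiment."""
    # Imported here so setup_options() works even without PyCaret installed.
    from pycaret.datasets import get_data
    from pycaret.classification import setup

    data = get_data(dataset_name)  # pandas DataFrame
    experiment = setup(data=data, **setup_options(target, ignore))
    return data, experiment
```

Calling `initialize_experiment()` once at the top of a notebook prepares the data for every later step.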

3. Compare Baseline / Create a Model: compare_models trains the available models and returns the best one (try models() to see the list of models available).

By default, it returns results sorted by Accuracy. create_model trains only a single model, so it takes less time than compare_models.
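Here is a rough sketch of this step, assuming setup() has already been run in the same session. The wrapper `baseline_model` is hypothetical, and its `api` parameter (defaulting to `pycaret.classification`) exists only so the flow can be tried with a stand-in object.

```python
def baseline_model(model_id=None, sort="Accuracy", api=None):
    """Train one model by ID, or compare all available models."""
    if api is None:
        # Real library; requires that setup() has been called first.
        import pycaret.classification as api
    if model_id is not None:
        return api.create_model(model_id)  # single model, e.g. 'dt'
    return api.compare_models(sort=sort)   # best model by the chosen metric
```

For example, `baseline_model()` compares everything, while `baseline_model('dt')` trains only a decision tree.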

4. Tune Hyperparameters: The tune_model function performs hyperparameter tuning on a trained model.

For instance, you can specify the number of neighbors to search over when building a K-NN classifier. You can optionally choose the metric to optimize for.
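The tuning step might look like the sketch below. It assumes a model created earlier in the session. The wrapper name `tune_classifier`, its `api` parameter, and the choice of AUC as the metric are illustrative assumptions; passing the search space through `custom_grid` reflects my understanding of PyCaret 2.1 and is worth checking against the version you install.

```python
def tune_classifier(model, metric="AUC", grid=None, api=None):
    """Tune a trained model, optionally over a user-supplied grid."""
    if api is None:
        import pycaret.classification as api  # requires setup() first
    kwargs = {"optimize": metric}  # metric used to pick the best candidate
    if grid is not None:
        # For K-NN, for example: {"n_neighbors": [3, 5, 7, 9]}
        kwargs["custom_grid"] = grid
    return api.tune_model(model, **kwargs)
```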

5. Finalize Model: Model finalization is the last step in the experiment.

The finalize_model() function fits the model onto the complete dataset including the hold-out sample.
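As a hedged sketch, finalizing and then scoring new data could look like this. The wrapper `finalize_and_predict` and its `api` parameter are hypothetical, and the use of predict_model on unseen rows is an addition for illustration rather than something shown in the original notebook.

```python
def finalize_and_predict(model, unseen_data, api=None):
    """Refit the model on all available data, then score unseen rows."""
    if api is None:
        import pycaret.classification as api  # requires setup() first
    final_model = api.finalize_model(model)   # train + hold-out sample
    predictions = api.predict_model(final_model, data=unseen_data)
    return final_model, predictions
```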

6. Analyze Model: The plot_model() function can be used to analyze performance through the AUC curve, confusion matrix, learning curve, etc. This function takes a trained model object and displays a plot.
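A small sketch of the analysis step, assuming a trained model from the current session. The helper `plot_diagnostics`, the `PLOTS` tuple, and the `api` parameter are names invented for this example; the plot IDs are the ones I believe PyCaret uses for the three charts mentioned above.

```python
PLOTS = ("auc", "confusion_matrix", "learning")  # assumed plot IDs


def plot_diagnostics(model, plots=PLOTS, api=None):
    """Render one diagnostic plot per requested plot ID."""
    if api is None:
        import pycaret.classification as api  # requires setup() first
    for name in plots:
        api.plot_model(model, plot=name)
```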

7. Save Model: PyCaret's save_model function lets you save the model along with the entire transformation pipeline for later use. The companion load_model function loads the saved model back.
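Saving and reloading can be sketched as below. The wrapper `save_and_reload`, the file name `fraud_classifier`, and the `api` parameter are assumptions chosen for this example.

```python
def save_and_reload(model, name="fraud_classifier", api=None):
    """Persist the model with its pipeline, then load it back."""
    if api is None:
        import pycaret.classification as api
    api.save_model(model, name)  # written to disk as a pickle file
    return api.load_model(name)  # pass the same name, without an extension
```

The reloaded object can then be used for predictions in a fresh session.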

The entire experiment notebook is available here. Feel free to fork and play around.

Bonus

If you want a pandas-profiling report for the dataset, simply pass `profile = True` to the setup function and the report will be generated for you.
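In code, that could look like the following sketch. The wrapper `setup_with_profile` and its `api` parameter are hypothetical; `data` is assumed to be a pandas DataFrame and `target` the name of its label column.

```python
def setup_with_profile(data, target, api=None):
    """Initialize the experiment and request a profiling report."""
    if api is None:
        import pycaret.classification as api
    # profile=True asks setup() to produce the profiling report.
    return api.setup(data=data, target=target, profile=True)
```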

Parting Thoughts

PyCaret is a low-code ML library that can be used to quickly build and train a model end-to-end, which can be super useful while working on a PoC. We can avoid importing dozens of Python packages, and all this can be achieved in only a couple of function calls. It also provides integration with MLflow and deployment on Azure and AWS.

Some functionality worth exploring: the plot_model feature, which produces a nice visualization of feature importance, and slightly more advanced techniques like stacking and ensembling to improve model performance. Here’s the blog post that contains a nice and detailed explanation from the founder of PyCaret.

I’d love to hear your thoughts about PyCaret and Machine learning in the comments below!

References:


Shashvat G
Coding in Simple English

Data Scientist | Analyst who aspires to continuously learn and grow in Data Science Space. Find him on LinkedIn https://www.linkedin.com/in/shashvat-gupta/