Fundamentals of MLOps — Part 3 | ML Experimentation using PyCaret

Tezan Sahu
Published in Analytics Vidhya
8 min read · Sep 5, 2021


In Part 2 of this 4-blog series, we gained some experience with DVC, a tool to implement efficient versioning of ML artifacts. With our data in place, the next steps in an ML workflow are to first perform EDA, followed by feature engineering & model training. Although they form a reasonably small portion of the entire infrastructure of an ML system, it is evident that these steps form the heart of the ML pipeline.

With increased competition across the globe, organizations are always attempting to create and deliver better solutions more quickly. To do so, it is essential to make the iterative processes in an ML pipeline quicker, more robust & more efficient. Through this article, we will understand some of the hype around “no-code” & “low-code” Machine Learning, which primarily aims to automate ML pipelines, & in the process, dive into a Python library named PyCaret, which can greatly reduce the time spent experimenting with ML pipelines. So, let’s get started…


ML Pipelines

Firstly, let’s be clear about the notion of Machine Learning Pipelines, since we have used this term previously without actually defining it. In software development, the term ‘Pipeline’ draws its roots from the DevOps principles of CI/CD. It refers to a set of automated processes that allow developers and DevOps professionals to reliably and efficiently compile, build, and deploy their code to their production compute platforms. The processes can be thought of as modularized & composable blocks of code, each of which performs a specific task in the entire sequence.

Similarly, in the MLOps world, an ML pipeline is essentially a technique for codifying & automating the ML workflow of a project to produce ML models for production. An end-to-end ML pipeline consists of the various sequential processes that handle everything: from data extraction & preprocessing, through model training & validation, to the final deployment.

Image Source: Pipelines for production ML systems

The major transformation brought about by this concept is that teams no longer build & maintain individual ML models, but rather focus on developing & maintaining an entire pipeline as a product, which serves as the blueprint for experimenting with & developing newer models with minor modifications. This ensures faster iteration cycles & allows for a greater degree of scalability.
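To make the idea of "composable blocks" a little more concrete, here is a minimal sketch in plain Python. It is only an illustration on made-up toy data, not how any particular MLOps tool implements pipelines; all the function names (extract, preprocess, train_mean, train_linear, validate, run_pipeline) are hypothetical. Each step takes a shared context dictionary & returns it, so trying a new model is just a matter of swapping one step.

```python
from typing import Any, Callable, Dict, List, Optional

Context = Dict[str, Any]
Step = Callable[[Context], Context]


def extract(ctx: Context) -> Context:
    # Toy data following y = 2x + 1 (made up for illustration).
    ctx["x"] = [0.0, 1.0, 2.0, 3.0, 4.0]
    ctx["y"] = [2.0 * x + 1.0 for x in ctx["x"]]
    return ctx


def preprocess(ctx: Context) -> Context:
    # Center the feature around its mean.
    x_mean = sum(ctx["x"]) / len(ctx["x"])
    ctx["x_mean"] = x_mean
    ctx["x_centered"] = [x - x_mean for x in ctx["x"]]
    return ctx


def train_mean(ctx: Context) -> Context:
    # Baseline model: always predict the mean of the target.
    y_mean = sum(ctx["y"]) / len(ctx["y"])
    ctx["model"] = lambda x: y_mean
    return ctx


def train_linear(ctx: Context) -> Context:
    # One-feature least-squares fit on the centered feature.
    xc, y, x_mean = ctx["x_centered"], ctx["y"], ctx["x_mean"]
    y_mean = sum(y) / len(y)
    slope = sum(a * (b - y_mean) for a, b in zip(xc, y)) / sum(a * a for a in xc)
    ctx["model"] = lambda x: slope * (x - x_mean) + y_mean
    return ctx


def validate(ctx: Context) -> Context:
    # Mean absolute error on the same toy data (no hold-out split here).
    preds = [ctx["model"](x) for x in ctx["x"]]
    ctx["mae"] = sum(abs(p - t) for p, t in zip(preds, ctx["y"])) / len(preds)
    return ctx


def run_pipeline(steps: List[Step], ctx: Optional[Context] = None) -> Context:
    # Run every step in order, passing the context along.
    ctx = {} if ctx is None else ctx
    for step in steps:
        ctx = step(ctx)
    return ctx


# Same pipeline, different training block.
baseline = run_pipeline([extract, preprocess, train_mean, validate])
linear = run_pipeline([extract, preprocess, train_linear, validate])
print("baseline MAE:", baseline["mae"])
print("linear MAE:", linear["mae"])
```

Only the training step differs between the two runs; extraction, preprocessing & validation are reused as-is, which is the kind of "minor modification" the pipeline-as-a-product idea is meant to enable.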

No-Code & Low-Code ML

Initially, ML was (& even today, it is) a domain that required a fairly elaborate skillset, & developing applications required proper coding skills & domain knowledge. Of late, however, with a surge in the popularity & utility of AI/ML applications, no individual has remained untouched by their impact, & everyone wishes to leverage the power of AI to build cool stuff. This has spurred the rise of several tools & platforms that intend to democratize AI/ML by offering solutions that people without much background can use to develop their very own ML models & deploy them at scale. Such solutions include No-Code & Low-Code tools & platforms that not only allow newbies (without ML expertise) to spin up their ML models but also allow experienced data scientists to significantly reduce the time they spend iterating & experimenting with ML models.

No-Code Platforms

No-code refers to a set of technologies that enable users to create apps and systems without having to write them traditionally. Instead, the main functionality is available via visual interfaces and guided user actions (such as drag-and-drop), as well as pre-built connections with other tools for information exchange as needed. Following are some of the No-Code ML platforms that have made the cut in recent times:

Image Source: Mapping the no-code AI landscape

Although they allow rapid creation of prototype models for non-programmers, the major drawback of such platforms is the limit on functionality & the loss of granular control (& hence the degree of customizability) over the algorithms that are used, because the user cannot make any changes to the packaged code available off-the-shelf.

Low-Code Platforms

The term “low-code” simply refers to a reduction in the coding effort. Such tools also offer many elements that may be dragged & dropped to create ML pipelines. However, there is a provision to alter them by writing some code, which allows much greater flexibility & customization in accomplishing the required task. In practice, there is hardly any distinction between no-code and low-code platforms, since platforms that advertise themselves as “no-code” usually provide some room for customization as well. Following are some low-code ML tools & platforms that are widely used in the industry:

Low-code platforms will never be able to completely replace hand-coded algorithms. They can, however, assist developers in taking ownership of modular blocks that perform some tasks in the ML workflow, to speed up prototyping.

Low-Code ML with PyCaret

PyCaret is a Python-based open-source machine learning framework & end-to-end model management solution for automating machine learning workflows. It is a low-code package that can replace hundreds of lines of code with only a few lines, making experimentation much faster. It is basically a wrapper around various machine learning libraries and frameworks, including scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, etc.

Modularized Features of PyCaret

Following are some of the basic features that PyCaret offers to be included in any ML Pipeline:

Image Source: PyCaret 101 — for beginners

PyCaret includes a wide range of preprocessing steps & automatic feature engineering that can be applied based on the type of task at hand & also allows ensembling of selected models using different techniques.

PyCaret Modules

Modules in PyCaret capture the type of task that one is expected to perform during experimentation. They allow the user to perform preprocessing accordingly & select models based on appropriate algorithms pertaining to the task. Following are the modules that are included within PyCaret:

In this post, we will look at the Regression module in detail. Feel free to explore the functionality provided by the other modules at your leisure.

Experimentation using PyCaret

Installation

Installation of PyCaret is easy and takes only a few minutes. All hard dependencies are also installed with PyCaret. It can be installed using pip.

When using such a package manager, it is advisable to create & activate a virtual environment to avoid potential conflicts with other packages.

The latest release of PyCaret (2.3.1 at the time of writing this article) can be installed as follows:

$ pip install pycaret

Use the following commands if you wish to install PyCaret in a notebook:

pip install pycaret          # For local Jupyter notebook
!pip install pycaret         # For Google Colab or Azure Notebooks
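Once the install finishes, a quick sanity check is to ask Python which version it actually sees. The snippet below is a small sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical & not part of PyCaret.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pycaret")
if version is None:
    print("PyCaret is not installed in this environment")
else:
    print("PyCaret version:", version)
```

Running this inside the virtual environment (or notebook kernel) you plan to use helps confirm that the package landed in the right place.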

Building End-To-End ML Pipelines with PyCaret

Having installed PyCaret, we are now set to dive into all the cool functionality that this library offers. The following notebook contains a step-by-step tutorial to familiarize you with the basics of PyCaret (through the pycaret.regression module) & take you up to an intermediate level in understanding & implementing the various building blocks of an end-to-end ML pipeline.

The Basic PyCaret section will walk you through the following:

  • PyCaret Environment Setup
  • Comparison of Model Algorithms
  • Training & Fine-Tuning a Model
  • Evaluation of a Model through Plots
  • Making Predictions using Trained Model
  • Saving & Loading a Model
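The basic steps listed above map roughly onto a handful of functions in pycaret.regression. The sketch below is not the tutorial notebook's code; it is a minimal illustration on a made-up toy dataset. The helper names (make_toy_data, run_basic_experiment), the column names & the file name toy_rf_pipeline are all hypothetical. It assumes PyCaret & pandas are installed. Note that in PyCaret 2.x, setup() pauses to ask you to confirm the inferred column types; passing silent=True skips that prompt, but as far as I know this argument is not available in 3.x, so it is left to the caller through **setup_kwargs.

```python
import random
from typing import Dict, List


def make_toy_data(n_rows: int = 200, seed: int = 42) -> Dict[str, List[float]]:
    """Build a made-up regression dataset: price depends on area & rooms."""
    rng = random.Random(seed)
    area = [rng.uniform(30.0, 200.0) for _ in range(n_rows)]
    rooms = [float(rng.randint(1, 6)) for _ in range(n_rows)]
    noise = [rng.gauss(0.0, 5.0) for _ in range(n_rows)]
    price = [3.0 * a + 10.0 * r + e for a, r, e in zip(area, rooms, noise)]
    return {"area": area, "rooms": rooms, "price": price}


def run_basic_experiment(columns, target="price", **setup_kwargs):
    """Walk through the basic PyCaret steps; returns (best model, reloaded pipeline)."""
    # Imported lazily: PyCaret is a heavy dependency & slow to import.
    import pandas as pd
    from pycaret.regression import (
        setup, compare_models, create_model, tune_model, plot_model,
        predict_model, finalize_model, save_model, load_model,
    )

    data = pd.DataFrame(columns)

    # 1. Environment setup: infers column types & creates a train/hold-out split.
    setup(data=data, target=target, session_id=123, **setup_kwargs)

    # 2. Compare the available algorithms using cross-validation.
    best = compare_models(sort="R2")

    # 3. Train one specific model ("rf" = random forest), then tune it.
    rf = create_model("rf")
    tuned_rf = tune_model(rf, optimize="R2")

    # 4. Evaluate through a plot; save=True writes the image to disk.
    plot_model(tuned_rf, plot="residuals", save=True)

    # 5. Predict on the hold-out set created by setup().
    holdout_predictions = predict_model(tuned_rf)
    print(holdout_predictions.head())

    # 6. Refit on all the data, save the pipeline & load it back.
    final_rf = finalize_model(tuned_rf)
    save_model(final_rf, "toy_rf_pipeline")
    reloaded = load_model("toy_rf_pipeline")
    return best, reloaded
```

On PyCaret 2.x, a non-interactive run could look like run_basic_experiment(make_toy_data(), silent=True). The name of the prediction column returned by predict_model differs between major versions, so it is worth inspecting the printed output rather than hard-coding it.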

The Intermediate PyCaret section will involve the following:

  • Data Transformation
  • Feature Engineering
  • Model Ensembling
  • Custom Grid Search in Hyperparameter Tuning
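The intermediate topics above can be sketched in a similar way. Again, this is an illustration under assumptions rather than the notebook's actual code: the function name run_intermediate_experiment & the search space DT_CUSTOM_GRID are hypothetical, & the caller is expected to pass a pandas DataFrame plus the name of its target column. Data transformation & feature engineering are switched on through setup() arguments, while ensembling & the custom grid are applied to an already trained model.

```python
# Hypothetical search space for a decision tree; the keys are parameter
# names of scikit-learn's DecisionTreeRegressor.
DT_CUSTOM_GRID = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10],
}


def run_intermediate_experiment(data, target, **setup_kwargs):
    """Preprocess, ensemble & tune a decision tree; returns (bagged, tuned)."""
    # Imported lazily: PyCaret is a heavy dependency & slow to import.
    from pycaret.regression import (
        setup, create_model, ensemble_model, tune_model,
    )

    setup(
        data=data,
        target=target,
        session_id=123,
        normalize=True,            # data transformation: scale numeric features
        transformation=True,       # data transformation: power transform
        polynomial_features=True,  # feature engineering: polynomial terms
        **setup_kwargs,            # e.g. silent=True on PyCaret 2.x
    )

    # Train a base decision tree ("dt") on the preprocessed data.
    dt = create_model("dt")

    # Model ensembling: wrap the base model in a bagging ensemble.
    bagged_dt = ensemble_model(dt, method="Bagging")

    # Hyperparameter tuning restricted to our own search space.
    tuned_dt = tune_model(dt, custom_grid=DT_CUSTOM_GRID, optimize="MAE")

    return bagged_dt, tuned_dt
```

By default, tune_model samples a limited number of candidates from the grid instead of trying every combination, so increasing its n_iter argument is one way to cover more of a small grid like this one.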

Link to PyCaret Tutorial Notebook

Closing Remarks

Having understood data versioning in the previous post, we tried to dive into building & automating ML pipelines in this article by first understanding low-code & no-code ML frameworks & then getting some hands-on training with the PyCaret library. We have explored PyCaret to a decent extent & can now experiment with end-to-end ML pipelines. The Jupyter Notebook presents an in-depth tutorial about the regression module.

The question that now remains to be answered is “How do we deploy these models & infer from them so that they can be used in the wild?” We will answer this question in our final post, where we look at how to deploy our trained models using PyCaret on AWS, use MLflow for logging our experiments & quickly spin up a web server to host our deployed model as an API from which users can obtain predictions.

Following are the other parts of this Fundamentals of MLOps series:

Thank you & Happy Coding!


About the Author

Hey folks!

I’m Tezan Sahu, an Applied Scientist at Microsoft, an Amazon #1 Bestselling Author (for the book “Beyond Code: A Practical Guide for Data Scientists, Analysts & Engineers”), and co-author of “The Vision, Debugged” newsletter.

I am passionate about helping aspiring data scientists & software developers kickstart their careers, deliver consistent impact & become differentiated professionals in the field of AI & Data Science.

If you are interested in learning more about how you can leverage AI to stay ahead of the curve and boost your results, connect with me on LinkedIn & subscribe to my newsletter.
