Uncovering Hidden Patterns: A Clustering Exploration with PyCaret

Sneha Sunil
Apr 12, 2023


We humans are always looking for ways to minimize the effort and time we put in to get our work done efficiently. As data scientists, imagine if we had a tool that could make our lives easier by performing standard data science tasks like data preprocessing, model building, and hyperparameter tuning. This is where PyCaret comes into the picture.

PyCaret is a powerful, easy-to-use, low-code Python library that automates data science and machine learning workflows. With just a few lines of code, it can create a pipeline that performs multiple tasks such as data cleaning, feature engineering, and model training, which makes it simple to use. Users can reproduce, test, and deploy data science workflows, making life easier for people with no data science background or coding experience. This is helpful not just for newcomers, but also for experienced data scientists and professionals across industries who would rather focus on business needs than on coding, increasing their productivity.

As mentioned in the official documentation:

“PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.”

Wondering where the name PyCaret came from?

PyCaret was inspired by the caret package in R, which automates most of the important tasks involved in comparing the performance of various machine learning algorithms on a given dataset.

In this tutorial, I will be exploring the different functionalities of the PyCaret package in Python! This blog is directed at anyone interested in data science or machine learning who would like to spend less time coding and more time analyzing the problem at hand.

Note: This blog is mainly about exploring the various functions in the PyCaret package rather than about the machine learning problem itself.

Before getting into the details of PyCaret, let us review some basic types in machine learning. There are three main categories:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

In this tutorial, we will be working with a type of unsupervised machine learning called clustering on the iris dataset, which can be obtained here. If you are already familiar with the terminology below, feel free to skip ahead.

Unsupervised machine learning is used when the dataset at hand has no labels. This type of learning helps uncover hidden patterns and structures in the data. Since there is no definite response label, observations are grouped based on their similar characteristics.

Clustering is one such unsupervised machine learning technique, in which data points are grouped into different clusters depending on their similar features. There are multiple clustering algorithms, such as k-means, hierarchical, and density-based clustering. We will be working specifically with k-means, one of the most popular unsupervised learning algorithms, widely used to group similar data.

So, let’s get started!

Installation instructions:

First, we will have to install PyCaret on our system. It is always better to create a conda environment first and then install the PyCaret package inside it, as this helps in managing dependencies and isolating faults faster. Please refer to the code blocks below to ensure that your system has the necessary packages and tools required to use PyCaret. Note that for brevity, the commands are limited to Windows operating systems.

conda create --name my_new_env

This will create a new environment called ‘my_new_env’ on your system. You can also specify the Python version for the environment, as shown below.

conda create --name my_new_env python=3.X

Once the environment is set up, you have to activate it using the below command:

conda activate my_new_env

Next, you can install PyCaret in the newly activated environment:

pip install pycaret

Once the installation is complete, we can import the required libraries in JupyterLab, as shown below:

Importing relevant libraries
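
A minimal sketch of these imports, assuming the wildcard import style used in the official PyCaret examples:

import pandas as pd
from pycaret.clustering import *  # brings in setup, create_model, assign_model, plot_model, etc.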

Let us first explore some of the important functions available in the PyCaret package. More details can be found here.

Brief summary of a few functions in PyCaret

Step 1: Keeping data aside for clustering

Note: This is not the standard train-test split; that is already taken care of inside the ‘setup()’ function, which we will look at in the next step.

Check the image below to see how the iris data looks. Notice that I have removed the Id column, as it does not add any value in training the model.

Initial view of iris data

Here, I will separate the data so that 95% of it is used for training and testing the model. The remaining 5% will be unseen data that we will use later for cluster labeling.

This can be done by drawing a sample from the dataset using the ‘sample()’ function, to which we provide the fraction of data needed for model training and testing.

Separating data
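
A minimal sketch of this split; the file name ‘Iris.csv’ and the random_state value are assumptions here:

# Load the iris data and drop the Id column, which adds no value for training.
data = pd.read_csv('Iris.csv')
data = data.drop('Id', axis=1)

# Keep 95% of the rows for training/testing; hold out the remaining 5% as unseen data.
data_train = data.sample(frac=0.95, random_state=123)
data_unseen = data.drop(data_train.index)
data_train.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)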

Step 2: Initialize PyCaret

The first and foremost step is to execute the setup() function, which initializes the PyCaret environment. When this function is executed, a pipeline is created that performs data preparation for model training and deployment.

This is the first function to be executed in PyCaret and must be run before anything else. It has only one required parameter: a pandas data frame. The other parameters are optional and can be used as needed. This powerful function can also customize the data preprocessing steps, which are documented here. We also pass a session ID during setup, a random number that is reused by all the other functions for the sake of reproducibility. Here we will set the value to 123.

Once we run the setup, PyCaret will detect the data types and present us with a table of information that can be used to check various details: whether there are any missing values in the data, how many numeric and categorical features are available, the sample size, the presence of ordinal features, and so on. When there are missing values, they are automatically filled with the mean for numeric features and a constant value for categorical features.
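
A minimal sketch of the initialization, assuming the training frame from the previous step is named data_train:

# Initialize the PyCaret clustering experiment.
# session_id=123 makes the run reproducible across the other functions.
setup(data_train, session_id=123)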

Result table after setting up PyCaret

Step 3: Create the model

Creating a clustering model in PyCaret is simple and intuitive. We can see the list of all models available for a clustering problem by executing the ‘models()’ function, as shown in the image. The library offers multiple clustering models, such as k-means, ap, meanshift, dbscan, optics, etc.

For this tutorial, we will be using the kmeans model, as it is the most widely used algorithm for clustering in unsupervised machine learning.

Models for clustering available in library
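
A quick sketch of listing the available algorithms (run after setup()):

# List all clustering algorithms available in PyCaret.
models()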

We can easily create a model using the `create_model()` function which has one required parameter: the name of the algorithm.

Once this function is executed, PyCaret will train the model and display a table of clustering metrics. The Silhouette Score ranges from -1 to 1, with higher values indicating better clustering. It measures how similar an object is to its own cluster versus other clusters.

For simplicity, we will assume here that there are 3 clusters and pass num_clusters=3 while creating the model. The number of clusters can either be treated as a hyperparameter or initialized based on domain knowledge.

So, now we have a kmeans model with 3 clusters. (The default value for num_clusters is 4.)

Create clustering model using kmeans
Created model
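
A minimal sketch of the model creation step:

# Train a k-means model with 3 clusters (the default num_clusters is 4).
# PyCaret prints the clustering metrics, including the Silhouette score.
kmeans = create_model('kmeans', num_clusters=3)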

Step 4: Assign cluster labels to the data

The next step is to assign cluster labels to the dataset using the model we created. This can be done with the ‘assign_model()’ function.

Assigning clusters to each datapoint

Once the function is executed, we can see that a new column, ‘Cluster’, has been added to the data frame, containing the assigned cluster for each row (Cluster 0, Cluster 1, or Cluster 2) based on the underlying patterns in the data.
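
A minimal sketch of the labeling step, reusing the kmeans model from above:

# Append a 'Cluster' column with the assigned label for every row.
kmeans_results = assign_model(kmeans)
kmeans_results.head()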

Step 5: Analyze the results using the plot function in PyCaret

Let us now use the plot_model() function in the package to check out various graphs of our model.

The elbow plot is a visualization available in PyCaret that suggests the optimal number of clusters for our data. It shows the relationship between the number of clusters and the WCSS (within-cluster sum of squares), which is the sum of squared distances between each point and the centroid of the cluster it belongs to.

From the graph, we can see that the optimal number of clusters suggested by the plot is k = 4.

The silhouette plot is another visualization available in PyCaret. It is used to check the quality and validate the consistency of the clustering by plotting the silhouette coefficient values for each cluster. The silhouette value measures an object’s cohesion with its own cluster (its similarity to that cluster) compared to the other clusters.

Silhouette plot of K means clustering
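
A minimal sketch of generating both plots:

# Elbow plot: WCSS versus the number of clusters, with the suggested k marked.
plot_model(kmeans, plot='elbow')

# Silhouette plot: per-cluster silhouette coefficient values.
plot_model(kmeans, plot='silhouette')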

Step 6: Save the model for future use

If you want to reuse the model we built on new data in the future, the entire pipeline need not be recreated from scratch. Instead, the current model can be saved as a pickle (.pkl) file, which contains the entire data preprocessing and model building pipeline and can easily be transferred to any environment. Isn’t that wonderful?

Saved model
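
A minimal sketch of saving the pipeline under the name used in this post:

# Persist the preprocessing pipeline and trained model as 'Final model.pkl'.
save_model(kmeans, 'Final model')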

As you can see from the image below, ‘Final model.pkl’ was created once we saved the model.

Pickle file of the saved model

Step 7: Load the saved model

Since we have already saved our model for future use, let us test it out now! First, I will load the model using the name of the .pkl file. We can do this using the ‘load_model’ function, as shown below:

Loading the model
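
A minimal sketch of loading it back:

# Restore the saved pipeline; the .pkl extension is added automatically.
loaded_kmeans = load_model('Final model')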

Step 8: Use the loaded model to perform clustering

Once we load the saved clustering model, we can perform predictions. I will take the unseen data that we set aside during the initial split. For this, we will use the predict_model() function of PyCaret, which takes the loaded model as its first argument and the data to be clustered (a pandas data frame) as its second.

Perform clustering for unseen data
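
A minimal sketch of clustering the held-out rows, assuming the data_unseen frame from Step 1:

# Assign each unseen row to one of the learned clusters.
predictions = predict_model(loaded_kmeans, data=data_unseen)
predictions.head()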

We can see above that the unseen data points were grouped into their respective clusters based on the trained model.

Conclusion

After going through some of the fundamental functions available in the PyCaret package, we now have a sense of how beneficial and simple this package is. We did not have to write much code to perform multiple machine learning tasks. PyCaret automates most of them, like plotting, handling missing values, data preprocessing, feature engineering, and hyperparameter tuning. A saved model can be transferred easily from one environment to another, and everything we did with PyCaret can be stored as a pipeline and used directly for deployment. Last but not least, it improves productivity. It is one of the most popular Python libraries, and you should definitely give it a try if you haven’t yet!
