Clustering Made Easy with PyCaret
Low-code Machine Learning with a Powerful Python Library
The content of this article was originally published in my latest book, Simplifying Machine Learning with PyCaret. You can click here to learn more about it.
One of the fundamental tasks in unsupervised machine learning is clustering. The goal of this task is to group the instances of a given dataset into clusters, based on their common characteristics. Clustering has many practical applications in various fields, including market research, social network analysis, bioinformatics, medicine and others. In this article, we are going to examine a clustering case study by using PyCaret, a Python library that supports all basic machine learning tasks, such as regression, classification, clustering and anomaly detection. PyCaret simplifies the machine learning workflow by following a low-code approach, making it a great choice for beginners, as well as experts who want to quickly prototype ML models.
Software Requirements
The code in this article should work on all major operating systems, i.e. Microsoft Windows, Linux and Apple macOS. You will need to have Python 3 installed on your computer, as well as JupyterLab. I suggest that you use Anaconda, a machine learning and data science toolkit that includes numerous helpful libraries and software packages. Anaconda can be freely downloaded at this link. Alternatively, you can use a cloud service like Google Colab to run Python code without worrying about installing anything on your machine. You can either create a new Jupyter notebook and enter the code yourself, or download it from this GitHub repository.
Installing PyCaret
The PyCaret library can be installed locally by executing the following command on your Anaconda terminal. You can also execute the same command on Google Colab or a similar service, to install the library on a remote server.
pip install pycaret[full]==2.3.4
After executing this command, PyCaret will be installed and you’ll be able to run all the code examples of this article. It is recommended to install the optional dependencies as well, by including the [full] specifier. Furthermore, installing the correct package version ensures maximum compatibility, as I used PyCaret ver. 2.3.4 while working on this article. Finally, creating a conda environment for PyCaret is considered best practice, as it will help you avoid conflicts with other packages and make sure you always have the correct dependencies installed.
K-Means Clustering
K-Means clustering¹ is one of the most popular and simplest clustering methods, making it easy to understand and implement in code. It is defined in the following formula.
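$$\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} W(C_k)$$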
K is the total number of clusters, while C_k represents an individual cluster. Our goal is to minimize W, which is the measure of within-cluster variation.
There are various ways to define within-cluster variation, but the most common choice is the squared Euclidean distance. Substituting it for W results in the following form of the K-Means objective.
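$$\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left(x_{ij} - x_{i'j}\right)^2$$

Here |C_k| denotes the number of instances in cluster C_k, p is the number of features and x_ij is the value of the j-th feature for instance i.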
Clustering with PyCaret
K-Means is a widely used method, but there are numerous others available, such as Affinity Propagation², Spectral Clustering³, Agglomerative Clustering⁴, Mean Shift Clustering⁵ and Density-Based Spatial Clustering (DBSCAN)⁶. We are now going to see how the PyCaret clustering module can help us easily train a model and evaluate its performance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pycaret.clustering import *
from sklearn.datasets import make_blobs
mpl.rcParams['figure.dpi'] = 300
We begin by importing some standard Python libraries, including NumPy, pandas, Matplotlib and Seaborn. We also import the PyCaret clustering functions, as well as the make_blobs() scikit-learn function that can be used to generate datasets. Finally, we set the Matplotlib figure DPI to 300, so we get high-resolution plots. Having this setting enabled isn’t necessary, so you can remove the last line if you want.
Generating a Synthetic Dataset
cols = ['column1', 'column2', 'column3', 'column4', 'column5']

arr = make_blobs(n_samples = 1000, n_features = 5, random_state = 20,
                 centers = 3, cluster_std = 1)

data = pd.DataFrame(data = arr[0], columns = cols)
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   column1  1000 non-null   float64
 1   column2  1000 non-null   float64
 2   column3  1000 non-null   float64
 3   column4  1000 non-null   float64
 4   column5  1000 non-null   float64
dtypes: float64(5)
memory usage: 39.2 KB
Instead of loading a real-world dataset, we are going to generate a synthetic one by using the make_blobs() scikit-learn function. This function generates datasets that are suitable for clustering models, and has various parameters that can be modified according to our needs. In this case, we created a dataset with 1000 instances, 5 features and 3 distinct clusters. Using a synthetic dataset to test our clustering model has various benefits, the main one being that we already know the actual number of clusters, so we can evaluate model performance easily. Real-world data are typically more complicated, i.e. they don’t always have clearly separated clusters, but working with a simple dataset lets you become acquainted with the tools and workflow.
Exploratory Data Analysis
data.hist(bins = 30, figsize = (12,10), grid = False)
plt.show()
The hist() pandas function lets us easily visualize the distribution of each variable. We can see that all variable distributions are either bimodal or multimodal, i.e. they have two or more peaks. This typically happens when the dataset contains multiple groups with different characteristics. In this case, the dataset was specifically created to contain 3 distinct clusters, so it is reasonable for the variables to have multimodal distributions.
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr().round(decimals=2), annot=True)
plt.show()
We used the corr() pandas function, as well as the heatmap() Seaborn function, to create a heatmap that visualizes the correlation values of all variable pairs. We can see that column2 and column3 have a strong linear relationship, with a correlation value of 0.93. The same is true for column3 and column4, which have a correlation value of 0.75. On the other hand, column1 is inversely correlated with all the other columns, especially column5, with a value of -0.85.
plot_kws = {'scatter_kws': {'s': 2}, 'line_kws': {'color': 'red'}}
sns.pairplot(data, kind='reg', vars=data.columns[:-1], plot_kws=plot_kws)
plt.show()
We used the pairplot() Seaborn function to create a scatter plot matrix for the synthetic dataset, with the diagonal showing the histogram of each variable. The dataset clusters are visible in the scatter plots, indicating that they are clearly separated from each other. As previously observed in the correlation heatmap, some of the variable pairs have strong linear relationships, while others have inverse linear relationships. This is highlighted by the regression lines that have been included in each scatter plot, drawn by setting the kind parameter of the pairplot() function to reg.
Initializing the PyCaret Environment
cluster = setup(data, session_id = 7652)
After completing the Exploratory Data Analysis (EDA), we are now going to use the setup() function to initialize the PyCaret environment. By doing this, a pipeline that prepares the data for model training and deployment will be created. In this case, the default settings are acceptable, so we aren’t going to modify any of the parameters. Regardless, this powerful function has numerous data preprocessing capabilities, so you can refer to the documentation page of the PyCaret Clustering module for more details.
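As a small illustration of what such customization could look like, the sketch below enables feature normalization during setup. The normalize and normalize_method parameters are part of the clustering setup() function, but this call is shown purely for reference and is not used in the rest of the article.

# Illustrative only: enable z-score normalization during environment setup.
# Not needed for our synthetic dataset, which is already well behaved.
cluster = setup(data, normalize = True, normalize_method = 'zscore',
                session_id = 7652)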
Creating a Model
model = create_model('kmeans')
The create_model() function lets us easily create and evaluate the clustering model of our preference, such as the K-Means algorithm. This function creates 4 clusters by default, so we could simply set the num_clusters parameter to 3, as this is the correct number. Instead of doing that, we are going to follow an approach that generalizes to real-world datasets, where the number of clusters is typically unknown. After executing the function, a number of performance metrics are printed, including the Silhouette⁷, Calinski-Harabasz⁸ and Davies-Bouldin⁹ scores. We are going to focus on the Silhouette Coefficient, which is defined in the following equation.
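$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$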
- s(i) is the Silhouette Coefficient of the dataset instance i.
- a(i) is the mean intra-cluster distance of i.
- b(i) is the mean nearest-cluster distance of i.
The resulting metric value is the mean Silhouette Coefficient of all instances, having a range between -1 and 1. Negative values indicate that an instance has been assigned to the wrong cluster, while values near 0 indicate that clusters are overlapping. On the other hand, positive values close to 1 indicate correct assignment. In our example, the value is 0.5822, suggesting that model performance can be improved by finding the optimal number of clusters for the dataset. Next, we are going to see how we can accomplish that by using the elbow method.
plot_model(model, 'elbow')
The plot_model() function lets us create various useful graphs for our model. In this case, we created an elbow plot that will help us find the optimal number of clusters for the K-Means model. The elbow method trains the clustering model for a range of K values and visualizes the distortion score for each of them¹⁰. The point of inflection on the curve, known as the elbow, is an indication of the optimal value for K. As expected, the plot has an elbow at K = 3, highlighted by the dashed vertical line.
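The plot_model() function takes care of this procedure for us. As a rough manual equivalent, the sketch below fits scikit-learn’s KMeans for a range of K values and plots the inertia (the sum of squared distances of the instances to their closest centroid), a measure closely related to the distortion score shown in the elbow plot; the exact numbers may differ slightly from the PyCaret graph.

from sklearn.cluster import KMeans

k_values = range(2, 11)

# Fit a K-Means model for each candidate K and record its inertia
inertias = [KMeans(n_clusters = k, random_state = 7652).fit(data).inertia_
            for k in k_values]

plt.plot(k_values, inertias, marker = 'o')
plt.xlabel('K')
plt.ylabel('Distortion (inertia)')
plt.show()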
model = create_model('kmeans', num_clusters = 3)
After using the elbow method to find the optimal number of clusters, we train the K-Means model again. As we can see, the mean Silhouette Coefficient increased to 0.7972, indicating improved model performance and better cluster assignment for each dataset instance.
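If you want to double-check this value outside of PyCaret, a minimal sketch like the one below can be used. It assumes that the object returned by create_model() is the underlying fitted scikit-learn estimator, exposing the cluster assignments through the standard labels_ attribute, and that setup() did not transform the features; under those assumptions the result should be close to the reported score.

from sklearn.metrics import silhouette_score

# Cluster label assigned to each instance by the fitted K-Means model
labels = model.labels_

# Mean Silhouette Coefficient over all instances, computed directly on the data
print(round(silhouette_score(data, labels), 4))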
Plotting the Model
plot_model(model, 'cluster')
As seen earlier, plot_model() is a useful function that can be used to plot various kinds of graphs for our clustering model. In this case, we created a 2D Principal Component Analysis (PCA) plot for the K-Means model. PCA can be used to project data to a lower-dimensional space while preserving most of the variance¹¹, a technique known as dimensionality reduction. After applying PCA to the synthetic dataset, the original 5 features have been reduced to 2 principal components. Furthermore, we can see that the clusters are clearly separated, and all of the dataset instances have been assigned to the correct cluster.
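For reference, a similar plot can be reproduced manually with a short scikit-learn sketch like the one below, which again assumes that the fitted model exposes its cluster assignments through the labels_ attribute.

from sklearn.decomposition import PCA

# Project the five features onto the first two principal components
components = PCA(n_components = 2).fit_transform(data)

# Color each instance according to its assigned cluster
plt.scatter(components[:, 0], components[:, 1], c = model.labels_, s = 4)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()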
Saving and Assigning the Model
save_model(model, 'clustering_model')

results = assign_model(model)
results.head(10)
The save_model() function lets us save the clustering model to the local disk for future use or deployment as an application. The model is stored as a pickle file that can be loaded with the complementary load_model() function. Furthermore, the assign_model() function returns the synthetic dataset with an additional column containing the cluster label that was assigned to each instance.
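As a quick sketch of how the saved model could be reused later, for example in a deployment script, the snippet below reloads the pipeline with load_model(). The predict_model() call is assumed here as the PyCaret clustering function that assigns cluster labels to new data; check the module documentation for the exact behavior in your PyCaret version.

loaded_model = load_model('clustering_model')

# Assign cluster labels to a dataset using the reloaded pipeline
predictions = predict_model(loaded_model, data = data)
predictions.head()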
Conclusion
Hopefully, the case study I provided in this article will help you get a grasp of the PyCaret clustering module. I also encourage readers to experiment with other datasets and practice the aforementioned techniques by themselves. In case you want to learn more about PyCaret, you can check Simplifying Machine Learning with PyCaret, the book I recently published about the library. Feel free to share your thoughts in the comments, or follow me on LinkedIn where I regularly post content about data science and other topics.
References
[1]: Steinley, Douglas. “K‐means clustering: a half‐century synthesis.” British Journal of Mathematical and Statistical Psychology 59.1 (2006): 1–34.
[2]: Dueck, Delbert. Affinity propagation: clustering data by passing messages. Toronto: University of Toronto, 2009.
[3]: Von Luxburg, Ulrike. “A tutorial on spectral clustering.” Statistics and computing 17.4 (2007): 395–416.
[4]: Ackermann, Marcel R., et al. “Analysis of agglomerative clustering.” Algorithmica 69.1 (2014): 184–215.
[5]: Derpanis, Konstantinos G. “Mean shift clustering.” Lecture Notes (2005): 32.
[6]: Khan, Kamran, et al. “DBSCAN: Past, present and future.” The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014). IEEE, 2014.
[7]: Rousseeuw, Peter J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” Journal of computational and applied mathematics 20 (1987): 53–65.
[8]: Caliński, Tadeusz, and Jerzy Harabasz. “A dendrite method for cluster analysis.” Communications in Statistics-theory and Methods 3.1 (1974): 1–27.
[9]: Davies, David L., and Donald W. Bouldin. “A cluster separation measure.” IEEE transactions on pattern analysis and machine intelligence 2 (1979): 224–227.
[10]: Yuan, Chunhui, and Haitao Yang. “Research on K-value selection method of K-means clustering algorithm.” J 2.2 (2019): 226–235.
[11]: Abdi, Hervé, and Lynne J. Williams. “Principal component analysis.” Wiley interdisciplinary reviews: computational statistics 2.4 (2010): 433–459.