All-in-One Guide to Clustering Techniques You Should Know in Unsupervised Machine Learning
Clustering is a powerful technique for unsupervised machine learning, i.e. learning from unlabeled data. The most common clustering algorithms are hierarchical clustering and K-Means clustering, and both are incredibly useful across a wide range of machine learning problems.
Cluster analysis can be a powerful data-mining tool for any organization that needs to identify discrete groups of customers, sales transactions, or other types of behaviors and things. For example, insurance providers use cluster analysis to detect fraudulent claims, and banks use it for credit scoring.
Clustering is commonly used for:
- Identifying fake news
- Spam filtering
- Marketing and sales segmentation
- Classifying network traffic
- Identifying fraudulent or criminal activity
Here are the topics we will be covering in this walkthrough of clustering techniques.
* PCA Decomposition (Dimensionality Reduction)
* K-Means (Centroid Based) Clustering
* Hierarchical (Divisive and Agglomerative) Clustering
* DBSCAN (Density Based) Clustering
# Importing Data
import pandas as pd
data = pd.read_csv('../input/unsupervised-learning-on-country-data/Country-data.csv')
Let’s check the contents of data
data.head()
#Description of the dataset
data.describe()
What do the column headings mean? Let’s check the data dictionary.
import pandas as pd
data_dict = pd.read_csv('../input/unsupervised-learning-on-country-data/data-dictionary.csv')
data_dict.head(10)
# Analyzing Data
#Data analysis baseline library
!pip install dabl
We are using Data Analysis Baseline Library here. It will help us analyze the data with respect to the target column.
import dabl
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 6)
dabl.plot(data, target_col='gdpp')
We can observe a strong positive correlation between “Income” and “GDPP”. “Exports”, “Imports” and “Health” also show a moderate positive correlation with “GDPP”.
However, we will now drop the “Country” column: not because it is the only categorical (object type) feature, but because it plays no role in deciding which cluster a record belongs to. In short, “Country” is not a feature we need for this unsupervised learning task.
# Exclude "Country" column
data = data.drop(‘country’, axis=1)
We will use simple profile reporting, which gives an easy overview of the variables and lets us explore interactions (pair-wise scatter plots), correlations (Pearson’s, Spearman’s, Kendall’s, Phik) and missing-value information, all in one place. The output is a bit long, and we need to scroll down and toggle different tabs to view all the results, but the time spent on it is worth it.
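The report itself can be generated with a couple of lines. Below is a minimal sketch assuming the ydata-profiling package (formerly pandas-profiling) is installed; the exact package and import name depend on your environment.
#Profile report sketch (assumes the ydata-profiling package; older setups expose the same class as pandas_profiling.ProfileReport)
from ydata_profiling import ProfileReport
profile = ProfileReport(data, title='Country Data Profiling Report', explorative=True)
profile.to_notebook_iframe()               #render inline in a notebook
#profile.to_file('country_profile.html')   #or export to a standalone HTML file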
Gist of Overview:
- Average deaths of children under age 5 (per 1,000 live births): 38.27
- Average life expectancy: 70.56 (highly negatively skewed distribution)
- Health spending has a roughly symmetric distribution with mean 6.82
- Average exports of goods and services per capita: 41.11
- Average imports of goods and services per capita: 46.89 (which is > avg. exports)
- Average net income per person: 17144.69 (highly positively skewed distribution)
- Average inflation: 7.78 (wide spread, ranging from -4.21 to +104)
- Average GDP per capita: 12964.15 (highly positively skewed distribution)
Gist of Interactions:
- Child Mortality has a strong negative correlation with Life Expectancy
- Total Fertility has somewhat positive correlation with Child Mortality
- Exports and Imports have rough positive correlation
- Income and GDPP have fairly positive correlation
Gist of Missing Values:
- There is no missing value in data
We will discuss correlation coefficients in detail later.
#More prominent correlation plot
import numpy as np
import seaborn as sns
corr = data.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(12, 12))
cmap = sns.light_palette('black', as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=None, center=0, square=True, annot=True, linewidths=.5, cbar_kws={"shrink": .9})
Insights from Pearson’s Correlation Coefficient Plot:
- Imports have high positive correlation with Exports (+0.74)
- Income has fairly high positive correlation with Exports (+0.52)
- Life Expectancy has fairly high positive correlation with Income (+0.61)
- Total Fertility has very high positive correlation with Child Mortality (+0.85)
- GDPP has very high positive correlation with Income (+0.90)
- GDPP has fairly high positive correlation with Life Expectancy (+0.60)
- Total Fertility has fairly high negative correlation with Life Expectancy (-0.76). I found this an interesting insight, but let’s not forget that “correlation does not imply causation”!
(1) Principal Component Analysis
Principal Component Analysis (PCA) is a popular technique for deriving a small set of low-dimensional features from a large set of variables. Sometimes such a reduced set of features can reveal a distinct number of groups with similar characteristics, so PCA can be an insightful clustering tool (or a preprocessing step before applying clustering). We will standardize our data first and use the scaled data for all clustering work that follows.
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
data_scaled=sc.fit_transform(data)
Here, I have used the singular value decomposition solver “auto” to get the no. of principal components. You can also use the “randomized” solver with a random state seed like 0 or 12345.
from sklearn.decomposition import PCA
pc = PCA(svd_solver='auto')
pc.fit(data_scaled)
print('Total no. of principal components =', pc.n_components_)

#Print Principal Components
print('Principal Component Matrix :\n', pc.components_)
Let us check the amount of variance explained by each principal component here. They will be arranged in decreasing order of their explained variance ratio.
#The amount of variance that each PC explains
var = pc.explained_variance_ratio_
print(var)

#Plot explained variance ratio for each PC
plt.bar([i for i, _ in enumerate(var)], var, color='black')
plt.title('PCs and their Explained Variance Ratio', fontsize=15)
plt.xlabel('Number of components', fontsize=12)
plt.ylabel('Explained Variance Ratio', fontsize=12)
Using these cumulative variance ratios for all PCs, we will now draw a scree plot. It is used to determine the number of principal components to keep in this principal component analysis.
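The cumulative ratios themselves were not computed in the listing above; here is a small sketch to obtain them (cum_var is the name the scree-plot code below expects):
#Cumulative explained variance ratio across PCs
cum_var = np.cumsum(var)
print(cum_var)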
# Scree Plot
plt.plot(cum_var, marker='o')
plt.title('Scree Plot: PCs and their Cumulative Explained Variance Ratio', fontsize=15)
plt.xlabel('Number of components', fontsize=12)
plt.ylabel('Cumulative Explained Variance Ratio', fontsize=12)
The plot indicates that the 90% threshold is crossed at PC = 4. Ideally, we could keep 4 (or at most 5) components here: before PC = 5 the plot follows an upward trend, and after 5 it is almost steady. However, we have retained all 9 PCs here to keep the full data in the results, and for visualization in a 2-D figure we have plotted only PC1 vs PC2.
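As a side note, if we did want to keep only enough components to cross that 90% threshold, scikit-learn accepts the threshold directly as a float n_components. A small sketch (variable names here are illustrative):
#Sketch: retain just enough PCs to explain ~90% of the variance
pca_90 = PCA(n_components=0.90, svd_solver='full')
data_pca_90 = pca_90.fit_transform(data_scaled)
print('Components kept:', pca_90.n_components_)   #should agree with the scree plot (around 4)
print('Reduced shape:', data_pca_90.shape)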
#Principal Component Data Decomposition
colnames = list(data.columns)
pca_data = pd.DataFrame({'Features': colnames, 'PC1': pc.components_[0], 'PC2': pc.components_[1],
                         'PC3': pc.components_[2], 'PC4': pc.components_[3], 'PC5': pc.components_[4],
                         'PC6': pc.components_[5], 'PC7': pc.components_[6], 'PC8': pc.components_[7],
                         'PC9': pc.components_[8]})
pca_data
Let’s now visualize the two main principal components.
#Visualize 2 main PCs
fig = plt.figure(figsize = (12,6))
sns.scatterplot(x=pca_data.PC1, y=pca_data.PC2, hue=pca_data.Features, marker='o', s=500)
plt.title('PC1 vs PC2', fontsize=15)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.show()
We can see that the 1st principal component (X-axis) is driven mainly by features like life expectancy, gdpp and income, while the 2nd principal component (Y-axis) is driven predominantly by imports and exports.
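The scatter plot above shows the feature loadings. To see the countries themselves in the reduced space, we can also project the scaled records onto the first two components; a short sketch:
#Sketch: project the scaled records onto PC1 and PC2
pc_scores = pc.transform(data_scaled)
plt.figure(figsize=(12,6))
plt.scatter(pc_scores[:, 0], pc_scores[:, 1], s=40)
plt.title('Countries projected onto PC1 vs PC2', fontsize=15)
plt.xlabel('PC1 score', fontsize=12)
plt.ylabel('PC2 score', fontsize=12)
plt.show()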
#Export PCA results to file
pca_data.to_csv("PCA_results.csv", index=False)
(2) K-Means Clustering
This is the most popular clustering method. In each iteration it uses the Euclidean distance between data points and cluster centroids to decide which cluster each point should belong to, and proceeds accordingly. To decide how many clusters to consider, we can employ several methods; the most basic and widely used one is the **Elbow Curve**.
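To make the assignment step concrete, here is a tiny sketch with made-up points and centroids: each point is assigned to whichever centroid is closest in Euclidean distance.
#Toy sketch of one K-Means assignment step (hypothetical points and centroids)
import numpy as np
points = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]])
centroids = np.array([[1.0, 2.0], [8.0, 8.0]])
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)  #pairwise Euclidean distances
assignments = dists.argmin(axis=1)  #index of the nearest centroid for each point
print(assignments)  #[0 1 0]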
**Method-1: Plotting Elbow Curve**
In this curve, wherever we observe a “knee”-like bend, we can take that number as the ideal no. of clusters for the K-Means algorithm.
#Plotting Elbow Curve
from sklearn.cluster import KMeans
from sklearn import metrics
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,10))
visualizer.fit(data_scaled)
visualizer.show()
Here, along the Y-axis, “distortion” is defined as “the sum of the squared differences between the observations and the corresponding centroid”. It is the same as WCSS (Within-Cluster Sum of Squares); the quick check below confirms that it matches KMeans’ inertia_.
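As a quick sanity check (a sketch, not part of the original notebook), we can compute WCSS by hand for one value of k and confirm it matches the inertia_ attribute that scikit-learn reports:
#Sketch: hand-computed WCSS vs KMeans.inertia_ for k=3
km_check = KMeans(n_clusters=3, random_state=12345).fit(data_scaled)
wcss = sum(np.sum((data_scaled[km_check.labels_ == i] - center) ** 2)
           for i, center in enumerate(km_check.cluster_centers_))
print(wcss, km_check.inertia_)  #the two values should agree up to floating-point error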
We will now fit our scaled data to a K-Means model with 3 clusters, label each data point (each record) with one of these 3 clusters, and then look at the cluster centroids.
#Fitting data into K-Means model with 3 clusters
km_3=KMeans(n_clusters=3,random_state=12345)
km_3.fit(data_scaled)
print(km_3.cluster_centers_)
Each record has now got a label among 0, 1, 2 (available via km_3.labels_), i.e. the ID of the cluster it belongs to. We can count the records in each cluster now.
pd.Series(km_3.labels_).value_counts()
We see that the highest number of records belongs to the first cluster.
Now, we are interested to check how good is our K-Means clustering model. Silhouette Coefficient is one such metric to check that. The **Silhouette Coefficient** is calculated using:
* the mean intra-cluster distance ( a ) for each sample
* the mean nearest-cluster distance ( b ) for each sample
* The Silhouette Coefficient for a sample is (b - a) / max(a, b) (see the quick check below)
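A quick numeric check of the formula with hypothetical distances:
#Toy check of the silhouette formula for one sample (hypothetical a and b)
a, b = 0.5, 1.0                  #a: mean intra-cluster distance, b: mean nearest-cluster distance
s = (b - a) / max(a, b)
print(s)                         #0.5 -> the sample sits comfortably inside its own cluster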
# calculate Silhouette Coefficient for K=3
from sklearn import metrics
metrics.silhouette_score(data_scaled, km_3.labels_)
# calculate SC for K=2 through K=10
k_range = range(2, 10)
scores = []
for k in k_range:
    km = KMeans(n_clusters=k, random_state=12345)
    km.fit(data_scaled)
    scores.append(metrics.silhouette_score(data_scaled, km.labels_))
print(scores)  #silhouette score for each k
We observe the highest silhouette scores for 3 and 4 clusters. However, the Elbow Curve showed the “knee”-like bend at 3 clusters, so we will do further analysis to choose the ideal no. of clusters between 3 and 4.
For further analysis, we will consider **Davies-Bouldin Score** apart from Silhouette Score. **Davies-Bouldin Score** is defined as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters which are farther apart and less dispersed will result in a better score.
We will also analyze **SSE (Sum of Squared Errors)**. SSE is the sum of the squared differences between each observation and its cluster’s mean. It can be used as a measure of variation within a cluster: if all cases within a cluster are identical, the SSE equals 0. The formula is SSE = Σᵢ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖², where Cᵢ denotes the i-th cluster and μᵢ its centroid (mean).
**Method-2: Plotting of SSE, Davies-Bouldin Scores, Silhouette Scores to Decide Ideal No. of Clusters**
from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples

sse, db, slc = {}, {}, {}
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=12345).fit(data_scaled)
    if k == 4:
        labels = kmeans.labels_
    clusters = kmeans.labels_
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center
    db[k] = davies_bouldin_score(data_scaled, clusters)
    slc[k] = silhouette_score(data_scaled, clusters)

#Plotting SSE
plt.figure(figsize=(12,6))
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster", fontsize=12)
plt.ylabel("SSE (Sum of Squared Errors)", fontsize=12)
plt.title("Sum of Squared Errors vs No. of Clusters", fontsize=15)
plt.show()
We can see a “knee”-like bend at both 3 and 4. Choosing no. of clusters = 4 seems slightly better here, because after 4 no further “knee”-like bend is observed. Still, we will analyse further to decide between 3 and 4.
#Plotting Davies-Bouldin Scores
plt.figure(figsize=(12,6))
plt.plot(list(db.keys()), list(db.values()))
plt.xlabel("Number of cluster", fontsize=12)
plt.ylabel("Davies-Bouldin values", fontsize=12)
plt.title("Davies-Bouldin Scores vs No. of Clusters", fontsize=15)
plt.show()
From the Davies-Bouldin scores, no. of clusters = 3 again looks like the best choice here.
plt.figure(figsize=(12,6))
plt.plot(list(slc.keys()), list(slc.values()))
plt.xlabel("Number of cluster", fontsize=12)
plt.ylabel("Silhouette Score", fontsize=12)
plt.title("Silhouette Score vs No. of Clusters", fontsize=15)
plt.show()
No. of clusters = 3 seems the best choice here as well. The silhouette score ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A score of nearly 0.28 seems a reasonable one.
**Silhouette Plots for Different No. of Clusters:**
We will now draw Silhouette Plots for different no. of clusters for getting more insights. Side by side, we will observe the shape of the clusters in 2-dimensional figure.
#Silhouette Plots for Different No. of Clusters
import matplotlib.cm as cm
import numpy as np

for n_clusters in range(2, 10):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 8)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but here the range is from -0.2 till 1
    ax1.set_xlim([-0.2, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(data_scaled) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator seed of 12345 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, max_iter=1000, random_state=12345)
    cluster_labels = clusterer.fit_predict(data_scaled)

    # The silhouette_score gives the average value for all the samples
    # This gives a perspective into the density and separation of the formed clusters
    silhouette_avg = silhouette_score(data_scaled, cluster_labels)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(data_scaled, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to cluster i and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels
    ax1.set_xticks([-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(data_scaled[:, 0], data_scaled[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
#Attach the K-Means (k=3) cluster labels to the data
preds = km_3.labels_
data_df = pd.DataFrame(data)
data_df['KM_Clusters'] = preds
data_df.head(10)
We will now visualize the 3 clusters for various pairs of features. Initially, I chose the pairs randomly; later, I chose pairs involving important features such as “gdpp”, “income” and “inflation”. Since we are analyzing country profiles and “gdpp” is the main indicator of a country’s status, that is our primary focus.
#Visualize clusters: Feature Pair-1 (child mortality vs exports)
import matplotlib.pyplot as plt_1
plt_1.rcParams['axes.facecolor'] = 'lightblue'
plt_1.figure(figsize=(12,6))
plt_1.scatter(data_scaled[:,0], data_scaled[:,1], c=preds)  #colour each point by its K-Means (k=3) cluster
plt_1.title("Child Mortality vs Exports (Visualize KMeans Clusters)", fontsize=15)
plt_1.xlabel("Child Mortality", fontsize=12)
plt_1.ylabel("Exports", fontsize=12)
plt_1.show()
(3) Hierarchical Clustering
There are two types of hierarchical clustering: **Divisive** and **Agglomerative**. In the divisive (top-down) method, all observations start in a single cluster, which is then split into the two least similar clusters; those clusters are split again, and the process goes on. In the agglomerative (bottom-up) method, the opposite approach is followed. Here, the ideal no. of clusters is decided using a **dendrogram**.
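This section uses dendrograms to choose the number of clusters, but note that scikit-learn can also fit the agglomerative model directly. A small sketch, assuming we go with 3 clusters (the same count K-Means suggested):
#Sketch: agglomerative (bottom-up) clustering with Ward linkage and 3 clusters
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(data_scaled)
print(pd.Series(agg_labels).value_counts())   #records per hierarchical cluster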
**Method-1: Dendrogram Plotting using Clustermap**
import seaborn as sns
cmap = sns.cubehelix_palette(as_cmap=True, rot=-.3, light=1)
g = sns.clustermap(data_scaled, cmap=cmap, linewidths=.5)
From the above dendrogram, we can consider a minimum of 2 or a maximum of 6 clusters. We will cross-check the dendrogram using **Ward’s Method**. Ward’s method is an alternative to single-link clustering; it looks for a partition with a small sum of squares (i.e. it minimises the within-cluster variance).
**Method-2: Dendrogram Plotting using Ward’s Method**
# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch

plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.grid'] = False
dendrogram = sch.dendrogram(sch.linkage(data_scaled, method='ward'))
plt.title("Dendrogram using Ward's Method", fontsize=15)
plt.xlabel('Clusters', fontsize=12)
plt.ylabel('Euclidean distances', fontsize=12)
plt.show()
(4) DBSCAN Clustering
DBSCAN is an abbreviation of “Density-based spatial clustering of applications with noise”. This algorithm groups together points that are close to each other based on a distance measurement (usually Euclidean distance) and a minimum number of points. It also marks noise as outliers (noise means the points which are in low-density regions).
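A toy sketch (with made-up 2-D points) of how eps and min_samples drive this: dense groups become clusters, while isolated points receive the noise label -1.
#Toy DBSCAN sketch on hypothetical 2-D points
from sklearn.cluster import DBSCAN
toy = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],    #first dense blob  -> one cluster
                [5, 5], [5.1, 4.9], [4.9, 5.2],    #second dense blob -> another cluster
                [10, 0]])                          #isolated point    -> noise (-1)
print(DBSCAN(eps=0.5, min_samples=3).fit(toy).labels_)   #e.g. [0 0 0 1 1 1 -1]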
I found an interesting result with DBSCAN when I used all features of the country data: it gave me a single cluster. I presume that was bound to happen because our data is fairly evenly spread, so density-wise the algorithm could not split the data points into more than one cluster. Hence, I used only the features which have high correlation with “gdpp”. I also kept “child_mort” and “total_fer” in my working dataset since they are polarized: some data points have extremely high values and some extremely low (see the corresponding scatter plots in the data-profiling section at the beginning).
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler

Clus_dataSet = data[['child_mort','exports','health','imports','income','inflation','life_expec','total_fer','gdpp']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = np.array(Clus_dataSet, dtype=np.float64)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=1, min_samples=3).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
#data['Clus_Db'] = labels

realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))

# A sample of clusters
print(data[['child_mort','exports','health','imports','income','inflation','life_expec','total_fer','gdpp']].head())

# number of labels
print("number of labels: ", set(labels))
We have got 7 clusters using density-based clustering, which is a distinct observation: 7 is much higher than the 3 clusters we settled on with the approaches used earlier.
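To see how the records are spread across these density-based clusters (and how many, if any, were flagged as noise), a quick sketch:
#Sketch: count records per DBSCAN label (-1 marks noise points)
unique_labels, counts = np.unique(labels, return_counts=True)
for lbl, cnt in zip(unique_labels, counts):
    name = 'noise' if lbl == -1 else 'cluster ' + str(lbl)
    print(name, ':', cnt, 'countries')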
# save the cluster labels and sort by cluster
datacopy = data.copy()
datacopy = datacopy.drop('KM_Clusters', axis=1, errors='ignore')  #drop the K-Means label column if it was carried over
datacopy['DB_cluster'] = db.labels_

#Visualize clusters: Random Feature Pair-1 (income vs gdpp)
import matplotlib.pyplot as plt_3
plt_3.rcParams['axes.facecolor'] = 'orange'
plt_3.figure(figsize=(12,6))
plt_3.scatter(datacopy['income'], datacopy['gdpp'], c=db.labels_)
plt_3.title('Income vs GDPP (Visualize DBSCAN Clusters)', fontsize=15)
plt_3.xlabel("Income", fontsize=12)
plt_3.ylabel("GDPP", fontsize=12)
plt_3.show()

#Visualize clusters: Random Feature Pair-2 (inflation vs gdpp)
plt_3.figure(figsize=(12,6))
plt_3.scatter(datacopy['inflation'], datacopy['gdpp'], c=db.labels_)
plt_3.title('Inflation vs GDPP (Visualize DBSCAN Clusters)', fontsize=15)
plt_3.xlabel("Inflation", fontsize=12)
plt_3.ylabel("GDPP", fontsize=12)
plt_3.show()
If you liked this article and share an interest in similar projects, let’s connect, grow our network and work on more real-world projects together.
For more details, connect with me on my LinkedIn account!
THANKS!!!!