K-Means Clustering Algorithm for Unsupervised Learning Tasks

K-Means Clustering Theory and Implementation with Scikit-Learn

Fares Sayah
Feb 6, 2023

K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means discovers groups in data without having first been trained on labeled examples. Once the algorithm has been run and the groups are defined, any new data point can easily be assigned to the most relevant group.

For this project, we will attempt to use K-Means clustering to cluster universities into two groups: Private and Public.

It is very important to note that we actually have the labels for this data set, but we will NOT use them when fitting K-Means, since it is an unsupervised learning algorithm.

Under normal circumstances, you use K-Means precisely because you don’t have labels. In this case, we will use the labels only to get an idea of how well the algorithm performed, but you won’t usually be able to do this for K-Means, so the classification report and confusion matrix at the end of this project don’t truly make sense in a real-world setting.

Types of Clustering

Clustering algorithms are unsupervised machine learning methods used to group similar data points into clusters based on their features. Here are a few common types of clustering algorithms, along with their benefits and use cases:

  1. K-Means Clustering: It is a simple and fast clustering algorithm, which partitions a dataset into k clusters. It is best suited for spherical-shaped clusters and continuous variables. It is widely used in image and pattern recognition, market segmentation, and customer profiling.
  2. Hierarchical Clustering: It is an iterative algorithm that builds a hierarchy of clusters, starting from individual data points and merging them into larger clusters. It is best suited for non-spherical-shaped clusters and data with a clear hierarchy; because it typically scales quadratically or worse with the number of points, it is less practical for very large datasets.
  3. Density-Based Clustering: This type of clustering identifies clusters based on the density of data points. It is best suited for clusters with varying shapes and sizes, and for datasets with noise and outliers. Examples of algorithms include DBSCAN and HDBSCAN.
  4. Gaussian Mixture Model (GMM) Clustering: It is a probabilistic algorithm that models data as a mixture of Gaussian distributions. It is best suited for datasets with complex distributions and non-spherical-shaped clusters.
  5. Spectral Clustering: It is a graph-based clustering algorithm that uses the eigenvectors of a similarity matrix to identify clusters. It is best suited for datasets with non-linearly separable clusters, although the eigendecomposition makes it expensive on very large datasets.

Each type of clustering algorithm has its strengths and weaknesses, and the choice of algorithm will depend on the specific requirements of the problem being solved and the characteristics of the data.

Import Libraries

Import the libraries you usually use for data analysis.
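A minimal sketch of a typical setup follows; the file name College_Data.csv and the use of the school name as the DataFrame index are assumptions based on the output below (777 rows, 18 columns).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the college data; the file name and index column are assumptions.
df = pd.read_csv('College_Data.csv', index_col=0)
df.info()
```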

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   private      777 non-null    object
 1   apps         777 non-null    int64
 2   accept       777 non-null    int64
 3   enroll       777 non-null    int64
 4   top10perc    777 non-null    int64
 5   top25perc    777 non-null    int64
 6   f_undergrad  777 non-null    int64
 7   p_undergrad  777 non-null    int64
 8   outstate     777 non-null    int64
 9   room_board   777 non-null    int64
 10  books        777 non-null    int64
 11  personal     777 non-null    int64
 12  phd          777 non-null    int64
 13  terminal     777 non-null    int64
 14  s_f_ratio    777 non-null    float64
 15  perc_alumni  777 non-null    int64
 16  expend       777 non-null    int64
 17  grad_rate    777 non-null    int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an important step in the process of using clustering algorithms for data analysis. The steps involved in EDA for clustering algorithms are:

  1. Data collection and preparation: Collect the relevant data and preprocess it to remove missing values, outliers, and irrelevant features.
  2. Data visualization: Plot the data using different visualization techniques such as histograms, scatter plots, density plots, etc., to understand the distribution of the data and identify patterns or anomalies.
  3. Dimensionality reduction: Reduce the dimensionality of the data, if necessary, to avoid the curse of dimensionality, which can negatively impact the clustering performance. Techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) can be used for dimensionality reduction.
  4. Data normalization: Normalize the data to ensure that features are on the same scale, as clustering algorithms are sensitive to the scale of the features.
  5. Distance metric selection: Select an appropriate distance metric that reflects the similarity between the data points, such as Euclidean distance, Manhattan distance, etc.
  6. Number of clusters selection: Determine the number of clusters that best represent the data. This can be done using techniques such as the elbow method, silhouette analysis, etc. (see the sketch after this list).
  7. Clustering: Apply the clustering algorithm to the data and obtain the cluster assignments.
  8. Clustering evaluation: Evaluate the clustering results using appropriate metrics such as accuracy, silhouette score, adjusted Rand index, etc.
  9. Cluster interpretation: Interpret the clusters and understand the patterns and characteristics of the data within each cluster.
  10. Refinement: Refine the clustering results, if necessary, by repeating steps 3–9 or using different clustering algorithms.
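
For example, the elbow method mentioned in step 6 can be sketched as follows. This is illustrative only: in this project k = 2 is fixed in advance by the Private/Public framing, and the variable names are assumptions.

```python
from sklearn.cluster import KMeans

# Numeric features only; the ground-truth label is held out.
X = df.drop('private', axis=1)

# Within-cluster sum of squares (inertia) for a range of k values;
# the "elbow" where the curve flattens suggests a reasonable k.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
```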

Note: The steps involved in EDA for clustering algorithms may vary depending on the specific data and use case, but these steps provide a general guideline for performing EDA in clustering algorithms.

[EDA figures omitted in this export: scatter plots and histograms of the features, including a histogram of grad_rate split by private.]
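
The histogram of graduation rates split by the private label can be sketched like this (the exact plot choices and aesthetics are assumptions):

```python
sns.set_style('darkgrid')

# Overlaid histograms of graduation rate, one color per private label.
g = sns.FacetGrid(df, hue='private', palette='coolwarm', height=6, aspect=2)
g.map(plt.hist, 'grad_rate', bins=20, alpha=0.7)
g.add_legend()
plt.show()
```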

Notice how there seems to be a private school with a graduation rate of higher than 100%. What is the name of that school?
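One way to look it up (a sketch):

```python
# Which school reports a graduation rate above 100%?
df[df['grad_rate'] > 100]
```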


Set that school’s graduation rate to 100 so it makes sense. You may get a warning (not an error) when doing this operation, so use DataFrame .loc operations or just re-do the histogram visualization to make sure the change actually went through.
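
A minimal sketch using .loc, which sidesteps the SettingWithCopyWarning you would get from chained indexing:

```python
# Cap the graduation rate at 100.
df.loc[df['grad_rate'] > 100, 'grad_rate'] = 100

# Verify the fix went through: this should now return an empty frame.
df[df['grad_rate'] > 100]
```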


K-Means Cluster Creation

Now it is time to create the cluster labels! Import KMeans from scikit-learn.
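
A sketch of the fit, with two clusters on the numeric features only (dropping the label column is an assumption consistent with the unsupervised framing):

```python
from sklearn.cluster import KMeans

# Two clusters; the private label is deliberately left out,
# since K-Means is unsupervised.
kmeans = KMeans(n_clusters=2)
kmeans.fit(df.drop('private', axis=1))
```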

KMeans(n_clusters=2)

What are the cluster center vectors?
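They live in the fitted estimator’s cluster_centers_ attribute:

```python
kmeans.cluster_centers_
```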

array([[1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
        5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
        4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
        7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
        6.50926756e+01],
       [1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
        7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
        4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
        9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
        6.75925926e+01]])

Evaluation

There is no perfect way to evaluate clustering if you don’t have the labels. However, since this is just an exercise, we do have the labels, so we can take advantage of them to evaluate our clusters. Keep in mind that you usually won’t have this luxury in the real world.

Create a new column for df called ‘Cluster’, which is 1 for a private school and 0 for a public school.
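
A sketch; the ‘Yes’/‘No’ encoding of the private column is an assumption:

```python
# Ground truth as an integer: 1 = private, 0 = public.
df['Cluster'] = df['private'].apply(lambda x: 1 if x == 'Yes' else 0)
```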

Create a confusion matrix and classification report to see how well the K-Means clustering worked without being given any labels.
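
A sketch of the evaluation, comparing the cluster assignments against the held-out labels:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

print(confusion_matrix(df['Cluster'], kmeans.labels_))
print(classification_report(df['Cluster'], kmeans.labels_))
print(accuracy_score(df['Cluster'], kmeans.labels_))
```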

[[138  74]
 [531  34]]

              precision    recall  f1-score   support

           0       0.21      0.65      0.31       212
           1       0.31      0.06      0.10       565

    accuracy                           0.22       777
   macro avg       0.26      0.36      0.21       777
weighted avg       0.29      0.22      0.16       777

0.22136422136422138

Scaling the Data
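
K-Means relies on Euclidean distance, so features on large scales (such as apps or expend) dominate the clustering, which helps explain the poor result above. A minimal sketch of standardizing the features and re-fitting; StandardScaler is one reasonable choice, and the variable names are assumptions:

```python
from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance, then re-fit.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(['private', 'Cluster'], axis=1))

kmeans_scaled = KMeans(n_clusters=2)
kmeans_scaled.fit(X_scaled)

print(kmeans_scaled.cluster_centers_)
print(accuracy_score(df['Cluster'], kmeans_scaled.labels_))
print(confusion_matrix(df['Cluster'], kmeans_scaled.labels_))
print(classification_report(df['Cluster'], kmeans_scaled.labels_))
```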

[[-0.32661962 -0.30530339 -0.25143507 -0.49913749 -0.50267293 -0.22105053
  -0.03848678 -0.45792911 -0.37283453 -0.12172587  0.04875211 -0.49062908
  -0.47906195  0.23583839 -0.30713916 -0.42464126 -0.36177249]
 [ 0.54548844  0.50988814  0.41992248  0.83361106  0.83951561  0.36917718
   0.06427689  0.76478882  0.6226721   0.20329475 -0.08142105  0.81940114
   0.80008284 -0.39387442  0.51295406  0.70919469  0.60419735]]

0.4774774774774775

[[146  66]
 [340 225]]

              precision    recall  f1-score   support

           0       0.30      0.69      0.42       212
           1       0.77      0.40      0.53       565

    accuracy                           0.48       777
   macro avg       0.54      0.54      0.47       777
weighted avg       0.64      0.48      0.50       777

Not so bad, considering the algorithm used nothing but the features to split the universities into two distinct groups, and scaling lifted the accuracy from about 22% to about 48%! Keep in mind that K-Means assigns arbitrary IDs (0 and 1) to its clusters, so the cluster numbering may simply be flipped relative to the ground-truth labels; the confusion matrix and report should be read with that permutation in mind. Hopefully, you can see how K-Means is useful for clustering unlabeled data!
