K-Means Clustering Algorithm for Unsupervised Learning Tasks

K-Means Clustering Theory and Implementation with Scikit-Learn

Fares Sayah
Feb 6, 2023

K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means discovers groups in data without having first been trained on labeled examples. Once the algorithm has been run and the groups are defined, any new data point can easily be assigned to the most relevant group.

For this project, we will attempt to use K-Means clustering to cluster universities into two groups: Private and Public.

It is very important to note that we actually have the labels for this data set, but we will NOT use them when fitting K-Means, since it is an unsupervised learning algorithm.

Under normal circumstances, you use K-Means precisely because you don’t have labels. In this case, we will use the labels only to get an idea of how well the algorithm performed, but you won’t usually be able to do this for K-Means, so the classification report and confusion matrix at the end of this project don’t truly make sense in a real-world setting.

Types of Clustering

Clustering algorithms are unsupervised machine learning methods used to group similar data points into clusters based on their features. Here are a few common types of clustering algorithms, along with their benefits and use cases:

  1. K-Means Clustering: It is a simple and fast clustering algorithm, which partitions a dataset into k clusters. It is best suited for spherical-shaped clusters and continuous variables. It is widely used in image and pattern recognition, market segmentation, and customer profiling.
  2. Hierarchical Clustering: It is an iterative algorithm that builds a hierarchy of clusters, starting from individual data points and merging them into larger clusters. It is best suited for non-spherical-shaped clusters and data with a clear hierarchy; because it typically scales quadratically or worse with the number of points, it is less practical for very large datasets.
  3. Density-Based Clustering: This type of clustering identifies clusters based on the density of data points. It is best suited for clusters with varying shapes and sizes, and for datasets with noise and outliers. Examples of algorithms include DBSCAN and HDBSCAN.
  4. Gaussian Mixture Model (GMM) Clustering: It is a probabilistic algorithm that models data as a mixture of Gaussian distributions. It is best suited for datasets with complex distributions and non-spherical-shaped clusters.
  5. Spectral Clustering: It is a graph-based clustering algorithm that uses the eigenvectors of a similarity matrix to identify clusters. It is best suited for datasets with non-linearly separable clusters, although the eigendecomposition makes it expensive on very large datasets.

Each type of clustering algorithm has its strengths and weaknesses, and the choice of algorithm will depend on the specific requirements of the problem being solved and the characteristics of the data.

Import Libraries

Import the libraries you usually use for data analysis.
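A minimal sketch of a typical setup follows; the file name College_Data.csv and the use of the school name as the DataFrame index are assumptions based on the output below (777 rows, 18 columns).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the college data; the file name and index column are assumptions.
df = pd.read_csv('College_Data.csv', index_col=0)
df.info()
```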

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   private      777 non-null    object
 1   apps         777 non-null    int64
 2   accept       777 non-null    int64
 3   enroll       777 non-null    int64
 4   top10perc    777 non-null    int64
 5   top25perc    777 non-null    int64
 6   f_undergrad  777 non-null    int64
 7   p_undergrad  777 non-null    int64
 8   outstate     777 non-null    int64
 9   room_board   777 non-null    int64
 10  books        777 non-null    int64
 11  personal     777 non-null    int64
 12  phd          777 non-null    int64
 13  terminal     777 non-null    int64
 14  s_f_ratio    777 non-null    float64
 15  perc_alumni  777 non-null    int64
 16  expend       777 non-null    int64
 17  grad_rate    777 non-null    int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an important step in the process of using clustering algorithms for data analysis. The steps involved in EDA for clustering algorithms are:

  1. Data collection and preparation: Collect the relevant data and preprocess it to remove missing values, outliers, and irrelevant features.
  2. Data visualization: Plot the data using different visualization techniques such as histograms, scatter plots, density plots, etc., to understand the distribution of the data and identify patterns or anomalies.
  3. Dimensionality reduction: Reduce the dimensionality of the data, if necessary, to avoid the curse of dimensionality, which can negatively impact the clustering performance. Techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) can be used for dimensionality reduction.
  4. Data normalization: Normalize the data to ensure that features are on the same scale, as clustering algorithms are sensitive to the scale of the features.
  5. Distance metric selection: Select an appropriate distance metric that reflects the similarity between the data points, such as Euclidean distance, Manhattan distance, etc.
  6. Number of clusters selection: Determine the number of clusters that best represent the data. This can be done using techniques such as the elbow method, silhouette analysis, etc. (see the sketch after this list).
  7. Clustering: Apply the clustering algorithm to the data and obtain the cluster assignments.
  8. Clustering evaluation: Evaluate the clustering results using appropriate metrics such as accuracy, silhouette score, adjusted Rand index, etc.
  9. Cluster interpretation: Interpret the clusters and understand the patterns and characteristics of the data within each cluster.
  10. Refinement: Refine the clustering results, if necessary, by repeating steps 3–9 or using different clustering algorithms.
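
For example, the elbow method mentioned in step 6 can be sketched as follows. This is illustrative only: in this project k = 2 is fixed in advance by the Private/Public framing, and the variable names are assumptions.

```python
from sklearn.cluster import KMeans

# Numeric features only; the ground-truth label is held out.
X = df.drop('private', axis=1)

# Within-cluster sum of squares (inertia) for a range of k values;
# the "elbow" where the curve flattens suggests a reasonable k.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
```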

Note: The steps involved in EDA for clustering algorithms may vary depending on the specific data and use case, but these steps provide a general guideline for performing EDA in clustering algorithms.

[EDA figures omitted in this export: scatter plots and histograms of the features, including a histogram of grad_rate split by private.]
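
The histogram of graduation rates split by the private label can be sketched like this (the exact plot choices and aesthetics are assumptions):

```python
sns.set_style('darkgrid')

# Overlaid histograms of graduation rate, one color per private label.
g = sns.FacetGrid(df, hue='private', palette='coolwarm', height=6, aspect=2)
g.map(plt.hist, 'grad_rate', bins=20, alpha=0.7)
g.add_legend()
plt.show()
```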

Notice how there seems to be a private school with a graduation rate of higher than 100%. What is the name of that school?
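One way to look it up (a sketch):

```python
# Which school reports a graduation rate above 100%?
df[df['grad_rate'] > 100]
```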


Set that school’s graduation rate to 100 so it makes sense. You may get a warning (not an error) when doing this operation, so use DataFrame .loc operations or just re-do the histogram visualization to make sure the change actually went through.
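
A minimal sketch using .loc, which sidesteps the SettingWithCopyWarning you would get from chained indexing:

```python
# Cap the graduation rate at 100.
df.loc[df['grad_rate'] > 100, 'grad_rate'] = 100

# Verify the fix went through: this should now return an empty frame.
df[df['grad_rate'] > 100]
```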


K-Means Cluster Creation

Now it is time to create the cluster labels! Import KMeans from scikit-learn.
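
A sketch of the fit, with two clusters on the numeric features only (dropping the label column is an assumption consistent with the unsupervised framing):

```python
from sklearn.cluster import KMeans

# Two clusters; the private label is deliberately left out,
# since K-Means is unsupervised.
kmeans = KMeans(n_clusters=2)
kmeans.fit(df.drop('private', axis=1))
```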

KMeans(n_clusters=2)

What are the cluster center vectors?
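They live in the fitted estimator’s cluster_centers_ attribute:

```python
kmeans.cluster_centers_
```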

array([[1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
        5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
        4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
        7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
        6.50926756e+01],
       [1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
        7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
        4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
        9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
        6.75925926e+01]])

Evaluation

There is no perfect way to evaluate clustering if you don’t have the labels. However, since this is just an exercise, we do have the labels, so we can take advantage of them to evaluate our clusters. Keep in mind that you usually won’t have this luxury in the real world.

Create a new column for df called ‘Cluster’, which is 1 for a private school and 0 for a public school.
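
A sketch; the ‘Yes’/‘No’ encoding of the private column is an assumption:

```python
# Ground truth as an integer: 1 = private, 0 = public.
df['Cluster'] = df['private'].apply(lambda x: 1 if x == 'Yes' else 0)
```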

Create a confusion matrix and classification report to see how well the K-Means clustering worked without being given any labels.
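
A sketch of the evaluation, comparing the cluster assignments against the held-out labels:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

print(confusion_matrix(df['Cluster'], kmeans.labels_))
print(classification_report(df['Cluster'], kmeans.labels_))
print(accuracy_score(df['Cluster'], kmeans.labels_))
```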

[[138  74]
 [531  34]]

              precision    recall  f1-score   support

           0       0.21      0.65      0.31       212
           1       0.31      0.06      0.10       565

    accuracy                           0.22       777
   macro avg       0.26      0.36      0.21       777
weighted avg       0.29      0.22      0.16       777

0.22136422136422138

Scaling the Data
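
K-Means relies on Euclidean distance, so features on large scales (such as apps or expend) dominate the clustering, which helps explain the poor result above. A minimal sketch of standardizing the features and re-fitting; StandardScaler is one reasonable choice, and the variable names are assumptions:

```python
from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance, then re-fit.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(['private', 'Cluster'], axis=1))

kmeans_scaled = KMeans(n_clusters=2)
kmeans_scaled.fit(X_scaled)

print(kmeans_scaled.cluster_centers_)
print(accuracy_score(df['Cluster'], kmeans_scaled.labels_))
print(confusion_matrix(df['Cluster'], kmeans_scaled.labels_))
print(classification_report(df['Cluster'], kmeans_scaled.labels_))
```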

[[-0.32661962 -0.30530339 -0.25143507 -0.49913749 -0.50267293 -0.22105053
  -0.03848678 -0.45792911 -0.37283453 -0.12172587  0.04875211 -0.49062908
  -0.47906195  0.23583839 -0.30713916 -0.42464126 -0.36177249]
 [ 0.54548844  0.50988814  0.41992248  0.83361106  0.83951561  0.36917718
   0.06427689  0.76478882  0.6226721   0.20329475 -0.08142105  0.81940114
   0.80008284 -0.39387442  0.51295406  0.70919469  0.60419735]]

0.4774774774774775

[[146  66]
 [340 225]]

              precision    recall  f1-score   support

           0       0.30      0.69      0.42       212
           1       0.77      0.40      0.53       565

    accuracy                           0.48       777
   macro avg       0.54      0.54      0.47       777
weighted avg       0.64      0.48      0.50       777

Not so bad, considering the algorithm used nothing but the features to split the universities into two distinct groups, and scaling lifted the accuracy from about 22% to about 48%! Keep in mind that K-Means assigns arbitrary IDs (0 and 1) to its clusters, so the cluster numbering may simply be flipped relative to the ground-truth labels; the confusion matrix and report should be read with that permutation in mind. Hopefully, you can see how K-Means is useful for clustering unlabeled data!
