Less Known Applications of k-Means Clustering — Dimensionality Reduction, Anomaly Detection and Data Representation

Nitish Kumar Thakur
Published in Analytics Vidhya
Jul 9, 2021

k-Means is a data partitioning technique which is widely used for clustering. It has variants (like mini-batch k-Means) which are incredibly fast for large amounts of data, and its clustering results are easy to interpret. However, there are several applications of k-Means which are not talked about as often. They are:

  1. Non-Linear Dimensionality Reduction — it can be used to represent thousands of features using only a few “transformed” features, which can serve as engineered features in an ML pipeline or be used for data visualization.
  2. Multivariate Outlier/Anomaly Detection
  3. Data Representation (for input to other algorithms) — generally, k-Means is only able to detect spherical clusters. However, it is possible to feed the output of k-Means to a hierarchical clustering algorithm to detect non-convex clusters of complicated shapes.

We will discuss the above applications in detail, with examples and Python code. We will use scikit-learn. Let us load the data first.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, preprocessing, model_selection, pipeline, ensemble, tree, datasets, cluster
sns.set(style = 'white', font_scale = 1.4)
######## Load the Data
# Note: load_boston was removed in scikit-learn 1.2, so this snippet assumes an older version
boston = datasets.load_boston()
data = pd.DataFrame(boston.data, columns = boston.feature_names)
y = boston.target
features = ['CRIM', 'LSTAT', 'RM', 'AGE', 'INDUS', 'NOX', 'DIS']
data = data[features]
data.head()

Non-Linear Dimensionality Reduction using K-Means

The idea is to use k-Means to calculate the cluster centers, setting the number of clusters to the number of dimensions we want in our transformed data. For our example, since we want to reduce our data to 2 dimensions, we set the number of clusters to 2. The new features in the transformed space are the distances of each point from each cluster center.

Here is how we perform Non-Linear Dimensionality Reduction using k-Means:

  1. Fit k-Means to the data, setting the number of clusters to the number of dimensions we want to transform our data to — which is 2 in this case.
  2. Calculate the distance of each data point from each cluster center. This way, we get 2 features for each data point, namely (distance of point from cluster center 1, distance of point from cluster center 2). If we had k cluster centers, we would get k features per data point. These represent our transformed features. Luckily, the transform() method of the fitted KMeans object does this for us (a short sketch of what it computes follows this list).
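
For intuition, here is a minimal sketch of what the transform() step computes under the hood. The names km_fit and X_scaled are hypothetical placeholders for a fitted KMeans object and the scaled feature matrix, not objects defined elsewhere in this article:

### Sketch: distances to cluster centers, equivalent to km_fit.transform(X_scaled)
centers = km_fit.cluster_centers_  # shape: (n_clusters, n_features)
distances = np.linalg.norm(X_scaled[:, None, :] - centers[None, :, :], axis = 2)
# distances[i, j] is the distance of point i from cluster center j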

Let us consider 7 features from the Boston Housing Dataset. We will attempt to reduce them to 2 dimensions using K-Means for Visualization.

Here is what the data looks like:

7 Features of the Boston Housing Data

We will use scikit-learn pipelines to perform scaling and fit k-Means.

#### Scale and Fit KMeans to data as part of the pipeline ####
# X_train comes from the train/test split shown further below
kmeans = pipeline.make_pipeline(preprocessing.StandardScaler(), cluster.KMeans(n_clusters = 2)).fit(X_train)
# Store transformed dimensions - the distances of each point from the 2 cluster centers
lower_dim = pd.DataFrame(kmeans.transform(X_train), columns = ['Comp 1', 'Comp 2'])
lower_dim.plot.scatter(x = 'Comp 1', y = 'Comp 2', grid = True, figsize = (10, 7))
2 Reduced Dimensions — using K-Means on the Boston Housing Data

In the above diagram, Component 1 measures the distance of each point from Cluster Center #1, and Component 2 measures the distance of each point from Cluster Center #2. As we can see, there are some isolated points in the top right corner. As we will discuss later, points like these may be considered outliers.

Following are some ways of using the transformed features:

  1. Data Visualization: For data with hundreds of features, this allows us to visualize the data on a 2D screen. The data can be color-coded according to various quantities of interest (see the short sketch after this list).
  2. For Predictive Modeling: Here, we use the transformed features as inputs to a model. This is especially useful for linear models, because the features created by k-Means are non-linear transformations — and can help account for non-linear relations in the data.
  3. Anomaly Detection: The larger the distance of a point from the cluster centers, the more likely it is to be an anomaly.
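
As a quick example of the first point, here is a hedged sketch that color-codes the 2 reduced dimensions by the target variable. It assumes the train/test split shown further below has already been performed, so that y_train lines up row-for-row with lower_dim:

### Sketch: color the reduced dimensions by the target variable
plt.figure(figsize = (10, 7))
plt.scatter(x = lower_dim['Comp 1'], y = lower_dim['Comp 2'], c = y_train, cmap = 'viridis')
plt.colorbar(label = 'Target (median house value)')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid()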

Let us take a look at how to use it for Predictive Modeling. We use sklearn pipelines. Let us first use Linear Regression on the original data WITHOUT performing the transformation:

### Train Test Split
X_train, X_test, y_train, y_test = model_selection.train_test_split(data, y, test_size = .2, random_state = 10)
### Fit a Linear Model
model = linear_model.LinearRegression()
score = model_selection.cross_val_score(model, X_train, y_train, cv = 10, scoring = 'r2')
print(f'Average r2: {np.mean(score)}')
Average r2: 0.6534847470004668

Let us now perform the transformation using K-Means and train a linear model on the transformed features. We use sklearn pipelines and transform data to have 5 features:

km = pipeline.make_pipeline(preprocessing.StandardScaler(), cluster.KMeans(n_clusters = 5),
                            linear_model.LinearRegression())
score = model_selection.cross_val_score(km, X_train, y_train, cv = 10, scoring = 'r2')
print(f'Average r2: {np.mean(score)}')
Average r2: 0.6751640572278654

In this case, the model performance improved when using the 5 transformed features. In the general case, however, the number of transformed features must be found via cross-validation or decided based on domain/business expertise.
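
That search can be automated. The sketch below uses GridSearchCV over the KMeans step of the pipeline; the step name 'kmeans' is the default assigned by make_pipeline, and the candidate cluster counts are arbitrary choices for illustration:

### Sketch: choose the number of transformed features via cross-validation
param_grid = {'kmeans__n_clusters': [2, 3, 5, 8, 10]}
search = model_selection.GridSearchCV(km, param_grid, cv = 10, scoring = 'r2')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)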

Anomaly Detection using K-Means

The cluster centers identified by k-Means can be thought of as “representatives” of the data. The cluster centers, along with the intra-cluster variances, summarize the data. Anomaly detection using k-Means is based on the idea that points which are very far from all of the cluster centers are more anomalous. This makes sense because points that are far from the cluster centers are, on average, far from the rest of the data. This is a multivariate method of anomaly detection, unlike Tukey’s method or the z-score method, which are univariate.

Here is how we calculate the Anomaly score:

  1. Calculate the cluster centers using k-Means.
  2. Calculate the distance of each point from each cluster center using the transform method of the cluster.KMeans() object.
  3. Take the minimum of these distances for each point, i.e. calculate the distance of each point from its closest cluster center. This distance is the anomaly score of the point. A larger anomaly score means the point is more anomalous. It is also possible to treat the average distance of a point from all cluster centers as its anomaly score.

## Fit KMeans after standardizing data
km = pipeline.make_pipeline(preprocessing.StandardScaler(), cluster.KMeans(n_clusters = 5)).fit(X_train)
### Calculate distance of each point from each cluster center
transformed = pd.DataFrame(km.transform(X_train))
# Calculate the Anomaly score - the minimum distance of a point to any cluster center
anomaly_score = transformed.min(axis = 1)
# Plot the scores on the 2-dimensional representation computed earlier
plt.figure(figsize = (12, 7))
plt.scatter(x = lower_dim['Comp 1'], y = lower_dim['Comp 2'], c = anomaly_score, cmap = 'coolwarm')
plt.colorbar(label = 'Anomaly Score')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid()
As we see, the red points are the most anomalous — as correctly identified by this method.

The red points above are detected as anomalous because they have a higher anomaly score. This method, like many others, transforms the data into a new space and uses that representation to tag points as anomalous. For more analysis of anomaly detection methods like this, please check out my article:
https://medium.com/analytics-vidhya/anomaly-detection-in-python-part-1-basics-code-and-standard-algorithms-37d022cdbcff
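
To turn the score into hard labels, one simple option (an assumption on my part, not something prescribed above) is to threshold the anomaly score at a high quantile:

### Sketch: flag the top 5% of anomaly scores as anomalies (the 0.95 quantile is an arbitrary cutoff)
threshold = anomaly_score.quantile(0.95)
outliers = X_train[anomaly_score.values > threshold]
print(f'{len(outliers)} points flagged as anomalous')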

Another way of performing anomaly detection with k-Means is to select a large number of clusters — large enough that some clusters have very few data points. Clusters having very few data points can be suspected to be anomalous — and must be further analyzed.
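
Here is a hedged sketch of that idea; the choice of 50 clusters and the cluster-size cutoff of 2 are illustrative assumptions, not recommendations:

### Sketch: points that land in very small clusters are suspects
km_many = pipeline.make_pipeline(preprocessing.StandardScaler(), cluster.KMeans(n_clusters = 50)).fit(X_train)
labels = km_many.named_steps['kmeans'].labels_
cluster_sizes = pd.Series(labels).value_counts()
small_clusters = cluster_sizes[cluster_sizes <= 2].index
suspects = X_train[np.isin(labels, small_clusters)]
print(f'{len(suspects)} points fall in very small clusters')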

Input to other Algorithms

K-means can be used to build a “summarized” version of the data by representing it by the cluster centers. The cluster centers, in turn can be used as inputs to other algorithms. Following are some examples of how k-means cluster centers can be used as input to other algorithms:

1. K-Means detects “spherical” clusters. However, sometimes we need to detect non-convex clusters, which k-Means alone cannot properly detect. In these cases, we often use hierarchical clustering with single linkage, or other methods that are capable of detecting such clusters. However, methods like hierarchical clustering take a long time to complete — rendering them impractical for large amounts of data. In these cases, it is possible to first use k-Means to cluster the data into a large number of clusters and then feed the cluster centers to a hierarchical clustering algorithm (or any other clustering algorithm that can detect non-convex clusters). This significantly reduces the time taken to cluster the data while still being able to detect non-convex clusters (a rough sketch follows this list).

2. K-Means cluster centers can be used as the initialization for GMMs (Gaussian Mixture Models). GMMs can be used for clustering and to construct a generative model of the data (a sketch also follows below).
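
Here is a rough sketch of the first idea (k-Means summarization followed by hierarchical clustering). The make_moons data, the 100 intermediate clusters and the single-linkage choice are illustrative assumptions, not prescriptions:

### Sketch: k-Means summarization followed by single-linkage hierarchical clustering
X_moons, _ = datasets.make_moons(n_samples = 5000, noise = 0.05, random_state = 0)
km_stage1 = cluster.KMeans(n_clusters = 100, random_state = 0).fit(X_moons)
agg = cluster.AgglomerativeClustering(n_clusters = 2, linkage = 'single').fit(km_stage1.cluster_centers_)
# Assign each point the final (non-convex) cluster of its nearest k-Means center
final_labels = agg.labels_[km_stage1.labels_]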
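
And a minimal sketch of the second idea. GaussianMixture lives in sklearn.mixture, which is not among the imports above, and the cluster/component counts are assumptions:

### Sketch: initialize a Gaussian Mixture Model with k-Means cluster centers
from sklearn import mixture
X_scaled = preprocessing.StandardScaler().fit_transform(X_train)
km_init = cluster.KMeans(n_clusters = 3, random_state = 0).fit(X_scaled)
gmm = mixture.GaussianMixture(n_components = 3, means_init = km_init.cluster_centers_).fit(X_scaled)

Note that scikit-learn's GaussianMixture already uses a k-Means-based initialization by default (init_params = 'kmeans'); passing means_init simply makes the link explicit.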

Summary

To summarize, k-Means can be used for a variety of purposes. We can use it to perform dimensionality reduction, where each transformed feature is the distance of a point from a cluster center. When using it for anomaly detection, we measure the distance of each point from its closest cluster center; if this distance is high enough, we label the point an anomaly. Finally, we saw how cluster centers can be used as inputs to algorithms such as hierarchical clustering and GMMs. There is a lot more to k-Means in the realm of descriptive analytics and semi-supervised learning.

Hope you liked my article — please feel free to let me know if you have any feedback, and check out other articles on DS and ML written by me here.
