5 Stages of Data Preprocessing for K-means clustering

Evgeniy Ryzhkov
3 min read · Jul 23, 2020


K-means clustering

Data Preprocessing (or Data Preparation) is a data mining technique that transforms raw data into a format that ML algorithms can understand. Real-world data is usually noisy (it contains errors, outliers, and duplicates), incomplete (some values are missing), and may be stored in different places and in different formats. The task of Data Preprocessing is to handle these issues.

In a common ML pipeline, the Data Preprocessing stage sits between the Data Collection stage and the Training / Tuning Model stage.

ML Pipeline

Importance of Data Preprocessing stage

  1. Different ML models require different input data (numerical data, images in a specific format, etc.). Without the right data, nothing will work.
  2. Because of “bad” data, ML models will not give any useful results, or may even give wrong answers that lead to wrong decisions (the GIGO principle: garbage in, garbage out).
  3. The higher the quality of the data, the less data is needed.

Note. Nowadays the Preprocessing stage is the most laborious step; it may take 60–80% of an ML Engineer’s effort.

Before starting data preparation, it is recommended to determine what input data requirements the ML algorithm imposes for producing quality results. In this article we consider the K-means clustering algorithm.

K-means input data requirements:

  • Numerical variables only. K-means uses distance-based measures to determine the similarity between data points. If you have categorical data, use K-modes clustering; if the data is mixed, use K-prototypes clustering.
  • Data has no noise or outliers. K-means is very sensitive to outliers and noisy data.
  • Data has a symmetric distribution of variables (it isn’t skewed). Real data always contains outliers and noise, and it is difficult to get rid of them entirely. Transforming the data toward a normal distribution helps to reduce the impact of these issues and makes it much easier for the algorithm to identify clusters.
  • Variables on the same scale: either standardized (zero mean, unit variance) or normalized to the range 0.0 to 1.0. For the ML algorithm to consider all attributes as equally important, they must all be on the same scale.
  • There is no collinearity (a high level of correlation between two variables). Correlated variables are not useful for ML segmentation algorithms because they represent the same characteristic of a segment; they are therefore nothing but noise.
  • A small number of dimensions. As the number of dimensions (variables) increases, a distance-based similarity measure converges to a constant value between any given pair of examples. The more variables there are, the harder it is to find clear differences between instances.

Note: What exactly does a small number mean? It’s an open question for me. If you know the answer, please let me know. For now, I stick to the rule that the fewer, the better, plus validation of the results.
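To make the scaling requirement above concrete, here is a minimal sketch (the two-feature dataset is made up for illustration) showing how standardization puts variables measured on very different scales on an equal footing before clustering:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy dataset: income (dollars) and age (years) live on very different scales
X = np.array([[25_000, 23],
              [48_000, 35],
              [52_000, 37],
              [90_000, 52]], dtype=float)

# Without scaling, any distance measure is dominated by the income column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization every column has mean 0 and unit variance,
# so K-means treats both attributes as equally important
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```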

Besides the requirements above, there are a few fundamental model assumptions:

  • the variance of the distribution of each attribute (variable) is spherical (or in other words, the boundaries between k-means clusters are linear);
  • all variables have the same variance;
  • each cluster has roughly equal numbers of observations.

These assumptions are beyond the data preprocessing stage. There is no way to validate them before getting model results.
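What can be done is a post-hoc sanity check: fit the model and inspect cluster quality, for example with the silhouette score from scikit-learn (the synthetic blob data below is only an illustration of well-behaved, roughly spherical clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with roughly spherical, equal-variance clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# A silhouette score close to 1 suggests well-separated, compact clusters,
# i.e. the model assumptions were reasonable for this data
print(round(silhouette_score(X, labels), 2))
```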

Stages of Data preprocessing for K-means Clustering

  1. Data Cleaning
  • Removing duplicates
  • Removing irrelevant observations and errors
  • Removing unnecessary columns
  • Handling inconsistent data
  • Handling outliers and noise

  2. Handling Missing Data

  3. Data Integration

  4. Data Transformation

  • Feature Construction
  • Handling skewness
  • Data Scaling

  5. Data Reduction

  • Removing dependent (highly correlated) variables
  • Feature selection
  • PCA

Data Cleaning
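A minimal pandas sketch of this stage, assuming a hypothetical customer table with a duplicate row and an extreme outlier (the IQR rule used here is just one common heuristic for outlier handling):

```python
import pandas as pd

# Hypothetical customer dataset with a duplicate row and an extreme outlier
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "annual_income": [40_000, 52_000, 52_000, 48_000, 2_000_000],
})

# Removing duplicates
df = df.drop_duplicates()

# Handling outliers with the 1.5 * IQR rule
q1, q3 = df["annual_income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["annual_income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]

print(df)  # the duplicate and the 2,000,000 outlier are gone
```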

Handling Missing Data
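A short sketch of two common options, dropping rows versus imputing with a simple statistic such as the median (the dataset and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [23, np.nan, 37, 52],
                   "income": [40_000, 52_000, np.nan, 90_000]})

# Option 1: drop rows with missing values (simple, but loses data)
dropped = df.dropna()

# Option 2: impute each missing value with the column median
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled)  # no NaN values remain
```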

Data Integration

Combining data from different sources into a single, unified view.
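A minimal sketch with pandas, assuming two hypothetical tables keyed by the same customer_id:

```python
import pandas as pd

# Two hypothetical sources describing the same customers
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "age": [23, 35, 52]})
orders = pd.DataFrame({"customer_id": [1, 2, 3],
                       "total_spent": [120.0, 340.5, 89.9]})

# Merge into a single, unified view keyed on customer_id
unified = profiles.merge(orders, on="customer_id", how="inner")
print(unified)
```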

Data Transformation
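A minimal sketch of two transformation steps from the list above, using a hypothetical right-skewed income column: a log transform to reduce skewness, then standardization for scaling:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical right-skewed variable (e.g. income with a long right tail)
df = pd.DataFrame({"income": [20_000, 25_000, 30_000, 45_000, 400_000]})

# Handling skewness: a log transform pulls in the long right tail
df["income_log"] = np.log1p(df["income"])

# Data scaling: standardize so the variable has mean 0 and unit variance
df["income_scaled"] = StandardScaler().fit_transform(df[["income_log"]])

print(df[["income_log", "income_scaled"]].round(2))
```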

Data Reduction
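A sketch of both reduction approaches on synthetic data: dropping one variable of each highly correlated pair, and PCA as an alternative (the 0.9 correlation threshold is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "c": rng.normal(size=200),
})

# Removing dependent variables: drop one of each pair with |r| > 0.9,
# scanning only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

# PCA as an alternative: project onto fewer uncorrelated components
components = PCA(n_components=2).fit_transform(df)
print(to_drop, components.shape)
```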


Evgeniy Ryzhkov

EdTech Product Lead | Head of Innovation Combining technical expertise, strategic leadership, and an entrepreneurial mindset to reshape the future of education.