Effect of outliers on K-Means algorithm using Python

S Joel Franklin · Published in Analytics Vidhya · Nov 16, 2019 · 4 min read

K-Means clustering is an unsupervised learning algorithm that partitions n observations into k clusters, with each observation assigned to the cluster whose centroid is nearest. The algorithm minimizes the sum of squared Euclidean distances between each observation and the centroid of the cluster to which it belongs.
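In other words, the quantity being minimized is the within-cluster sum of squares. Here is a minimal sketch of that objective (an illustration, assuming 'points', 'labels' and 'centroids' are NumPy arrays, with labels[i] giving the cluster index of point i):

import numpy as np # numpy is imported.

def within_cluster_ss(points, labels, centroids): # The quantity K-Means minimizes.
    return np.sum((points - centroids[labels])**2) # Squared Euclidean distance from each point to its assigned centroid, summed over all points.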

But sometimes the K-Means algorithm does not give the best results, because it is sensitive to outliers. An outlier is a point that is markedly different from the rest of the data points. Let us look at one method for finding outliers in univariate (one-dimensional) data.

The lower quartile 'Q1' is the median of the first half of the data. The upper quartile 'Q3' is the median of the second half of the data. The interquartile range 'IQR' is the difference between Q3 and Q1. An outlier is a point greater than (Q3 + 1.5*IQR) or less than (Q1 - 1.5*IQR). The code given below can be used to find the outliers.

import numpy as np # numpy is imported.

Q1 = np.percentile(data, 25, interpolation='midpoint') # The lower quartile Q1 is calculated.
Q3 = np.percentile(data, 75, interpolation='midpoint') # The upper quartile Q3 is calculated.
IQR = Q3 - Q1 # The interquartile range is calculated.
Q3 + 1.5*IQR, Q1 - 1.5*IQR # The outlier range is calculated.
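To actually flag the outliers, the bounds can be applied as a mask. A small sketch, assuming 'data' is a one-dimensional NumPy array:

outlier_mask = (data > Q3 + 1.5*IQR) | (data < Q1 - 1.5*IQR) # True wherever a point lies outside the outlier range.
outliers = data[outlier_mask] # The flagged outliers.
data_clean = data[~outlier_mask] # The data with the outliers removed.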

We shall discuss methods for finding outliers in multivariate data in another article.

Let us take an example to understand how outliers affect the mean of the data, using Python.

import numpy as np # numpy is imported.
import matplotlib.pyplot as plt # matplotlib's pyplot is imported.

X = list(np.random.rand(100)) # 'X' is a list of 100 random numbers between 0 and 1.
Y = list(np.linspace(1,10,100)) # 'Y' is a list of 100 equally spaced numbers between 1 and 10.
plt.figure(figsize=(20,10)) # Size of figure is adjusted.
plt.xticks(fontsize=20) # Size of number labels on x-axis is adjusted.
plt.yticks(fontsize=20) # Size of number labels on y-axis is adjusted.
plt.xlabel('X Values',fontsize=20) # x-axis is labelled.
plt.ylabel('Y Values',fontsize=20) # y-axis is labelled.
mean_X = sum(X)/len(X) # 'mean_X' is the mean value of 'X'.
mean_Y = sum(Y)/len(Y) # 'mean_Y' is the mean value of 'Y'.
plt.plot(mean_X,mean_Y,'ro',markersize=10) # The mean point (mean_X, mean_Y) is plotted in red.
outlier = 1000 # An outlier of value 1000.
X.append(outlier) # The outlier is added to 'X'.
Y.append(Y[99] + Y[1] - Y[0]) # An extra number is added to 'Y' so that equal spacing still holds.
mean_X_new = sum(X)/len(X) # 'mean_X_new' is the new mean value of 'X'.
mean_Y_new = sum(Y)/len(Y) # 'mean_Y_new' is the new mean value of 'Y'.
plt.plot(mean_X_new,mean_Y_new,'go',markersize=10) # The new mean point (mean_X_new, mean_Y_new) is plotted in green.
plt.show() # The figure is displayed.

The red point is the mean of the data excluding the outlier. The green point is the mean of the data including the outlier.

We observe that the outlier increases the mean of 'X' by about 10 units (from roughly 0.5 to roughly 10.4). This is a significant increase considering that all the original values of 'X' lie between 0 and 1. This shows that the mean is strongly influenced by outliers.

Since the K-Means algorithm computes each cluster centroid as the mean of the points assigned to it, the algorithm is influenced by outliers. Let us take an example to understand how outliers affect the K-Means algorithm, using Python.

We have a 2-dimensional data set called 'cluster' consisting of 3000 points with no outliers. We get the following scatter plot after the K-Means algorithm is applied.
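The 'cluster' data set itself is not reproduced here, but a comparable setup can be sketched with scikit-learn. The four Gaussian blobs and all parameter values below are assumptions for illustration:

import numpy as np # numpy is imported.
import matplotlib.pyplot as plt # pyplot is imported.
from sklearn.datasets import make_blobs # Generates synthetic Gaussian clusters.
from sklearn.cluster import KMeans # The K-Means implementation.

points, _ = make_blobs(n_samples=3000, centers=4, random_state=0) # 3000 two-dimensional points in 4 groups, a stand-in for the 'cluster' data set.
labels = KMeans(n_clusters=4, random_state=0).fit_predict(points) # Each point is assigned to the nearest of 4 centroids.
plt.scatter(points[:,0], points[:,1], c=labels, s=5) # Points are coloured by their assigned cluster.
plt.show() # The scatter plot is displayed.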

Now we add 60 outliers to the 'cluster' data set; the outliers are about 2 percent of the non-outlier points. We get the following scatter plots for different outlier values after the K-Means algorithm is applied.

The outliers are not shown in the scatter plot; only the 3000 non-outlier points are shown, for the sake of better visualisation. The outliers form a separate cluster, represented by centroid number 3.
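A sketch of the same experiment, continuing from the code above (the outlier coordinates are invented for illustration; the original article's outlier values are not specified):

outliers = np.random.uniform(50, 100, size=(60, 2)) # 60 far-away points, about 2 percent of the 3000 non-outliers.
points_all = np.vstack([points, outliers]) # The outliers are appended to the data set.
labels_all = KMeans(n_clusters=4, random_state=0).fit_predict(points_all) # K-Means is re-run with the same number of clusters.
plt.scatter(points[:,0], points[:,1], c=labels_all[:3000], s=5) # Only the non-outlier points are shown, coloured by their new assignment.
plt.show() # The scatter plot is displayed.

With k unchanged, the distant points typically capture a centroid of their own, forcing the remaining centroids to cover all of the original blobs.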

We observe that the outliers show up as a separate cluster and also cause other clusters to merge, which suggests that the clustering was not effective when outliers were included in the data set.

Even though the outliers were only about 2 percent of the non-outlier points, a proportion common in real-world data sets, they had a significant impact on the clustering. Hence it is better to identify and remove outliers before applying the K-Means clustering algorithm. We will look at ways of identifying and removing outliers from data sets in subsequent articles.
