K-means Clustering Algorithm

Jared
11 min read · Dec 21, 2018


Hello World! It’s Jared here again with another blog post. This time around, I am writing about the k-means clustering algorithm.

Basic Overview:

The goal of the k-means algorithm is to find groups in the data, with the number of groups represented by the variable k. The algorithm works iteratively to assign each data point in the data set to one of k groups based on the features that are provided, meaning that the data points are clustered around their respective centroids. Each centroid of a cluster is a collection of feature values which define the resulting group, and examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents. In layman’s terms, the centroids can be used to label new data, hence providing the ability to classify or discover groups that weren’t obvious beforehand. The k-means algorithm works differently than many other machine learning algorithms because, unlike its counterparts, it is an unsupervised learning algorithm.

Unsupervised learning algorithms learn from data that has not been labeled, classified, or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities within the data set. K-means clustering allows us to find and analyze the groups that have formed organically in the data set. Since most data isn’t labeled, this algorithm is well suited for identifying commonalities and providing classifications for the data scientist to visualize.

Advantages and Disadvantages of k-means clustering:

Some of the advantages of k-means clustering include: it is a widely used method for cluster analysis; it is easy to understand; and it trains quickly.

Some of the disadvantages of k-means clustering include: the Euclidean distance is not ideal in many applications; performance is not competitive with the best clustering methods; small variations in the data can result in completely different clusters (high variance); and clusters are assumed to be spherical and evenly sized.

How does the k-means clustering algorithm work?

The k-means clustering algorithm works by iterative refinement to produce its final result. First, the data scientist must choose a good value for k, since a poorly chosen k will produce a poor classification. There are different methods for finding the optimal value for k, which we will discuss later. After choosing the value for k, we plot k centroids at random positions on the graph. Let’s call the next step the data assignment step. During the data assignment step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. The expression dist(c[i], x)² below represents the squared Euclidean distance from centroid c[i] to a point x in the array X, which holds all the points in our data set. We compute the Euclidean distance from each point to each centroid on our graph, but we only want the index of the minimum distance, which is why the expression is preceded by argmin.
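Written out, the cluster assigned to each point x is the index of the nearest centroid:

$$\underset{i}{\operatorname{argmin}} \; \operatorname{dist}(c_i, x)^2$$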

Squared Euclidean Distance Formula represented by dist

After the data assignment step, we transition to the centroid update step. During the centroid update step, the centroids are recomputed by taking the mean of all data points assigned to each centroid. We then assign these mean values as the new locations of their respective centroids.
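Written out, each centroid c_i moves to the mean of the set S_i of points currently assigned to it:

$$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$$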

The mean of all points in a cluster is assigned to the centroid of that cluster

We continue to loop through the data assignment step and the centroid update step until there is no change in the clusters, the sum of distances is minimized, or some maximum number of iterations is reached. The algorithm is guaranteed to converge to a result, even though the result may not necessarily be the best possible outcome. This could be because of a non-optimal data set or because we chose the wrong value for k.

Demonstration of k-means clustering through each iteration

Choosing the optimal value for k and the “elbow point”

As mentioned earlier, as data scientists we need to be able to determine the optimal value for k. Increasing k (the number of centroids) will always reduce the distance to data points, but if our value for k is too high then we risk having too many clusters for our particular data set. For example, we could have 3 obvious clusters in our data set, but if k = 5 then we would have 5 clusters for a data set that realistically only has 3. This would ruin the classification and lead us to the wrong conclusions about our data. We could also go to the extreme, where the mean distance equals zero, if our value for k is the same as the number of data points. Instead, the mean distance to the centroid is plotted as a function of k, and the “elbow point,” where the rate of decrease sharply shifts, can be used to roughly determine k.

Elbow method (clustering) and the “elbow point”

The elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a data set. Refer to the graph above. The y-axis is the sum of squared distances from each point to the centroid of its cluster (the sum of squared errors), and the x-axis is the number of clusters k. If we plot this sum for each value of k, we get a graph like the one above.

Our goal is to choose a small value of k that has a low sum of squared errors, right before larger values of k start to have diminishing returns. This usually happens at the “elbow point” of the graph, where the angle change reflects a considerable change in the sum of squared errors. In other words, if we were to decrease the value of k from 3, the sum of squared errors would grow substantially. And if we were to increase the value of k beyond 3, the sum of squared errors would keep decreasing, but with diminishing returns. We could go further and set k = 10 to get the lowest sum of squared errors of all, but it would be computationally inefficient: we would be tracking 10 centroids for only a small improvement in error.
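The implementation later in this post sticks to NumPy, but as a quick sketch, the elbow curve can be drawn with scikit-learn (an assumed extra dependency, not used elsewhere in this post), whose KMeans exposes the sum of squared distances as inertia_. This assumes X is the 2D array of points we build later in the post:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of k values and record the sum of squared
# distances from each point to its nearest centroid (the "inertia")
sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

# Plot SSE against k and look for the "elbow" where the curve flattens
plt.plot(k_values, sse, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Sum of squared errors')
plt.show()
```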

The elbow method doesn’t always work well, especially if the data is not very clustered. Fret not, because there are a number of other techniques for validating k, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm.

How to code the k-means algorithm in Python?

In order to visualize the scatter plots we will need to use a Jupyter notebook, so if you don’t have it installed, please google it and download it. We will also be using the NumPy, Pandas, and Matplotlib libraries to write this code.

Setting up libraries and reading data from .csv file

The first thing we do is import the libraries we will be using. We import the NumPy library, which gives us support for mathematical functions that operate on multidimensional arrays and matrices. Then we import the Pandas library for data manipulation and analysis. And last but not least, we import matplotlib.pyplot for interactive plots and programmatic plot generation. %matplotlib inline is a magic function in IPython that sets the backend of matplotlib to the ‘inline’ backend. This magic function only works in IPython, which is why we are using a Jupyter notebook.

Then we simply read the data from our csv file with pd.read_csv('xclara.csv') and assign the result to our variable data. Later we can manipulate the data variable and create arrays for the V1 and V2 data points.
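Reconstructed as plain code (a sketch of the screenshot below, assuming xclara.csv sits next to the notebook), this step looks like:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Read the two-column data set (V1, V2) into a DataFrame
data = pd.read_csv('xclara.csv')
data.head()
```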

Importing libraries and reading dataset from .csv file

Creating an Array of Lists for our data and plotting the points

In this next part of our code, all we have to do is move the V1 and V2 values into our f1 and f2 variables respectively. f1 = data['V1'].values returns a NumPy array of all the values in that column, without the V1 label. Then we zip the f1 and f2 features together and create a NumPy array of lists (i.e. a 2D array). We will use this array of lists later to calculate the Euclidean distance from each centroid to every point. We print(X) to give us an idea of how our points are stored in the array.
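Roughly, that translates to the following sketch (including the plotting call discussed just after):

```python
# Pull the V1 and V2 columns out of the DataFrame as NumPy arrays
f1 = data['V1'].values
f2 = data['V2'].values

# Zip the two feature arrays into a single 2D array of (x, y) points
X = np.array(list(zip(f1, f2)))
print(X)

# Plot every point in the data set as a small black dot
plt.scatter(f1, f2, color='black', s=7)
```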

Creating the array of lists to hold all of our points and plotting the points; also shows the output of print(X)

The plt.scatter function takes the array of x-values and the array of y-values and creates a scatter plot of all the points. We can also set the color and size that we want the points to be. Below is the result of the statement plt.scatter(f1, f2, color = 'black', s = 7), which plots all the points in our data set, in size 7 and color black.

Scatter plot of all the values in our data set

Euclidean Distance Function and Initializing our Centroids

As you can see from the scatter plot above, we have 3 organic clusters of data points, so we will set our value of k to k = 3. This won’t always be the case, because our data won’t always be this organized, and we may be looking to classify multiple features that could be morphed into one cluster. But to keep things simple we will set it to 3.

We also define the distance function def dist(a, b, ax = 1):. This function takes two vectors (or arrays of vectors) as parameters and calculates the Euclidean distance using np.linalg.norm(a - b, axis = ax). The norm function is part of NumPy’s linear algebra module; it takes a vector as input and returns a scalar that can be interpreted as the “size”, “length”, or “magnitude” of that vector (with axis = 1 it does this row by row). In our case we pass the difference of two vectors, a - b, as the argument to find the Euclidean distance between two points. This is also known as the L² norm, but the norm function can also compute the max norm, the Manhattan norm, etc. For our purposes we only need the L² norm (Euclidean distance).

After we define our distance function, we use np.random.randint(0, np.max(X) - 20, size = k) to create random numbers for the centroids’ x and y values respectively. The prototype for this function is numpy.random.randint(low, high, size, dtype), which means that the call in our code creates random numbers from 0 to np.max(X) - 20 (the max number in our X array of lists, minus 20). Since size = k, the function creates k random numbers and returns them as an array, so C_x and C_y are both arrays of 3 random integers. In the next step we zip the two arrays together and store them in C as an array of lists (i.e. a 2D array). Then we scatter plot the points in our data set again and plot the randomly generated centroids (the green stars) as well.
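A sketch of this step (the star marker and colors are assumptions about the original screenshot):

```python
# Number of clusters, chosen from the scatter plot above
k = 3

# Euclidean (L2) distance, computed row by row along axis 1 by default
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Random x and y coordinates for the k initial centroids
C_x = np.random.randint(0, np.max(X) - 20, size=k)
C_y = np.random.randint(0, np.max(X) - 20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)

# Re-plot the data points, plus the random centroids as green stars
plt.scatter(f1, f2, color='black', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, color='g')
```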

Scatter plot of all the points in the dataset, and the random position for k number of centroids

Data Assignment step, Centroid update step, and finalizing the algorithm

This last part of our code is where we build the data assignment step and the centroid update step to finalize the algorithm. First we create the C_old array, an array of zeros with the same shape as C. Then we initialize another array, clusters, which is filled with zeros and has the same length as X, the number of points in our data set. We then calculate the error by finding the distance from C_old to C.

We enter the while loop, which will run until the error we calculated earlier equals 0. Inside it we enter the for loop for i in range(len(X)), which runs through the length of the X array. In the for loop we calculate the distance from the point at X[i] to every centroid in the C array and store it as an array of length 3 in distance. For example, distance[0] contains the distance from X[i] to centroid C[0]. The next line, min_distance = np.argmin(distance), returns the index of the minimum distance within the distance array. Then we store that value in clusters[i] so that later we know which values in the X array to compute the mean over.

Then we copy the values of the current centroids in C into the C_old array, and recompute the new values of the centroids to store in C. We create another for loop, for i in range(k), that calculates the new position for each centroid. We use a list comprehension to return a list of the points assigned to each cluster, with the following line of code: points = [X[j] for j in range(len(X)) if (clusters[j] == i)]. Then we calculate the mean of those points using np.mean(points, axis = 0) and assign that mean point (a vector of coordinates, not a scalar) to the centroid C[i]. After we calculate the values of all the new centroids, we compute the error again between C_old and C. If there is still an error, we continue through the while loop until the error equals 0 (i.e. the centroids stop moving).
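Putting the two steps together, a minimal sketch of the loop described above (using np.copy where the original may have used deepcopy):

```python
# Previous centroid positions, initialized to zeros (same shape as C)
C_old = np.zeros(C.shape)
# clusters[i] will hold the index of the centroid nearest to X[i]
clusters = np.zeros(len(X))
# Error: how far the centroids moved since the last update
error = dist(C, C_old, None)  # ax=None collapses to a single scalar

while error != 0:
    # Data assignment step: assign every point to its nearest centroid
    for i in range(len(X)):
        distance = dist(X[i], C)            # distances to all k centroids
        min_distance = np.argmin(distance)  # index of the nearest one
        clusters[i] = min_distance
    # Remember the current centroids before moving them
    C_old = np.copy(C)
    # Centroid update step: move each centroid to the mean of its points
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)
```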

After we finish finding the best positions for the centroids and the error equals 0, we simply graph each centroid and its assigned points. We also color code the clusters so that we can view our results better. Below is a screenshot of the final results of the k-means clustering algorithm.
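A sketch of that final plotting step (the color list is an assumption for k = 3):

```python
# Color code each cluster and mark the final centroids with black stars
colors = ['r', 'g', 'b']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='black')
```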

Final results of k-means clustering

Feature Engineering

Feature engineering is fundamental to the application of machine learning. A feature is an attribute or property shared by all of the independent units on which analysis or prediction is to be done. Any attribute can be a feature, as long as it is useful to the model. Feature engineering is the process of using domain knowledge to choose which data metrics to input as features into a machine learning algorithm.

Feature engineering plays a key role in k-means clustering because using meaningful features that capture the variability of the data is essential for the algorithm to find all the organic groups within the data set.

Conclusion

Thank you for reading this post. I hope you learned as much as I did about the k-means clustering algorithm, as I am still learning these topics as well. If there is anything that I could improve on or if there is an error in the post, please let me know. Thank you!

