K-Means Clustering Algorithm for Machine Learning

Part 2 of a Series on Introductory Machine Learning Algorithms

Madison Schott
Capital One Tech
5 min read · Apr 23, 2019



In our first post, we covered k-nearest neighbor. Today we’ll cover k-means clustering.

Introduction

K-means clustering is another basic technique often used in machine learning. While machine learning is often thought of as a fairly new concept, the fundamentals have been around for much longer than many would expect. The k-means clustering algorithm, for instance, has carried that name since 1967, when a researcher named James MacQueen introduced the term. Unlike many other machine learning techniques, k-means is used on unlabeled numerical data rather than data that has already been labeled, making it a type of unsupervised learning. It is one of the most popular unsupervised learning techniques thanks to its simplicity and efficiency, which helps us data scientists out when we don’t have the most organized data set.

The k-means clustering algorithm assigns data points to categories, or clusters, by placing each point in the cluster whose mean (its centroid) is closest. It then repeats this assignment, recomputing the means on every pass, so the clusters fit the data better each time. Since you must first decide how many categories, k, to split your data into, it is essential that you understand your data well enough to do this.

Pros:

  • Fast and efficient.
  • Works on unlabeled numerical data.
  • Iterative technique.

Cons:

  • Must understand the context of your data well.
  • Have to choose your own k value.
  • Lots of repetition.
  • Does not perform well when outliers are present.
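The last point comes down to simple arithmetic: a centroid is a mean, and a mean is easily dragged around by extreme values. The sketch below uses made-up one-dimensional values (not taken from any real data set) to show how a single outlier can move a cluster's centroid away from the bulk of its points.

```python
from statistics import mean

# Hypothetical 1-D cluster: four points that sit close together.
cluster = [1.0, 2.0, 3.0, 2.0]
centroid = mean(cluster)

# The same cluster with one extreme value added.
cluster_with_outlier = cluster + [42.0]
shifted_centroid = mean(cluster_with_outlier)

print(centroid)          # 2.0
print(shifted_centroid)  # 10.0
```

With the outlier included, the centroid no longer lies near any of the original four points, which is why it is worth checking for outliers before clustering.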

Steps to Creating a K-Means Model

There are three main steps when using the k-means clustering technique.

Step #1

First, you need to choose a value for k based on your data and what value makes the most sense. K is the number of categories, or clusters, you believe your data set contains. If you are unsure what value to give k, it is best to try different values until you find one that works well for your data set. You can then compare the models generated from the different k values and choose the one that makes the most sense to you.
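One common way to compare models built with different k values is to add up the squared distance from every point to the centroid of its own cluster; this total is often called inertia. The sketch below is only an illustration: the function names and the tiny data set are hypothetical, and the clusters are written out by hand rather than produced by an actual k-means run.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def inertia(clusters):
    """Total squared distance from each point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sq_dist(p, c) for p in points)
    return total

# Hypothetical 2-D data with two visibly separate groups.
group_a = [(1.0, 1.0), (1.0, 3.0)]
group_b = [(9.0, 1.0), (9.0, 3.0)]

one_cluster = inertia([group_a + group_b])  # k = 1
two_clusters = inertia([group_a, group_b])  # k = 2
print(one_cluster, two_clusters)            # 68.0 4.0
```

Inertia keeps falling as k grows, so a lower number alone does not mean a better model. A reasonable rule of thumb is to look for the k after which the improvement levels off.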

Step #2

Second, you need to create k clusters by assigning each data point to the nearest cluster. After you randomly choose the initial clusters, a centroid (the center-most point of a cluster) is generated from the mean of the data points in each cluster.
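A single pass of this step might look like the sketch below. The helper names assign and update are made up for this post, and the starting centroids are fixed by hand instead of chosen at random so the result is easy to follow.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def update(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for i in range(k):
        members = [p for p, label in zip(points, labels) if label == i]
        # Assumes every cluster has at least one member.
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

# Hypothetical data and hand-picked starting centroids.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign(points, centroids)
centroids = update(points, labels, k=2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.5, 1.0), (8.5, 8.0)]
```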

Step #3

Lastly, you need to repeat the previous step until the convergence criterion is reached. On each pass, points are reassigned to their nearest centroid and the centroids are recomputed, until the centroid values no longer change. Once they stop changing, the algorithm has converged on a clustering for your data set. Because the result depends on the random starting point, it is worth running the algorithm more than once.
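Put together, the three steps can be sketched in a few lines of plain Python. This is a minimal illustration and not a production implementation: the function name k_means, the max_iter and seed parameters, and the choice to start from k randomly picked data points are all assumptions made for this example.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Convergence criterion: no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two small, well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2)
print(sorted(centroids))
print(labels)
```

On data this cleanly separated the two groups are recovered, although which group ends up labeled 0 or 1 depends on the random start. On messier data, different seeds can settle on different clusterings.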

Where to Use K-Means

There are many applications in many different industries for the k-means clustering algorithm. It can be used to identify different conditions in health screenings, spot changes in urban traffic patterns, separate bot activity from human activity, and segment customers into different target markets. It is often used for business purposes to determine how customer behavior changes over time. If data points move from one centroid to another over time, this can give valuable insight into how your customers’ behavior is changing.

For example, a grocery store can look at the different purchases its customers make in order to determine which coupons to send them. If someone is purchasing frozen meals and beer, the grocery store may classify them into a “millennials” cluster. However, if another person is buying baby formula and cookie dough, the store may classify them into a “new family” cluster. From here the grocery store can choose the number of customer segments it believes it has and run k-means clustering on this data set. It can repeat the computation with different k values to see which number of customer segments, and which way of classifying them, makes the most sense for the store.

In the end, the grocery store will be able to more accurately classify each of its customers and send them the coupons they are most likely to use.
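To make the grocery example a little more concrete, here is a small sketch of how a customer could be matched to a segment once the centroids are known. Everything in it is hypothetical: the two features, the centroid values, and the function name are invented for illustration. Note that k-means itself only produces numbered clusters; names like “millennials” are labels a person attaches afterwards.

```python
import math

# Hypothetical centroids from an earlier k-means run. Each one is
# (frozen meal and beer purchases, baby product purchases) per month.
segments = {
    "millennials": (12.0, 0.5),
    "new family": (2.0, 9.0),
}

def nearest_segment(customer):
    """Name of the segment whose centroid is closest to the customer."""
    return min(segments, key=lambda name: math.dist(customer, segments[name]))

print(nearest_segment((10.0, 1.0)))  # millennials
print(nearest_segment((1.0, 7.0)))   # new family
```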

The Mathematics Behind K-Means

After Step #2 in creating a k-means model, centroids are computed as the mean of the data points in each cluster. The distance between a data point and a centroid is measured with a formula called Euclidean distance.

https://en.wikipedia.org/wiki/Euclidean_distance

Never heard of Euclidean distance? Actually, you probably have! It’s the distance formula that you probably learned in school. It’s simply the straight-line distance between two points. In this case, the distance is calculated between each centroid and each data point, and every point joins the cluster of its closest centroid. Then the mean of all the points in a cluster is taken and used as that cluster’s new centroid.
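For reference, the standard two-dimensional form of the distance formula is written out below, together with the usual definition of a centroid as the mean of a cluster's points. The symbols p, q, C and \mu_C are notation chosen for this post, where p = (p_1, p_2) and q = (q_1, q_2) are two points and C is the set of points in one cluster.

```latex
d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}
\qquad
\mu_C = \frac{1}{|C|} \sum_{x \in C} x
```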

https://stackoverflow.com/questions/32930647/spectral-clustering-and-multi-dimensional-scaling-in-pytho

While most problems are multidimensional, like the one in the plot linked above, it is still important that you understand the basic 2D formula.
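The same formula extends to any number of dimensions by summing one squared difference per coordinate. A minimal sketch, using a helper name (euclidean) made up for this post:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))              # 2-D: 5.0
print(euclidean((1, 2, 3), (1, 2, 3)))        # identical points: 0.0
print(euclidean((0, 0, 0, 0), (1, 1, 1, 1)))  # 4-D: 2.0
```

In Python 3.8 and later, the standard library's math.dist computes the same value.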

Conclusion

K-means clustering is a fast and efficient algorithm to classify data points into categories when you have little available information about your data. However, keep in mind this algorithm may not be the best technique for your data set. If only one cluster naturally forms, it’s likely the k-means algorithm won’t give you the results you want. In this case, you might want to turn to another machine learning algorithm.

As with all unsupervised learning, it is important that you generally understand your data before deciding which technique will work best in solving your problem. In return, these techniques can help you understand data points that you previously did not. Using the right algorithm saves you time and helps you gain more accurate results. And who doesn’t want to use that saved time working on their next machine learning project?

For more resources, check out some projects using k-means clustering:

DISCLOSURE STATEMENT: © 2019 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
