K-Means Clustering For Beginners

Aug 30, 2021

What Is Unsupervised Learning?

Unsupervised learning is a type of machine learning in which the training data is unlabeled. In a supervised learning system, the labels provide a sort of North Star, or ground truth, that a model uses to make and assess predictions. Without these labels, as is the case in unsupervised learning, a model is left to find relationships, patterns, and similarities on its own.

Among the most common unsupervised learning algorithms is k-means clustering. Clustering is a technique that segments data into groups according to distance. Data points that are closer together have more in common, and data points that are farther apart have less in common. Ideally, data points will be highly similar within a cluster and dissimilar across clusters. Let’s break down the steps that the k-means algorithm takes to group data and optimize cluster centroids.

The Steps

The first step the k-means algorithm takes is to initialize the centroids, or cluster centers. With purely random initialization, the starting centroids will be different each time you run the algorithm, which can result in different final centroids. Sklearn’s k-means implementation uses an initialization method called ‘k-means++’ by default, which is meant to speed up convergence by choosing the initial centroids more carefully: each new centroid is sampled with a preference for points far from the centroids already chosen, so the starting centroids tend to be spread out across the data. Once the centroids are initialized, every data point is assigned to its closest centroid (k-means uses the Euclidean distance to measure distance from centroids).
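To make the initialization and assignment steps concrete, here is a minimal NumPy sketch. It is not sklearn's implementation: it uses plain random initialization rather than k-means++, and the toy data, the seed, and the variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy data (hypothetical): six points in 2-D forming two loose groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
k = 2

# Plain random initialization: pick k distinct data points as starting centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

# Euclidean distance from every point to every centroid, shape (n_points, k)
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster of its nearest centroid
labels = distances.argmin(axis=1)
print(labels)
```

Because the starting centroids are chosen at random, a different seed can give a different first assignment, which is the sensitivity that k-means++ is meant to reduce.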

Next comes the ‘mean’ aspect of the k-means algorithm. The mean of each cluster is calculated from all the points within that cluster, the centroids are moved to those mean values, and the assignment step begins all over again. The goal of the k-means algorithm is to choose centroids that minimize the within-cluster sum of squared errors, or inertia. This iterative process continues until no data point is reassigned to a different cluster.
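The whole loop described above can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (random initialization, a hypothetical `kmeans_sketch` helper, toy data), not the code sklearn runs.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Hypothetical helper: basic k-means with random initialization."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once no point changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    # Inertia: sum of squared distances from each point to its centroid
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

# Toy data (hypothetical): two well-separated groups of three points
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centroids, labels, inertia = kmeans_sketch(X, k=2)
print(labels, inertia)
```

In practice sklearn also stops early when the centroids barely move between iterations, so treat the "no reassignment" check here as the textbook version of the stopping rule.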

Using K-Means Clustering In Python

K-means clustering is made simple with the sklearn library. You’ll first want to import KMeans from sklearn.cluster. In the example below I’ve loaded the all-too-common Iris dataset to cluster.
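A minimal version of that setup might look like the following. The `random_state` and `n_init` values are my own choices for repeatable results, not something the algorithm requires.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# 150 flowers, 4 measurements each; the species labels are deliberately ignored
X = load_iris().data

# n_clusters is the number of centroids; init defaults to 'k-means++'
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
```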

As you can see, you will also need to specify the number of centroids your model will have. In this case I have chosen 3 centroids because I know that I have three distinct flower species: Iris setosa, Iris virginica, and Iris versicolor. In real use cases you will often not know how many clusters to pick, but I will go over how to find the optimal number of clusters later.

After k-means is fitted to my data I can view which cluster each data point was assigned to…
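As a sketch, the assignments live in the fitted model's `labels_` attribute. The model settings below (`n_init`, `random_state`) are assumptions for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# One cluster index (0, 1, or 2) per row of X
labels = kmeans.labels_
print(labels[:10])

# How many flowers ended up in each cluster
print(np.bincount(labels))
```

The cluster numbers are arbitrary, so cluster 0 does not necessarily correspond to the first species in the dataset.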

… or view the locations of final centroids…
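The centroid coordinates are stored in `cluster_centers_`, with one row per cluster and one column per feature. Again, the model settings here are my own assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Shape (3, 4): three centroids, each described by the four iris measurements
centers = kmeans.cluster_centers_
print(centers)
```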

… or more importantly make predictions using unseen data.
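Here is a sketch of predicting on new points. The two flowers below are hypothetical measurements I made up for illustration; `predict` simply assigns each one to its nearest learned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hypothetical unseen flowers: sepal length, sepal width, petal length, petal width (cm)
new_flowers = np.array([[5.1, 3.5, 1.4, 0.2],
                        [6.7, 3.0, 5.2, 2.3]])

# Returns the index of the closest centroid for each new row
predictions = kmeans.predict(new_flowers)
print(predictions)
```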

Do you have the right number of centroids?

The elbow method is one way to find the optimal number of clusters. The ideal k keeps inertia low while using as few clusters as needed to explain the variation in the data. Note that the more clusters there are, the lower inertia will be, which is why you look for the elbow of the plot: the point where adding another cluster stops reducing inertia by much.
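To see the numbers behind an elbow plot, you can fit one model per candidate k and record its inertia. This sketch uses plain sklearn and prints the values rather than plotting them; the range of k and the model settings are my own assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit a separate model for each candidate k and keep its inertia
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Inertia keeps falling as k grows; the elbow is where the drops become small
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the familiar elbow curve.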

The yellowbrick library provides a simple elbow plot generator, KElbowVisualizer. Below is an example using the Iris dataset. As you can see, I have passed KElbowVisualizer the model I would like to use and the range of k values I would like to examine.

The elbow is at 3 because, as I mentioned before, there are three distinct iris species.

Use Cases

Market segmentation is one of the most common use cases for clustering. By using k-means clustering to segment your market, you can group your client base by similar interests or similar behavior and form separate marketing and advertising strategies for each distinct group. This allows you to optimize marketing spend and maximize conversion. Knowing your client base (their interests, their behavior, their needs and wants) will make targeting them all the easier.