# K-Means Clustering Explained

Clustering is a type of unsupervised learning in which patterns are drawn from unlabeled data. It is generally used to capture the meaningful structure, underlying processes, and groupings inherent in a dataset. The task is to divide the data points into groups such that points in the same group are more similar to each other than to points in other groups. In short, clustering groups objects based on their similarities and dissimilarities.

Let’s take an example. Imagine you are a manager at a Walmart store and want to understand your customers better so you can grow the business with new, better-targeted marketing strategies. Segmenting customers manually is difficult, but if you have data on their age and purchase history, clustering can group customers based on their spending. Once the segmentation is done, you can define a different marketing strategy for each group, tailored to its audience.

# K-Means Clustering

K-means is a centroid-based clustering algorithm: we calculate the distance between each data point and each centroid, and assign the point to the nearest one. The goal is to partition the dataset into K groups.

It is an iterative process: data points are repeatedly assigned to groups, and over the iterations they cluster together based on similar features. The objective is to minimize the sum of distances between each data point and its cluster centroid, so that each point ends up in the group it truly belongs to.
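This objective is often called the within-cluster sum of squares, or inertia. A minimal sketch of how to compute it, assuming NumPy (the function name `inertia` is my own, not from the text above):

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

# Two points, both assigned to a single centroid at (1, 0):
X = np.array([[0.0, 0.0], [2.0, 0.0]])
centroids = np.array([[1.0, 0.0]])
labels = np.array([0, 0])
print(inertia(X, labels, centroids))  # 2.0  (each point is distance 1 away)
```

K-means drives this quantity down at every iteration, which is why the algorithm is guaranteed to converge (though not necessarily to the global minimum).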

Here, we divide the data space into K clusters and assign a mean value to each. Each data point is placed in the cluster whose mean value is closest to it. Several distance metrics can be used to compute this distance.
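The most common choice is Euclidean (straight-line) distance. A quick sketch, assuming NumPy:

```python
import numpy as np

def euclidean_distance(x, c):
    """Straight-line (Euclidean) distance between a point x and a centroid c."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(c)) ** 2)))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0  (a 3-4-5 right triangle)
```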

# How does K-means work?

Let’s take an example to understand how K-means works step by step. The algorithm can be broken down into five steps.

**Choosing the number of clusters**

The first step is to define the K number of clusters in which we will group the data. Let’s select K=3.

**Initializing centroids**

A centroid is the center of a cluster, but initially the exact centers are unknown, so we select random data points and define them as the centroids, one per cluster. Here we will initialize three centroids in the dataset.
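One common way to do this, sketched below, is to sample K distinct rows of the data at random (function and variable names here are illustrative, not from the text):

```python
import numpy as np

def init_centroids(X, k, seed=0):
    """Pick k distinct data points at random to serve as initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)  # k distinct row indices
    return X[idx]

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
centroids = init_centroids(X, k=3)
print(centroids.shape)  # (3, 2): three centroids, one per cluster
```

Because the result depends on which points are chosen, K-means is usually run several times with different random initializations and the best run is kept.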

**Assign data points to the nearest cluster**

Now that the centroids are initialized, the next step is to assign each data point *Xn* to its closest cluster centroid *Ck*. In this step, we first calculate the distance between data point X and each centroid C using the Euclidean distance metric, and then assign the point to the cluster whose centroid is nearest.
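This assignment step can be written in a few lines with NumPy broadcasting; the sketch below computes all point-to-centroid distances at once (names are my own):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    # dists[i, k] = distance from point i to centroid k
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

X = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.0]])
print(assign_clusters(X, centroids))  # [0 1 0]
```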

**Re-initialize centroids**

Next, we re-initialize each centroid by computing the average of all data points assigned to its cluster.
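A sketch of this update step (this simple version assumes no cluster ends up empty; a production implementation would handle that case):

```python
import numpy as np

def update_centroids(X, labels, k):
    """Move each centroid to the mean of the points assigned to it.
    Assumes every cluster has at least one point."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
print(update_centroids(X, labels, k=2))  # centroid 0 -> (1, 0), centroid 1 -> (10, 10)
```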

**Repeat steps 3 and 4**

We keep repeating steps 3 and 4 until the centroids stabilize and the assignments of data points to clusters no longer change.
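Putting the steps together, a plain K-means loop might look like the following sketch (not production code: it assumes NumPy and that no cluster goes empty during the iterations):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: random init, then alternate assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Step 1-2: choose k and pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1
        )
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and hence assignments) stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of three points each:
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(labels)  # the two groups get different labels (which label is which depends on the init)
```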

Does this iterative process sound familiar? K-means follows the same approach as Expectation-Maximization (EM). EM is an iterative method for finding maximum-likelihood estimates of parameters when the model depends on unobserved (latent) variables. It consists of two steps, Expectation (E) and Maximization (M), and alternates between them.

For K-means, the Expectation (E) step assigns each data point to its most likely cluster, and the Maximization (M) step recomputes the centroids by least-squares optimization: the mean of a cluster is exactly the point that minimizes the sum of squared distances to its members.