K-Means Clustering — An Unsupervised Machine Learning Algorithm
K-Means is a clustering algorithm used when you have unlabeled data. As the title suggests, it is an unsupervised machine learning algorithm and a widely used tool in data science. In this article, we will briefly discuss K-Means Clustering and its implementation.
K-Means is a partitioning clustering method and one of the simplest yet most powerful machine learning algorithms. Because it is unsupervised, K-Means draws its inferences from datasets using only input vectors, without referring to labeled outcomes.
Now that we have a basic understanding of K-Means, let's dive deeper into the algorithm.
What is K-Means?
K-Means divides objects into clusters so that objects within a cluster are similar to each other and dissimilar to objects in other clusters.
You may now be wondering how the algorithm knows how many clusters to create. That is the value of 'K': for example, K = 5 means five clusters. The 'Means' in K-Means refers to averaging the data, i.e. finding the centroid (the center of a cluster).
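As a quick illustration of the 'Means' part, here is a minimal sketch using NumPy (the three points below are made up for this example): the centroid of a cluster is simply the coordinate-wise average of its points.
import numpy as np

# A hypothetical cluster of three 2-D points (example data only)
cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# The centroid is the mean of each coordinate
centroid = cluster_points.mean(axis=0)
print(centroid)  # [3. 4.]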
Now that we understand what K-Means is, let's look at how it works.
How the K-Means Algorithm Works
The K-Means clustering algorithm involves the following steps (a minimal from-scratch sketch of this loop is shown right after the list):
Step-1: Choose the value of K to decide the number of clusters.
Step-2: Select K random points as the initial centroids.
Step-3: Assign each data point to its closest centroid, which forms the K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster (the mean of the points assigned to it).
Step-5: Repeat step 3, reassigning each data point to the closest of the new centroids.
Step-6: If any reassignment occurred, go back to step 4; otherwise, the algorithm has converged and you can stop.
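To make these steps concrete, here is a minimal from-scratch sketch of the loop in NumPy. This is only an illustration of the steps above, not the scikit-learn implementation used later in this article; the function name kmeans_sketch, the random initialization, and the convergence check are choices made for this example, and edge cases such as empty clusters are not handled.
import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3 and 5: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
Calling kmeans_sketch(X, 2) on a 2-D dataset returns the final centroids and the cluster label of each point.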
Now that we are familiar with how K-Means works, let's look at an implementation of the algorithm in Python.
Code
Following is a simple implementation of K-Means using scikit-learn. If you want to understand the code in more depth, I have provided a link to it in the references section of this article.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

# Generate 100 random 2-D points: the first 50 around (-1, -1), the last 50 around (2, 2)
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1

# Visualize the raw data
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

# Fit K-Means with K = 2
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)

# Inspect the learned centroids
Kmean.cluster_centers_

# Plot the data with the two centroids (the coordinates below are the values
# returned by cluster_centers_ for one run; they will differ slightly on yours)
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

# Cluster label of each training point
Kmean.labels_

# Predict the cluster of a new sample
sample_test = np.array([-3.0, -3.0])
second_test = sample_test.reshape(1, -1)
Kmean.predict(second_test)
Conclusion
Typically, K-Means is applied to data that is numeric, continuous, and has a relatively small number of dimensions, and it is used to find groups that have not been explicitly labeled in the data. For example, it can be used for document classification, delivery store optimization, or customer segmentation.
That's all for now! I hope this article has given you a brief overview of K-Means Clustering.
Feel free to leave a comment below if you have any suggestions or questions.