Machine Learning Algorithm
The k-means clustering method is an unsupervised learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k-means is one of the oldest and most approachable. These traits make implementing k-means clustering in Python reasonably straightforward, even for novice programmers and data scientists.
Working of K-Means Algorithm
The K-Means clustering algorithm works as follows:
- First, specify the number of clusters, K, that the algorithm should generate.
- Choose K data points at random and assign each to its own cluster, so that each chosen point seeds one cluster.
- Compute the cluster centroids.
- Repeat the following steps until the centroids are stable, that is, until the assignment of data points to clusters no longer changes:
  - Compute the squared distances between the data points and the centroids.
  - Assign each data point to the cluster whose centroid is closest to it.
  - Recompute the centroid of each cluster by averaging all of the cluster's data points.
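The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`squared_distance`, `mean_point`, `kmeans`), the `max_iter` and `seed` parameters, and the sample data are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, sse) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Pick K distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignment of points to clusters no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the average of its cluster's points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    # Sum of squared distances between points and their assigned centroids.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Made-up data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, sse = kmeans(data, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.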
We must keep the following points in mind:
- It is advisable to normalize the data before applying clustering algorithms such as K-Means, because these algorithms use distance-based measures to judge the similarity between data points.
- Because K-Means is iterative and its centroids are initialized at random, it may become stuck in a local optimum and fail to converge to the global optimum. It is therefore advisable to run the algorithm several times with different centroid initializations.
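Both recommendations can be combined in a short sketch: standardize each feature, then run K-Means from several random starts and keep the run with the smallest sum of squared distances. The helper names (`standardize`, `kmeans_once`, `best_of_restarts`), the choice of zero-mean, unit-variance scaling, and the sample data are assumptions for illustration only.

```python
import random
import statistics


def standardize(points):
    """Rescale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero spread; divide by 1.0 instead of 0.0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(p, means, stds)) for p in points]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-Means run from a random start; returns (sse, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return sse, labels


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-Means n_restarts times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])


# Made-up data whose two features live on very different scales.
raw = [(30000.0, 25.0), (32000.0, 47.0), (31000.0, 52.0),
       (90000.0, 23.0), (88000.0, 50.0), (91000.0, 49.0)]
scaled = standardize(raw)
best_sse, best_labels = best_of_restarts(scaled, k=2)
print(best_sse, best_labels)
```

Keeping the lowest-SSE run does not guarantee the global optimum; it only makes a poor local optimum less likely.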
Graphical Form of K Means Clustering
1. Let us pick the number of clusters, K=2, to partition the dataset. We select two random locations to serve as the initial cluster centroids.
2. Each data point in the scatter plot is now assigned to its nearest centroid. To do this, we draw a median line between the two centroids, that is, the line on which every location is equally far from both centroids.
3. The points on the left side of the line are closer to the blue centroid, while the points on the right side are closer to the yellow centroid. The left cluster therefore takes the blue centroid, and the right cluster takes the yellow centroid.
4. Repeat the procedure, this time with new centroids. Each new centroid is the center of gravity (the mean) of the points currently assigned to its cluster.
5. After that, we re-assign each data point to its nearest new centroid by drawing the median line again. A yellow data point that now falls on the blue side of the median line joins the blue cluster.
6. Now that reassignment has occurred, we repeat the previous step of locating new centroids.
7. As before, each new centroid is the center of gravity of the points in its cluster.
8. As in the previous stages, we draw the median line and reassign the data points after locating the new centroids.
9. Finally, once no data point lies on the wrong side of the median line, two distinct groups have been established and no dissimilar points share a group, so the procedure stops.
These are the final clusters.
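The median-line rule in the walkthrough is the nearest-centroid rule in geometric form. The sketch below checks this on a few points, reading the "median line" as the perpendicular bisector of the segment joining the two centroids. The function names, the centroid coordinates, and the sample points are assumptions chosen for illustration.

```python
def closer_to_first(p, c1, c2):
    """True if p is strictly closer to c1 than to c2 (Euclidean distance)."""
    d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
    d2 = sum((a - b) ** 2 for a, b in zip(p, c2))
    return d1 < d2


def on_first_side(p, c1, c2):
    """True if p lies on c1's side of the perpendicular bisector of c1-c2."""
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    direction = [b - a for a, b in zip(c1, c2)]  # points from c1 towards c2
    # A negative projection onto the c1 -> c2 direction means "on c1's side".
    return sum((x - m) * d for x, m, d in zip(p, midpoint, direction)) < 0


# Hypothetical centroids and data points.
blue, yellow = (2.0, 3.0), (7.0, 6.0)
points = [(1.0, 1.0), (3.0, 5.0), (4.0, 4.0), (6.0, 7.0), (8.0, 5.0), (5.0, 4.0)]

by_distance = [closer_to_first(p, blue, yellow) for p in points]
by_line = [on_first_side(p, blue, yellow) for p in points]
for p, is_blue in zip(points, by_line):
    print(p, "blue" if is_blue else "yellow")
```

Both tests agree for every point that is not exactly on the line, which is why drawing the line and comparing distances lead to the same clusters.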
Advantages and Disadvantages
The following are some of the advantages of the K-Means clustering algorithm:
- It is simple to grasp and put into practice.
- K-means is typically faster than hierarchical clustering when there is a large number of variables.
- An instance can move to a different cluster when the centroids are recomputed.
- Compared with hierarchical clustering, K-means tends to produce tighter clusters.
Some of the drawbacks of K-Means clustering techniques are as follows:
- The number of clusters, i.e., the value of k, is difficult to estimate.
- The output is strongly affected by initial inputs such as the number of clusters (the value of k).
- The order in which the data is entered can have a considerable impact on the final output.
- It is quite sensitive to rescaling. If we rescale our data using normalization or standardization, the final result can be drastically different.
- It is not well suited to clustering tasks in which the clusters have a complicated geometric shape.
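The sensitivity to rescaling can be seen in a single assignment step. In the sketch below, multiplying the first feature by 100 (for instance, switching from metres to centimetres) changes which centroid two of the four points are assigned to. The helper names and all coordinates are made up for illustration.

```python
def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
    )


def rescale(point, factors):
    """Multiply each coordinate of a point by its own scale factor."""
    return tuple(x * f for x, f in zip(point, factors))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 10.0), (1.0, 10.0)]
centroids = [(0.0, 0.0), (1.0, 10.0)]

# Original units: the second feature dominates the distance.
before = [nearest(p, centroids) for p in points]

# Rescale only the first feature; now it dominates instead.
factors = (100.0, 1.0)
scaled_points = [rescale(p, factors) for p in points]
scaled_centroids = [rescale(c, factors) for c in centroids]
after = [nearest(p, scaled_centroids) for p in scaled_points]

print(before)  # [0, 0, 1, 1]
print(after)   # [0, 1, 0, 1]
```

The data is identical in both cases; only the units differ, yet the grouping flips from "split by the second feature" to "split by the first".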
Every machine learning engineer wants their algorithms to make accurate predictions. Learning algorithms are often classified as supervised or unsupervised. K-means clustering is an unsupervised technique: it requires no labeled response for the given input data.
K-means clustering is a widely used approach for clustering. Generally, practitioners begin by learning about the structure of the dataset. K-means partitions data points into distinct, non-overlapping groups. It works very well when the clusters have a spherical form, but it performs poorly when the clusters' geometric shapes depart from spherical.
Additionally, it does not learn the number of clusters from the data; the number must be specified beforehand. It is always beneficial to understand the assumptions behind algorithms and methods in order to better understand each technique's strengths and drawbacks. This will help you determine when and under what conditions to use each method.