An Intuition to K-Means Clustering
The very basics of K-Means Clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Unsupervised algorithms make inferences from data sets using only input vectors, without referring to known, or labeled, outcomes.
What is unsupervised learning?
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. Unsupervised algorithms find patterns based only on input data. The technique is useful when we’re not quite sure what to look for.
Here are the main reasons for using unsupervised learning:
- Unsupervised machine learning can surface previously unknown patterns in data.
- Unsupervised methods help you find features that can be useful for categorization.
- It can take place in real time, so all the input data is analyzed and labeled in the presence of learners.
- It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.
What are clustering algorithms?
Clustering algorithms divide a population of data points into groups such that data points within the same cluster are more similar to each other than to those in other groups. In short, the aim is to find groups with similar traits and assign them to clusters.
What is K-means Clustering?
In this algorithm, we group the items into K clusters such that items in the same cluster are as similar to each other as possible, and items in different clusters are as different as possible.
Distance measures (like Euclidean distance) are used to calculate similarity and dissimilarity between the data points. Each cluster has a centroid. The centroid is the mean of the points in the cluster and can be thought of as the point that is most representative of the cluster, although it does not have to coincide with an actual data point.
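To make the two ideas above concrete, here is a minimal sketch of Euclidean distance and a centroid in plain Python. The helper names `euclidean` and `centroid` and the three-point toy cluster are assumptions made up for illustration, not part of any library.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical toy cluster, chosen only for illustration.
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]

center = centroid(cluster)                           # mean of each coordinate
distances = [euclidean(p, center) for p in cluster]  # one distance per point
```

In this toy case the centroid works out to (2.0, 3.0), which is not one of the three input points: a centroid summarizes a cluster without having to be a member of it.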
How does K-means Clustering work?
1. K initial “means” (for example, k = 3) are randomly generated within the data domain.
2. K clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the K clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence, that is, until the cluster assignments no longer change.
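The four steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the toy data are all hypothetical. It also picks the initial means from the observations themselves (a common variant of generating them within the data domain) and keeps the old mean if a cluster ends up empty.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: choose k observations as the initial means (assumed variant).
    means = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: associate every observation with its nearest mean.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, means[j]))
            for p in points
        ]
        # Step 4: stop once no assignment changes (convergence).
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: the centroid of each cluster becomes its new mean.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old mean
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignments

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, labels = kmeans(points, k=2)
```

On this toy data the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers. `math.dist` requires Python 3.8 or later.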
Some applications of unsupervised learning techniques are:
1. Clustering automatically splits the dataset into groups based on their similarities.
2. Anomaly detection can discover unusual data points in your dataset. It is useful for finding fraudulent transactions.
3. Association mining identifies sets of items that often occur together in your dataset.
4. Latent variable models are widely used for data preprocessing, such as reducing the number of features in a dataset or decomposing the dataset into multiple components.
Disadvantages of an Unsupervised Learning approach
1. You cannot get precise information regarding data sorting or the output, because the data used in unsupervised learning is unlabeled and the correct result is not known.
2. Results tend to be less accurate because the input data is not labeled by people in advance, so the machine has to find the structure itself.
3. In image classification, the spectral classes found by the algorithm do not always correspond to informational classes. The user needs to spend time interpreting and labeling the classes that result from the classification.
4. Spectral properties of classes can also change over time, so the same class information may not carry over from one image to another.