One of the popular problems in unsupervised learning is clustering. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
As in above diagram the result of clustering is colouring of the squares into three clusters.
One of the basic approach to solve cluster analysis problem is K-means. K-means algorithm partitioned the data into K clusters .
In general, suppose we have n data points, that have to be partitioned in K clusters. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the K positions of the clusters that minimize the distance(for example Euclidian distance) from the data points to the cluster by minimizing distance loss function. K means do hard assignment for the data points that means a point either totally belongs to a cluster or not at all. For detailed description about how K-means clustering algorithm work see the reference section.
Issues with K means:
There are some issues with K-means algorithm. K-means often doesn’t work when clusters are not round shaped because of it uses some kind of distance function and distance is measured from cluster center.
Another major problem with K-Means clustering is that Data point is deterministically assigned to one and only one cluster, but in reality there may be overlapping between the cluster for example picture shown below:
Here image b denotes unlabeled data that we are going to cluster and image c denotes the ambiguity that occur at overlapping region.
While doing clustering we can not tell exactly about the cluster of data points that belongs to overlapping regions of picture c. For fixing this data points are assigned to clusters with certain probabilities and this is what gaussian mixture model do.
Gaussian Mixture Models(GMM):
For address these problems gaussian mixture model was introduced. A probabilistic approach to clustering addressing many of these problems. In this approach we describe each cluster by its centroid (mean), covariance , and the size of the cluster(Weight).
Here rather than identifying clusters by “nearest” centroids, we fit a set of k gaussians to the data. And we estimate gaussian distribution parameters such as mean and Variance for each cluster and weight of a cluster. After learning the parameters for each data point we can calculate the probabilities of it belonging to each of the clusters.
So mathematically we can define gaussian mixture model as mixture of K gaussian distribution that means it’s a weighted average of K gaussian distribution. So we can write data distribution as
Where N(x|mu_k,sigma_k) represents cluster in data with mean mu_k and co variance epsilon_k and weight pi_k.
After learning the parameters mu_k, epsilon_k and pi_k we learnt k gaussian distribution from the data distribution that k gaussian distributions are clusters.
Since we learnt K distributions(cluster) then for anydata point x depending upon its distance from every distribution we will get probability of it belong to each distribution.
For example after getting component distribution in below image depending upon variance and mean of particular cluster we get probability of any data point x belongs to every cluster.
(Picture taken from Stack Overflow)