Gaussian Mixture Model

Yara Amin
May 29, 2020 · 3 min read


Clustering is an unsupervised learning problem where we intend to find clusters of points in our dataset that share some common characteristics.

K-Means Clustering

K-Means is an algorithm for linearly separable data that classifies samples, based on their attributes/features, into K clusters. Samples are grouped by minimizing the distance between each sample and its cluster centroid: centroids are assigned, then iteratively updated based on the distances of the points assigned to them. This is called hard assignment: each point is treated as belonging to exactly one centroid with certainty, and the centroid positions are optimized with the least-squares distance criterion.
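
As a quick illustration, here is a minimal hard-assignment run using scikit-learn's KMeans on toy data (the blob dataset and the choice of K=3 are just placeholders for this sketch):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-D dataset with three blobs (placeholder data for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Hard assignment: each point receives exactly one cluster label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # optimized centroid positions
print(kmeans.labels_[:10])      # cluster index of the first 10 points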

Now, suppose we want to know:

  • What is the probability that a point x is in cluster m?
  • What is the shape of each cluster?

To answer these questions, we need probabilistic clustering.

There is another way to deal with clustering problems: a model-based approach, which consists of using a particular model for each cluster and optimizing the fit between the data and the model.

In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian. The entire data set is therefore modelled by a mixture of these distributions.

The most widely used clustering method of this kind is the Gaussian Mixture Model (GMM).

A Gaussian mixture is a function composed of several Gaussians, each identified by k ∈ {1,…, K}, where K is the number of clusters in our dataset. Each Gaussian k in the mixture is defined by the following parameters (see the sketch after the list):

  • A mean μ that defines its centre.
  • A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
  • A mixing probability π that defines how big or small the Gaussian's contribution to the mixture is.
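
Putting these together, the mixture density is p(x) = Σₖ πₖ 𝒩(x | μₖ, Σₖ). As a minimal sketch with placeholder toy data, scikit-learn's GaussianMixture exposes exactly these three sets of parameters after fitting:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Placeholder toy data for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
print(gmm.weights_)      # π: mixing probabilities, sum to 1
print(gmm.means_)        # μ: one centre per component
print(gmm.covariances_)  # Σ: one covariance matrix per component

# Soft assignment: probability that each point belongs to each component
print(gmm.predict_proba(X[:5]))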

GMMs work well on non-linear datasets.

Advantages:

  1. Does not assume clusters to follow any specific geometry; it works well with non-linear geometric distributions as well.
  2. Does not bias the cluster sizes towards a specific structure, as K-Means does (circular clusters).

Disadvantages:

1. Uses all the components it has access to, so initializing the clusters becomes difficult when the dimensionality of the data is high.

2. Difficult to interpret.

Implementation in Python

Just as a side note, the full implementation is available as a Colab notebook.

Silhouette score

The silhouette score measures how compact and well separated the clusters are. The closer the score is to one, the better the clustering. Read more here

Since the fitting procedure is not deterministic, we run twenty fits for each number of clusters, then take the mean value and the standard deviation of the best five runs.

import numpy as np
from sklearn import metrics
from sklearn.mixture import GaussianMixture

# X_principal: the (e.g. PCA-reduced) feature matrix prepared in the notebook
n_clusters = np.arange(2, 8)
sils = []
sils_err = []
iterations = 20
for n in n_clusters:
    tmp_sil = []
    for _ in range(iterations):
        gmm = GaussianMixture(n, n_init=2).fit(X_principal)
        labels = gmm.predict(X_principal)
        sil = metrics.silhouette_score(X_principal, labels, metric='euclidean')
        tmp_sil.append(sil)
    # Keep only the best runs (SelBest is a helper, sketched below)
    val = np.mean(SelBest(np.array(tmp_sil), int(iterations / 5)))
    err = np.std(tmp_sil)
    sils.append(val)
    sils_err.append(err)
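
SelBest is a helper from the companion notebook and is not defined above. A minimal sketch of one plausible implementation, assuming "best" means the smallest values (which suits the JS distance used later; for the silhouette score, where higher is better, you would sort in descending order instead):

def SelBest(arr, X):
    # Return the X smallest values of arr (assumed meaning of "best")
    idx = np.argsort(arr)[:X]
    return arr[idx]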

Distance between GMMs

Here we split the data into two datasets, each containing a randomly chosen half of the points. For each configuration, we then check how similar the GMMs trained on the two halves are.

Since we are talking about distributions, a natural notion of similarity is the Jensen-Shannon (JS) distance. The smaller the JS distance between the two GMMs, the more the GMMs agree on how to fit the data.

The lower the distance, the better the clustering.
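
For reference, the Jensen-Shannon divergence between two distributions p and q is JS(p, q) = ½·KL(p ∥ m) + ½·KL(q ∥ m), where m = ½·(p + q) is their average. The function below estimates it by Monte Carlo: it samples from each GMM, evaluates the log-densities log p, log q and log m = logaddexp(log p, log q) − log 2 on those samples, and averages the differences; the final square root turns the divergence into the JS distance.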

def gmm_js(gmm_p, gmm_q, n_samples=10**5):
    # Monte Carlo estimate of the Jensen-Shannon distance between two fitted GMMs
    X = gmm_p.sample(n_samples)[0]
    log_p_X = gmm_p.score_samples(X)
    log_q_X = gmm_q.score_samples(X)
    log_mix_X = np.logaddexp(log_p_X, log_q_X)

    Y = gmm_q.sample(n_samples)[0]
    log_p_Y = gmm_p.score_samples(Y)
    log_q_Y = gmm_q.score_samples(Y)
    log_mix_Y = np.logaddexp(log_p_Y, log_q_Y)

    return np.sqrt((log_p_X.mean() - (log_mix_X.mean() - np.log(2))
                    + log_q_Y.mean() - (log_mix_Y.mean() - np.log(2))) / 2)

from sklearn.model_selection import train_test_split

n_clusters = np.arange(2, 8)
iterations = 20
results = []
res_sigs = []
for n in n_clusters:
    dist = []

    for iteration in range(iterations):
        # Fit a GMM on each random half of the data and compare them
        train, test = train_test_split(X_principal, test_size=0.5)

        gmm_train = GaussianMixture(n, n_init=2).fit(train)
        gmm_test = GaussianMixture(n, n_init=2).fit(test)
        dist.append(gmm_js(gmm_train, gmm_test))
    selec = SelBest(np.array(dist), int(iterations / 5))
    result = np.mean(selec)
    res_sig = np.std(selec)
    results.append(result)
    res_sigs.append(res_sig)
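
A simple way to pick the number of clusters is to plot both curves with their error bars and look for the best trade-off. A minimal sketch with matplotlib, using the lists filled above:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Silhouette score per number of clusters (higher is better)
ax1.errorbar(n_clusters, sils, yerr=sils_err)
ax1.set_xlabel("number of clusters")
ax1.set_title("Silhouette score")

# JS distance between the GMMs fitted on the two halves (lower is better)
ax2.errorbar(n_clusters, results, yerr=res_sigs)
ax2.set_xlabel("number of clusters")
ax2.set_title("JS distance")

plt.show()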
