An Overview of Clustering Models (Unsupervised Learning)

Salohy Miarisoa
Sep 2, 2018 · 3 min read

Unsupervised machine learning is a set of algorithms that learn from unlabeled input data: the data comes with no correct answers or error measurements that the algorithm could use as a reference. Common unsupervised learning problems are clustering and association. This post is about clustering; it gives an overview of the models used for clustering and a basic understanding of how each algorithm works.

We talk about clustering when we want to find natural partitions of patterns in given data. It is about grouping a set of objects based on their characteristics and similarities. The overview below shows the different families of clustering models.

Overview of Clustering models

Let us get a general understanding of how each of these models works.

Connectivity Models

The decision to merge clusters is based on their closeness, measured with a distance metric; this can be the Euclidean, Manhattan, or maximum (Chebyshev) distance. The model can use one of the two existing types of hierarchy, and a minimal code sketch follows the list below.

  • Agglomerative clustering, where the algorithm starts with each object as its own cluster, greedily puts the objects with maximum similarity into one cluster, and continues until every object is assigned to a specific cluster.
  • Divisive clustering, where the algorithm starts with all objects in one big cluster, greedily splits the cluster in two, assigns the objects to the two groups so as to maximize within-group similarity, and continues until each cluster contains only one object.
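
To make this concrete, here is a minimal sketch of agglomerative clustering with SciPy. The toy points, the 'average' linkage method, and the two-cluster cut are illustrative assumptions, not choices made in the article.

```python
# Agglomerative clustering on toy 2-D points (hypothetical data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Build the merge tree bottom-up; the metric can be 'euclidean',
# 'cityblock' (Manhattan), or 'chebyshev' (maximum distance).
Z = linkage(X, method='average', metric='euclidean')

# Cut the tree so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # one cluster id per point, e.g. [1 1 1 2 2 2]
```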

Centroid Models

Centroid models are probably the most widely used clustering models, thanks to the K-Means algorithm, which is simple to understand:

  1. We decide how many clusters k we want.
  2. The algorithm places k centroids randomly in the data.
  3. It iteratively computes the distance between each data point and each centroid.
  4. Each point is assigned to the cluster of the centroid with the minimal distance.
  5. The centroid positions are recomputed: each new position is the mean of all points assigned to that cluster in step 4.
  6. Steps 3, 4 and 5 are repeated until no data point moves from one cluster to another (a minimal sketch of this loop follows below).
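
Here is a minimal NumPy sketch of exactly this loop; the two toy blobs and the choice k = 2 are illustrative assumptions (in practice you would typically reach for a library implementation such as scikit-learn's KMeans).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # blob around (0, 0)
               rng.normal(5, 0.5, (20, 2))])  # blob around (5, 5)

k = 2                                                 # step 1: choose k
centroids = X[rng.choice(len(X), k, replace=False)]   # step 2: random centroids

while True:
    # step 3: distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # step 4: assign each point to its nearest centroid
    labels = dists.argmin(axis=1)
    # step 5: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # step 6: stop once the assignments no longer change the centroids
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)  # hard assignment: exactly one cluster id per point
```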

After the clustering is finished, each data point belongs to exactly one cluster. That is what we call Hard Clustering.

Distribution Models

One popular example of this model is the Expectation-Maximization algorithm.

  1. We decide how many clusters k we want.
  2. The algorithm places k Gaussian distributions with random means and variances; these k distributions are interpreted as clusters.
  3. Expectation step: evaluate the conditional probability of each data point in order to find out how likely it belongs to each cluster.
  4. Maximization step: use the conditional probabilities of the points to recompute the mean and the variance of each Gaussian distribution.
  5. Repeat steps 3 and 4 until convergence, which means the probability of each point belonging to a cluster no longer changes (a minimal sketch follows below).
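
As a minimal sketch, scikit-learn's GaussianMixture runs this EM procedure internally; the toy blobs and k = 2 are again illustrative assumptions.

```python
# Soft clustering with a Gaussian mixture fitted by EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (50, 2)),   # blob around (0, 0)
               rng.normal(4, 1.0, (50, 2))])  # blob around (4, 4)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one probability per cluster for every point.
probs = gmm.predict_proba(X)
print(probs[:3].round(2))  # e.g. rows like [0.98 0.02]
```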

After the clustering algorithm is finished, we can only say how likely each point belongs to each cluster. A data point could, for example, belong 80% to one cluster and 20% to another. This is what we call Soft Clustering.

Density Models

The algorithm creates clusters based on dense areas in a d-dimensional space; clusters are separated by areas of lower density. The following is the idea behind the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise):

  1. Choose the two important parameters: the maximum radius of the neighborhood and the minimum number of points in the neighborhood.
  2. Randomly pick a data point that has not been visited and determine whether it is a core point, i.e. whether its neighborhood within the maximum radius contains at least the minimum number of points. If not, label it as an outlier.
  3. Repeat step 2 until the picked point is a core point, then add all directly density-reachable points to its cluster. If an outlier is among the reachable points, relabel it as a border point.
  4. Repeat steps 2 and 3 until all points are assigned to a cluster or labeled as outliers.

PS: A point b is directly density-reachable from a point a if a is a core point and b is within a's maximum radius of neighborhood.

Only after the algorithm has finished do we know how many clusters we have.
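
As a minimal sketch, scikit-learn's DBSCAN implements this idea: eps plays the role of the maximum neighborhood radius and min_samples that of the minimum number of points. The toy data and the parameter values are illustrative assumptions.

```python
# Density-based clustering with DBSCAN on toy data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),   # dense blob around (0, 0)
               rng.normal(3, 0.3, (30, 2)),   # dense blob around (3, 3)
               [[10.0, 10.0]]])               # one isolated point

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # -1 marks points labeled as outliers/noise

# The number of clusters is only known after the run.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, labels)
```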

I hope this helps you get an overview of clustering. If you want to see all of this in practice, you can visit the link below.

Fun with data: What can we do with data?
