Unsupervised Machine Learning: An Introduction to Clustering Algorithms
There are three main approaches to machine learning:
- Supervised learning: Learns from data that contains both the inputs and the expected outputs (i.e., labeled data). Common types are classification, regression, similarity learning, feature learning (automatically discovering representations or features from raw data), and anomaly detection (a special form of classification that learns from data labeled as normal/abnormal).
- Unsupervised learning: Learns from input data only, finding hidden structure in the data. Common types are clustering, feature learning (features are learned from unlabeled data), and anomaly detection (learns from unlabeled data, under the assumption that the majority of entities are normal).
- Reinforcement learning: Learns how an agent should take actions in an environment in order to maximize a reward function. A common formalism is the Markov decision process: a mathematical model of decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. Reinforcement learning does not assume knowledge of an exact mathematical model of the environment.
In this article we will look into one of the unsupervised machine learning approaches: clustering algorithms.
Clustering Algorithms
Types of Clustering
There are four common types of clustering.
Centroid-based Clustering
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to the hierarchical clustering described below. k-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. This article focuses on k-means because it is an efficient, effective, and simple clustering algorithm.
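To make the centroid idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in NumPy. The data, the random seed, and the choice of two clusters are illustrative assumptions, not part of the article:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm): alternate between assigning
    each point to its nearest centroid and moving each centroid to the
    mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random
    # (this sensitivity to initialization is the weakness noted above)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new_centroids
    return labels, centroids

# Two well-separated 2-D blobs (synthetic example data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
```

Production code would typically use a library implementation (e.g., scikit-learn's `KMeans`), which adds smarter initialization such as k-means++ to reduce the sensitivity to initial conditions.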
Density-based Clustering
Density-based clustering connects areas of high example density into clusters. This allows for arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying densities and high dimensions. Further, by design, these algorithms do not assign outliers to clusters.
Distribution-based Clustering
This clustering approach assumes data is composed of distributions, such as Gaussian distributions. In Figure 3, the distribution-based algorithm clusters data into three Gaussian distributions. As distance from the distribution’s center increases, the probability that a point belongs to the distribution decreases. The bands show that decrease in probability. When you do not know the type of distribution in your data, you should use a different algorithm.
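A common distribution-based method is the Gaussian mixture model, fit with expectation–maximization. The sketch below uses scikit-learn's `GaussianMixture` (an assumed library choice) on three synthetic Gaussian blobs; `predict_proba` exposes exactly the soft membership described above, which falls off with distance from each distribution's center:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three Gaussian blobs with different spreads (synthetic example data)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(60, 2)),
    rng.normal(6.0, 1.0, size=(60, 2)),
    rng.normal([-6.0, 6.0], 0.7, size=(60, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)           # hard assignment: most probable Gaussian
probs = gmm.predict_proba(X)      # soft assignment: one probability per Gaussian
```

Because the model assumes Gaussian components, it works well here; as the text notes, if you do not know the distribution family of your data, a different algorithm is a safer choice.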
Hierarchical Clustering
Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies.
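The bottom-up (agglomerative) variant can be sketched with scikit-learn's `AgglomerativeClustering` (an assumed library choice; the data and parameters are illustrative). Each point starts as its own cluster, and the two closest clusters are merged repeatedly, building the tree from the leaves up:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two compact 2-D blobs (synthetic example data)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),
    rng.normal(4.0, 0.3, size=(30, 2)),
])

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance; cutting the tree at n_clusters=2 yields flat labels
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
```

To inspect the full tree (dendrogram) rather than a flat cut, `scipy.cluster.hierarchy.linkage` and `dendrogram` are the usual tools.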