
Unsupervised Learning: The most popular algorithms.

Iva @ Tesla Institute · Published in Artificialis · Jan 19, 2023

Unsupervised learning is a machine learning technique that enables a model to learn from unlabeled data. Unlike supervised learning, unsupervised learning does not require labeled data for training, which makes it a valuable tool for a wide range of applications. In this article, I’ll summarize unsupervised learning methods, their strengths and weaknesses, and current research in this field.

CLUSTERING

One of the most popular unsupervised learning methods is clustering.

Clustering algorithms group similar data points together and identify patterns within the data. For example, k-means clustering is a widely used method that groups data points into k clusters based on their similarity. Another popular method is hierarchical clustering, which creates a tree-like structure of clusters. We can use clustering in a variety of applications, such as image segmentation, anomaly detection, and market segmentation.

  • K-MEANS CLUSTERING

K-means clustering is a simple yet powerful algorithm for grouping similar data points together. It is sensitive to the initial centroids and assumes roughly spherical clusters, but it handles large datasets well and is easy to parallelize. It is widely used in applications such as image and speech recognition, market segmentation, and customer segmentation. Care should be taken when selecting the number of clusters, and the elbow method is a useful tool in that regard.
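
As a rough illustration, here is a minimal k-means sketch with scikit-learn, using synthetic data and the elbow method to pick k; the dataset, k = 4, and all parameter values are illustrative choices, not recommendations.

```python
# A minimal k-means sketch with scikit-learn (assumed available),
# using synthetic data; k = 4 and all other values are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Elbow method: fit k-means for several values of k and look for the
# point where the inertia (within-cluster sum of squares) stops
# dropping sharply.
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
print(inertias)

# Fit the final model with the chosen k and read off the cluster labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
```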

  • HIERARCHICAL CLUSTERING

Hierarchical clustering is an unsupervised machine learning algorithm that groups similar objects into clusters based on their similarity. The algorithm starts by treating each object as its own cluster and then repeatedly merges the two closest clusters until a predefined stopping criterion is met.

There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative hierarchical clustering starts with each object as its own cluster and then repeatedly merges the closest two clusters until the desired number of clusters is reached. Divisive hierarchical clustering starts with all objects in one cluster and then repeatedly splits the cluster into smaller clusters until each cluster contains only one object.

One of the key advantages of hierarchical clustering is that it can handle non-linearly separable data, which is data that cannot be easily separated into distinct groups by a linear boundary. Hierarchical clustering can also produce a dendrogram, which is a tree-like diagram that illustrates the hierarchical structure of the clusters. This can be useful for visualizing the relationships between the objects and the clusters.
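
Below is a minimal sketch of agglomerative clustering with a dendrogram, assuming SciPy and scikit-learn are available; the synthetic data, Ward linkage, and the cut into three clusters are just illustrative.

```python
# A minimal agglomerative clustering sketch with SciPy (assumed
# available); the synthetic data, Ward linkage, and the cut into
# three clusters are illustrative choices.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the merge tree bottom-up; Ward linkage merges the pair of
# clusters that least increases the within-cluster variance.
Z = linkage(X, method="ward")

# The dendrogram visualizes the hierarchy of merges.
dendrogram(Z)
plt.show()

# Cut the tree into a fixed number of flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
```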

DIMENSIONALITY REDUCTION

While more features can yield more accurate results, a high-dimensional feature space can also hurt the performance of machine learning algorithms (e.g. through over-fitting) and make datasets difficult to visualize. Dimensionality reduction is a technique used when the number of features, or dimensions, in a dataset is too high. It reduces the number of inputs to a manageable size while preserving the integrity of the dataset as much as possible, and it is commonly used in the data preprocessing stage.

Dimensionality reduction algorithms aim to reduce the number of features in a dataset while preserving the most important information. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two popular methods.

  • PCA transforms the data into a new coordinate system in which the first principal component explains the most variance in the data. The method uses a linear transformation to create a new representation of the data, yielding a set of “principal components.” The first principal component is the direction that maximizes the variance of the dataset. The second principal component also captures as much variance as possible, but it is completely uncorrelated with the first, yielding a direction that is perpendicular, or orthogonal, to the first component. This process repeats up to the number of dimensions, with each subsequent principal component being the direction of greatest remaining variance that is orthogonal to the prior components.
  • LDA aims to find a linear combination of features that maximizes the separation between different classes. Note that, unlike PCA, LDA uses class labels, so it is strictly a supervised technique.

We can use dimensionality reduction for visualization, feature selection, and noise reduction.
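
As a small example, this is roughly what PCA looks like with scikit-learn; the Iris dataset and the choice of two components are placeholders for illustration.

```python
# A minimal PCA sketch with scikit-learn (assumed available); the Iris
# dataset and the choice of two components are illustrative.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# PCA is sensitive to feature scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two orthogonal directions of highest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
```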

  • VARIATIONAL AUTOENCODERS

Generative models are another class of unsupervised learning methods that aim to learn the underlying probability distribution of the data. The autoencoder and the variational autoencoder are popular models that can be used for anomaly detection, data generation, and representation learning. An autoencoder is a neural network that learns to reconstruct the input data from a lower-dimensional representation, called the bottleneck or latent variable. A variational autoencoder (VAE) is a variant of the autoencoder that allows new samples to be generated from the learned probability distribution.
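
To give a flavor of the “variational” part: the VAE encoder outputs a mean and a (log-)variance rather than a single code, and a latent sample is drawn with the reparameterization trick so the sampling step stays differentiable. A rough NumPy sketch, with illustrative names:

```python
import numpy as np

# Reparameterization trick (sketch): sample z from N(mu, sigma^2) while
# keeping the randomness outside the learned parameters, so the sampling
# step stays differentiable with respect to mu and log_var.
def sample_latent(mu, log_var):
    eps = np.random.standard_normal(mu.shape)  # noise, independent of the network
    return mu + np.exp(0.5 * log_var) * eps    # z = mu + sigma * eps

# New data points are generated by decoding samples from the prior,
# e.g. z ~ N(0, I), with the trained decoder.
```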

Often, labels are hard to obtain because they require human review. Unlabeled data, however, can still be used to learn patterns and feature representations. For example, you can build a better classifier by using an autoencoder as a feature extractor, or, if you have no labels at all, you can use an autoencoder to detect anomalies. Autoencoders can also fill in missing values.

How does it work? First, the network compresses its input data into a lower dimension; then it tries to use this lower-dimensional representation to recreate the original input. The difference between the attempted reconstruction and the original input is called the reconstruction error. By training the network to minimize this reconstruction error on your dataset, it learns to exploit the natural structure in your data to find an efficient lower-dimensional representation. The encoder approximates a function that maps the data from the input space into a lower-dimensional coordinate system, which means we don’t need every part of the input space to represent the data. It is the encoder’s job to take the data and compress it into a meaningful lower dimension.

[Figure: Variational Autoencoder structure]

Now, let’s move to the decoder: the decoder attempts to recreate the original input using the output of the encoder; in other words, it tries to reverse the encoding process.

The point of the middle layer in an autoencoder is to squeeze the representation into an even smaller dimension. This forces information loss, which is the key concept behind the embedding layer of an autoencoder: the decoder only ever sees imperfect information, and the whole network is trained to minimize the reconstruction error. The encoder and decoder are therefore forced to work together to find the most efficient way to condense the input data into a lower dimension. If there were no information loss between the encoder and decoder, the network could simply learn to multiply the input by one and get a perfect reconstruction, which would obviously be useless. The only way autoencoders work is by enforcing this information loss with the network bottleneck, which means we need to tune the architecture so that the inner dimension is smaller than the dimension needed to fully express the data.
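
Putting the pieces together, here is a minimal autoencoder sketch in Keras; the 784-dimensional input (e.g. flattened 28x28 images), the 32-dimensional bottleneck, and the layer sizes are all assumptions made for illustration.

```python
# A minimal autoencoder sketch in Keras (assumed available); all layer
# sizes are illustrative. The input is assumed to be 784-dimensional,
# e.g. flattened 28x28 images scaled to [0, 1].
from tensorflow.keras import layers, models

input_dim, bottleneck_dim = 784, 32

# Encoder: compress the input into the lower-dimensional latent space.
encoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(bottleneck_dim, activation="relu"),
])

# Decoder: try to reverse the encoding and reconstruct the input.
decoder = models.Sequential([
    layers.Input(shape=(bottleneck_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = models.Sequential([encoder, decoder])

# Minimizing the reconstruction error forces the bottleneck to keep
# only the most informative structure in the data.
autoencoder.compile(optimizer="adam", loss="mse")

# The input is also the target, since the network reconstructs itself:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)
```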

  • DENOISING AUTOENCODERS

A denoising autoencoder is a type of neural network that is trained to remove noise from input data. The network is trained to reconstruct the original, noise-free input from a corrupted version of the input. The denoising process is performed by the encoder portion of the network, which maps the input to a hidden representation, and the decoder portion, which maps the hidden representation back to the original input.

The idea behind denoising autoencoders is to learn a representation of the input that is robust to noise. This is achieved by adding noise to the input data during training and then training the network to reconstruct the original, noise-free input. By doing so, the network learns to identify and remove the noise from the input, while still preserving the important features of the input.

Denoising autoencoders can be used in a variety of applications such as image denoising, where the network is trained to remove noise from images, and anomaly detection, where the network is trained to identify patterns that deviate from the normal data.

Another important property of denoising autoencoders is their ability to learn a robust and compact representation of the input. This is because the encoder is forced to learn a representation that is robust to noise, which in turn makes the representation more compact and efficient.
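
As a sketch of how the denoising setup differs from a plain autoencoder, the inputs can be corrupted with Gaussian noise while the clean data stays the target; the noise level below is an arbitrary illustrative choice, and x_train / autoencoder refer to the earlier sketch.

```python
# Denoising setup (sketch): corrupt the inputs with Gaussian noise but
# keep the clean data as the reconstruction target. The noise level of
# 0.2 is arbitrary; x_train and autoencoder refer to the earlier sketch.
import numpy as np

def add_noise(x, noise_std=0.2):
    noisy = x + noise_std * np.random.standard_normal(x.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep values in the valid input range

# Training pairs are (noisy input -> clean target):
# x_train_noisy = add_noise(x_train)
# autoencoder.fit(x_train_noisy, x_train, epochs=20, batch_size=256)
```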

CONCLUSION:

Self-supervised learning is a recent research area that aims to learn from unlabeled data by designing pretext tasks that can be solved using the same data. These tasks are designed to learn useful representations that can be transferred to other tasks. For example, Contrastive learning and Generative Pre-training are self-supervised methods that have been used in computer vision and natural language processing (NLP) tasks with success.

In conclusion, unsupervised learning is a powerful tool for a wide range of applications, including clustering, dimensionality reduction, generative models, and self-supervised learning. Each method has its own strengths and weaknesses, and the choice of method depends on the specific problem and the type of data. Unsupervised learning is an active research area, and new methods and applications are being developed.

RESOURCES:

  • IBM
  • Papers With Code
  • Displayr
