“MLshorts” 16: What is Unsupervised Machine Learning?

A simple and clear explanation

Vasilis Kalyvas
Python’s Gurus
4 min read · Jun 19, 2024

Image created with Leonardo.Ai — A robot categorizes similar bricks into groups

What is Unsupervised ML? 🤔

Unsupervised learning is a type of Machine Learning in which the algorithm receives data without any labels or explicit instructions on what to do with it. A Supervised algorithm, in contrast, learns from labeled examples.

The goal of an Unsupervised ML model is to find patterns, structures, or relationships in the data on its own.

To see how it differs from Supervised learning, imagine that you have a group of football players with features such as age, team, total goals scored, total dribbles, number of appearances, speed, etc. You can formulate two scenarios:

  1. In Supervised learning, you could provide the market prices of all those players as labels to a model. The model learns the relationships between the features and the price. Then, when given the characteristics of a new player, it can predict his market value.
  2. In Unsupervised learning, the model isn’t given any price or label. Instead, it tries to categorize the players into groups based on their features. So it might create a group of attackers, another of midfielders, another of wingers, another of top performers, etc.

What are some Unsupervised algorithms? 📋

The most basic ones are:

  • K-Means Clustering: for grouping data points into a chosen number of groups (also known as “clusters”) based on the features of these data points
  • Hierarchical Clustering: for building a tree of clusters, where (in the common bottom-up variant) each data point starts in its own cluster and pairs of clusters are then merged when moving up the hierarchy
  • Principal Component Analysis (PCA): for reducing the dimensionality of the data while retaining most of the variation in the dataset
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): for grouping together closely packed points and marking points in low-density regions as outliers
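
To make the Hierarchical Clustering bullet a bit more concrete, here is a minimal pure-Python sketch of the bottom-up merging idea. It assumes "single linkage" (the distance between two clusters is the distance between their two closest points), which is only one of several possible choices; the function name `agglomerative` and the toy points are made up for illustration.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering sketch using single linkage (an assumption)."""
    # Every point starts in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # record of merges, i.e. the tree read from the bottom up

    def linkage(a, b):
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # ...and merge them into one.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical toy data: two tight pairs and one far-away point.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
final_clusters, merge_history = agglomerative(toy_points, n_clusters=2)
print(final_clusters)
```

Stopping at `n_clusters=2` is like cutting the tree at a certain height; running until one cluster remains would give the full hierarchy in `merge_history`.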

How does an Unsupervised model actually “learn”? 📖

As we asked in the previous story: we tend to use this word without really thinking about it, but what does “learning” actually mean?

How could a model create clusters and categorize the data? And who decides its accuracy and correctness if there are no labels to learn from?

Unsupervised models also learn patterns in the data, but in a completely different way from Supervised ones: with no labels to guide them, they try to figure out these patterns and structures on their own.

For Clustering with K-Means, it generally goes like this:

  1. The model is given training data with their features but no corresponding labels.
  2. The model starts by randomly selecting some data points as the centroids, which are the centers of the clusters (the groups). The number of clusters is usually predefined.
  3. Every other data point is assigned to its nearest centroid, based on a distance metric. Now the clusters are formed and contain all data points.
  4. The model recalculates each centroid as the mean of the points in its cluster and, if the centroids have moved, reassigns the data points to their nearest cluster.
  5. It repeats steps 3 and 4 until all clusters are stable.
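
The five steps above can be sketched in a few lines of plain Python. This is a simplified illustration rather than a production implementation: the function name `kmeans`, the `seed` parameter and the toy player data are assumptions made for the example, and real libraries add smarter initialization and several restarts.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical toy data: (goals, dribbles) for six players.
players = [(1, 1), (2, 2), (3, 3), (10, 10), (11, 11), (12, 12)]
labels, centroids = kmeans(players, k=2)
print(labels)     # cluster id per player (which group gets id 0 depends on the seed)
print(centroids)  # one centroid per cluster
```

Note that the result of K-Means can depend on the random starting centroids, which is one reason why the human evaluation step matters.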

This is heavily simplified, but it’s the general concept. Since there are no labels, the algorithm has no ground truth to check its results against, so the final “ok” is given by us, the humans. We can evaluate its performance with tools such as the Silhouette Score, which measures how similar each point is to its own cluster compared with the other clusters, and the Elbow Method, which helps determine the optimal number of clusters.
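
As a rough illustration of the Silhouette Score, here is a minimal pure-Python sketch. For each point it compares `a`, the mean distance to the other points of its own cluster, with `b`, the mean distance to the points of the nearest other cluster; the score is `(b - a) / max(a, b)`, averaged over all points. The function name and the toy data are assumptions for the example, and the sketch expects at least two clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data with one sensible and one poor grouping.
sample = [(1, 1), (2, 2), (3, 3), (10, 10), (11, 11), (12, 12)]
good_score = silhouette_score(sample, [0, 0, 0, 1, 1, 1])
poor_score = silhouette_score(sample, [0, 1, 0, 1, 0, 1])
print(good_score, poor_score)
```

A score close to 1 suggests compact, well-separated clusters, while values near 0 or below suggest that points sit between clusters or in the wrong one.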

In PCA, the data is transformed into a new coordinate system whose axes are new “features” known as components. They are ordered so that the first component (“first principal component”) captures the largest variance, the second component (“second principal component”) captures the second largest variance, and so on.
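
To see what this means in practice, here is a small sketch of PCA restricted to data with exactly two features, where the math has a simple closed form. The function name `pca_2d` and the toy data are assumptions for the example; for more features you would normally rely on a linear algebra library instead.

```python
import math


def pca_2d(points):
    """PCA for two-feature data via the 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix (closed form), largest first.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    variances = (half_trace + spread, half_trace - spread)

    # Direction of the first component; the second is perpendicular to it.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    pc2 = (-math.sin(angle), math.cos(angle))

    # Project every centered point onto the two components.
    projected = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1]) for x, y in centered]
    return variances, (pc1, pc2), projected


# Hypothetical toy data where the second feature is exactly twice the first.
variances, (pc1, pc2), projected = pca_2d([(1, 2), (2, 4), (3, 6), (4, 8)])
print(variances)  # variance captured by the first and second component
print(pc1)        # direction of the first component
```

In this toy case all the variation lies along a single line, so the first component captures essentially all of the variance and the second one could be dropped, which is exactly the idea behind dimensionality reduction.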

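DBSCAN, on the other hand, identifies core points (points that have many neighbors nearby) and expands the clusters from these core points by grouping together the points that are closely packed around them. Points in low-density regions that belong to no cluster are treated as outliers.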
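
The DBSCAN idea can also be sketched in plain Python. In this simplified version, `eps` (the neighborhood radius) and `min_pts` (how many neighbors make a point a core point) follow the usual naming of the algorithm's two parameters, while the function name and the toy data are assumptions for the example. Noise points get the label -1.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not visited yet
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        # i is a core point: start a new cluster and grow it outwards.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is also a core point: keep expanding
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
dense_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
dbscan_labels = dbscan(dense_points, eps=1.5, min_pts=3)
print(dbscan_labels)  # -> [0, 0, 0, 0, 1, 1, 1, -1]
```

Unlike K-Means, the number of clusters is not given in advance here: it follows from the density of the data and the chosen `eps` and `min_pts`.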

So, to make it clear, “learning” in Clustering is:

  • Initialization of centroids
  • Formation of clusters
  • Re-calculation of new centroids
  • Re-formation of clusters (until stabilization)
  • Evaluation by humans

Why is Unsupervised learning important? 💎

Unsupervised learning discovers hidden structures in data without needing any labels. It finds patterns in complex datasets and is super useful in applications like Customer Segmentation, Market Basket Analysis, Document Clustering, Image Compression, etc. It is one of the most fundamental concepts for anyone starting to explore the magnificent ML world and a powerful tool in every Data Scientist’s arsenal.

Was this article valuable for you? Follow, subscribe, connect on LinkedIn/Kaggle and see you in my next “MLshorts” article! 👋
