Supervised vs. Unsupervised Learning — Simplified.

Yash Gupta
Data Science Simplified
8 min read · Nov 5, 2020

If you’ve dived into Data Science, or specifically into Predictive Analytics, there is a high possibility that you’ve come across the terms ‘Supervised’ and ‘Unsupervised’ Learning. For those new to these terms, they describe the two main types of predictive modelling, or two families of Machine Learning algorithms, that work on different principles to give you information about your data.

These methods are closely associated with Predictive Analytics because they don’t just give you an insight into your data; they can also help you predict new values.

In this article, we’ll work through simple examples and ideas (plus a few small, optional code sketches) that will help you understand these types of algorithms without any prior knowledge. It is mainly aimed at individuals and students starting their journey towards data science who would like to understand how such algorithms actually function, without a lot of technical jargon.

What is Machine Learning?

Before we get to the types of Machine Learning algorithms covered later in this article, let’s go over a very brief explanation of what ML really is. Machine Learning (ML) is the process of training a computer to understand data and to build models, primarily for predicting new values (or future values, in the case of Time Series data) based on historical data and other factors. It is used across virtually every industry and has evolved into Deep Learning and Artificial Neural Networks, which bring computers closer than ever before to functioning like a human brain.

ML is applicable to any data where predictive modelling is needed, and it has two major categories: Supervised and Unsupervised Learning. It is one of the most in-demand skills and a must-have for Data Science enthusiasts.

For an easier and more in-depth understanding of how Machine Learning works, read my first article, shared below!

What are Supervised Learning and Unsupervised Learning?

Before diving deep into these concepts, let’s go over a simple understanding of what they are.

Supervised Learning is applicable when you know what you want to find or predict (the target fields) and which factors you will use to predict them. The target fields are, to an extent, explainable by the supporting fields present in the dataset.

Unsupervised Learning is applicable when you don’t have that kind of understanding of the data: all you know is that it is a set of numerical fields, and you have no supporting fields to explain a target. In such a case, you try to group the records together to build an initial understanding of the data using Unsupervised Learning.
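To make the distinction concrete, here is a minimal sketch in Python (using scikit-learn with made-up toy numbers, purely for illustration and not part of the original example): a supervised model is fit on both the features and a known target, while an unsupervised model is fit on the features alone.

```python
# A minimal sketch: supervised vs unsupervised fitting (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # supporting/feature fields
y = np.array([10.0, 20.0, 30.0, 40.0])       # known target field

# Supervised: the model learns the mapping from X to the known target y.
supervised = LinearRegression().fit(X, y)
print(supervised.predict([[5.0]]))           # predict a new value

# Unsupervised: no target is given; the model only groups the rows of X.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(unsupervised.labels_)                  # cluster assigned to each row
```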

Following are some examples of Supervised and Unsupervised Machine Learning Algorithms/processes widely used around us:

Supervised Learning:

  1. Linear Regression
  2. Logistic Regression
  3. Polynomial/Multilinear Regression
  4. Support Vector Machines
  5. Decision Trees
  6. XGBoost
  7. Random Forests

Unsupervised Learning:

  1. K Means Clustering
  2. Principal Component Analysis
  3. UMAP for Dimensionality Reduction

Random Forests (Supervised) and Principal Component Analysis (Unsupervised) Visualizations (Courtesy: images.google.com)

Understanding each of these in depth would take an individual article per algorithm, and they will be covered in articles to come. For now, let’s go over examples of Supervised and Unsupervised Learning to understand the two concepts in a simpler way.

Example for Supervised Learning Algorithms:

Consider that you own a company that runs supermarket outlets, giving customers access to a range of products comprising mainly essentials and durables for everyday living.

You are asked to predict sales during the Covid crisis, as the inventory will need to be managed accordingly. The company has provided all its data to assist your predictions.

Let us also consider that you know the following from the data provided to you:

Name of the Customer, Age, Contact Number, Transaction amount spent, Items Purchased, Category of Item Purchased, Date and Time of Purchase.

Let’s now see what the data really offers. At first look, it is fair to assume that the data has very few fields and that there isn’t much to find in it. But with a bit of feature engineering and some calculated fields, we can derive the following fields from the pre-existing ones (a small code sketch follows the list):

  1. Age Category from Age (by binning the age into discrete categories)
  2. Day of Purchase and Time of Purchase from Date and Time.
  3. Number of Transactions using the Contact Number or Name (as they are unique identifiers)
  4. Cumulative transaction amount or Total amount spent by individual customers (in the week or the month)
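As a rough illustration of the feature engineering above, here is a small pandas sketch. The DataFrame and column names (`customer`, `age`, `amount`, `purchased_at`) are hypothetical stand-ins for the fields listed earlier, not the company’s actual data.

```python
# A sketch of deriving the four calculated fields with pandas (toy data).
import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Ben", "Asha"],
    "age": [23, 41, 23],
    "amount": [250.0, 900.0, 120.0],
    "purchased_at": pd.to_datetime(
        ["2020-10-01 10:15", "2020-10-03 18:40", "2020-10-05 09:05"]),
})

# 1. Age Category by binning age into discrete buckets
df["age_category"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                            labels=["child", "young", "middle", "senior"])

# 2. Day and time of purchase from the timestamp
df["day_of_purchase"] = df["purchased_at"].dt.day_name()
df["hour_of_purchase"] = df["purchased_at"].dt.hour

# 3. Number of transactions per customer (name used as the identifier here)
df["n_transactions"] = df.groupby("customer")["amount"].transform("count")

# 4. Cumulative amount spent by each customer
df["total_spent"] = df.groupby("customer")["amount"].transform("sum")
print(df)
```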

Note: It is important to note that the Machine Learning process consists mostly of data preprocessing, i.e. preparing your data to fit into the ML model; the actual prediction makes up only around 10% of the work. Preprocessing the data and communicating the results make up the remaining 90%.

Once your data is set, you can identify which fields impact your target field. Calculating the correlation of each field with the target field on historical data will tell you what actually impacts your target. For example, Contact Number and Name will not be of use to your model and serve only as unique identifiers, since there is no way they can impact your understanding of sales in the past or the near future. The day of purchase, on the other hand, will probably give you an insight into when purchases happen and on which days sales are higher than usual.
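A minimal sketch of that correlation check, again with made-up numbers and a hypothetical `sales` column as the target:

```python
# Correlation of each numeric field with the target (toy data).
import pandas as pd

df = pd.DataFrame({
    "sales": [120, 340, 560, 90, 410],
    "n_transactions": [3, 8, 12, 2, 9],
    "hour_of_purchase": [10, 18, 19, 9, 17],
})

# Values near 0 suggest the field adds little to the model.
print(df.corr()["sales"].sort_values(ascending=False))
```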

Once your understanding of the data is complete, you can refine it and keep only those fields that will actually help your analysis. These form the foundation of your ML model.

The name Supervised implies that the model learns from historical data, i.e. the data we feed into it. Once the data is given to the system, it will try to analyze it, fit a model to it, and give out predictions based on that data.

Note: It might be necessary to scale your data so that no field dominates because of extreme differences in values, and it is often better to study your data and remove outliers for better accuracy.
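One possible sketch of that note, using a simple IQR rule to drop an extreme value (z-scores are another common choice) and scikit-learn’s StandardScaler to scale what remains; the figures are made up:

```python
# Drop an obvious outlier, then scale the remaining values (toy data).
import pandas as pd
from sklearn.preprocessing import StandardScaler

amounts = pd.Series([250.0, 310.0, 280.0, 265.0, 12000.0])   # one extreme value

q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = amounts[(amounts >= q1 - 1.5 * iqr) & (amounts <= q3 + 1.5 * iqr)]

scaled = StandardScaler().fit_transform(cleaned.to_frame())
print(scaled)   # each column now has mean 0 and unit variance
```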

In this example, let’s use Linear Regression as the model. The model will fit a line to the sales figures in your data based on the parameters provided to it, and that line will show you how your sales are likely to move in the near future.
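A minimal sketch of that idea with scikit-learn’s LinearRegression, fitting a line to made-up weekly sales figures and extending it two weeks ahead:

```python
# Fit a line to weekly sales and predict the next two weeks (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression

weeks = np.arange(1, 11).reshape(-1, 1)      # weeks 1..10 as the only feature
sales = np.array([200, 220, 215, 240, 260, 255, 280, 300, 310, 330])

model = LinearRegression().fit(weeks, sales)
print(model.predict([[11], [12]]))           # projected sales for weeks 11 and 12
```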

Linear Regression (Left) and LR vs Polynomial Regression Visualized (Right)

It is also possible that, for greater accuracy, Polynomial Regression or Multilinear Regression is used instead.
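A short sketch of the polynomial variant, assuming the same toy setup as above: the feature is expanded into powers first (degree 2 is an arbitrary choice here), and the same linear model is fit on top.

```python
# Polynomial Regression as a pipeline of feature expansion + linear fit (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

weeks = np.arange(1, 11).reshape(-1, 1)
sales = np.array([200, 205, 215, 232, 255, 284, 318, 360, 408, 462])  # curved trend

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(weeks, sales)
print(poly_model.predict([[11]]))
```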

Under Supervised Learning, the model will only perform as well as the data fed to it. There is always a possibility of overfitting or underfitting the data, and this has to be watched for carefully.

The data in such models is split into training data and testing data. For example, if your data has 6,000 entries, 5,000 of them might be used to train the model and 1,000 held back to test its predictions. Comparing the model’s predictions with the real values of those 1,000 entries yields various metrics that tell us how well the ML model has performed; based on them, the model is either retrained to perform better or deployed for use.
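A sketch of that 5,000/1,000 split with scikit-learn’s train_test_split and one example metric; the data here is synthetic, generated only to make the snippet runnable:

```python
# Train on 5,000 entries, evaluate on the held-out 1,000 (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 3))                          # 6,000 entries, 3 fields
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.5, size=6000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=0)               # 5,000 train / 1,000 test

model = LinearRegression().fit(X_train, y_train)
print(mean_absolute_error(y_test, model.predict(X_test)))
```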

More often than not, an ML model will not make it to deployment, owing to multiple factors; but when one does get deployed, it proves very helpful to the organization.

Example for Unsupervised Learning Algorithms:

Consider the same example, but now you don’t have any information except the sales figures and the customers’ names. You are asked to segment the customers and identify them accordingly, to help the company with its marketing strategies.

How will you go about solving this case provided you only have the sales figures?

This is where Unsupervised Learning algorithms come into the picture: you don’t have an understanding of the data and have no supporting fields to help with your analysis, or in this case, segmentation.

The most commonly used process in this case is ‘Clustering’. You decide on a number of clusters to look for in the data (the number of segments needed in this case), and the system uses a clustering technique combined with a distance measure to come up with the requisite clusters. Some families of clustering methods include:

  1. Agglomerative Hierarchical Clustering Methods
  2. Divisive Hierarchical Clustering Methods
  3. Grid Based Clustering Methods
  4. Partitioning Based Clustering Methods
  5. Density Based Clustering Methods

K Means Clustering Visualized using R (Left) and Python — Seaborn (Right)

In this example, we’ll use K Means Clustering, which belongs to the partitioning-based methods. Consider K = 3 clusters for this segmentation. The system starts by picking 3 random entries as initial cluster centres, then calculates a centroid (mid-point) for each cluster and assigns entries to the clusters accordingly. The distance from a point’s sales figure to each centroid (using measures such as Euclidean distance or Mahalanobis distance) is the deciding factor for which cluster an element joins.
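Here is a minimal sketch of that process with scikit-learn’s KMeans and K = 3, on made-up sales figures; the centroid updates described above all happen inside fit():

```python
# K Means with K = 3 on sales figures alone (toy data).
import numpy as np
from sklearn.cluster import KMeans

sales = np.array([[120], [130], [125],        # low spenders
                  [560], [540], [575],        # mid spenders
                  [1500], [1450], [1520]])    # high spenders

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sales)
print(kmeans.labels_)                         # cluster assigned to each customer
print(kmeans.cluster_centers_)                # final centroid of each cluster

# A new customer's sales figure is assigned to the nearest cluster:
print(kmeans.predict([[600]]))
```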

This process is iterative: the centroids keep changing and elements keep shifting between clusters until the K (here, 3) clusters settle, each containing entries that are highly related to one another (in ways that are not defined or understood beforehand).

When you have no information about how to segment the data and there are no target columns, clustering comes to the rescue: it can give you useful divisions of the dataset to be studied further. Any new entry that comes into the dataset can then be assigned to one of the clusters and targeted accordingly.

An easier understanding of the same comes from visualizing these 3 clusters and changing K to a new value in case the clusters are not apt; one way to sketch that check is shown below. The marketing strategies required in our example would then be applied to these clusters individually, as each cluster represents a group of customers who are closely related (based on just their sales figures).
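One rough way to sketch that "change K and check again" step is to compare a few values of K by their within-cluster spread (inertia); a sharp drop followed by a flattening is a common informal signal that the chosen number of clusters is apt. The toy sales figures are the same as in the previous sketch.

```python
# Compare a few values of K by within-cluster spread (toy data).
import numpy as np
from sklearn.cluster import KMeans

sales = np.array([[120], [130], [125], [560], [540], [575],
                  [1500], [1450], [1520]])

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sales)
    print(k, round(km.inertia_, 1))
```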

For any further doubts about Unsupervised or Supervised Learning methods, or for further resources, comment down below!

In further articles, we’ll try to understand each of these ML algorithms in detail and build the basic intuition behind them, so that choosing between them in varied situations becomes easier.

For more such articles, stay tuned with us as we chart out paths on understanding data and coding and demystify other concepts related to Data Science and Coding. Please leave a review down in the comments. It was a long article, thank you very much for reading it all the way here! Great going!


Yash Gupta
Data Science Simplified

Lead Analyst at Lognormal Analytics and self-taught Data Scientist! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss