Supervised vs. Unsupervised Learning
Have you ever wondered how Facebook can tag your friends in your photos? Or how Amazon guesses what other products you might be interested in? These are examples of supervised and unsupervised machine learning. Supervised and unsupervised learning are two of the three major branches of machine learning (the other is reinforcement learning), but what’s the difference? Why do we need both? Which one is better? Let’s get into it!
First of all, let’s identify the primary difference between the two methods. Supervised learning is when a model is created with a labeled set of training data: if I want to make a model that distinguishes between photos of giraffes and horses, I will input a lot of pictures that are each labeled as either a horse or a giraffe. For unsupervised learning, all of the training data is unlabeled, so we would just input all of the pictures. Instead of a class of giraffes or horses, the model will generate classes or categories based on the patterns and features it observes. For example, it may cluster the photos by color.
Supervised Learning
So, supervised learning uses labeled data sets. This provides the system with a feedback mechanism, so the model can improve its effectiveness. The model generated will make predictions based on that labeled data set.
Think of it like this: when you teach little kids how to read the alphabet, let’s say you show them a letter and say what it is — this is the labeled dataset. Then when they are on their own, they will identify letters based on what you said and the important attributes they associated with it. So if you teach them the letter ‘b’, the attributes they identify could be a vertical line and a circle on the bottom. Now when they see an ‘a’, ‘b’, and ‘c’ (assuming these are the letters you taught them) they will have a reasonable success rate identifying which is which.

The scenarios we’ve gone over so far are examples of classification for discrete distributions, which is a major use of supervised learning. However, it’s not the only one. Supervised learning is also used for continuous distributions, where the system must predict a quantitative variable rather than a categorical variable. This is called regression. It’s like a regression line for some points like you might have done in high school algebra, but predicting more complicated data trends with more variables.
There are two major considerations to determine the optimal model for supervised learning: model complexity and bias-variance tradeoff.
First, model complexity. When you don’t have a lot of data, you don’t want your model to be too complicated because then it may not extend accurately to new data. Think about if you have three points and you want to determine a model to predict where other points might lie. You could draw any number of curves through those points and have all sorts of nonsense happen between the known points. But why would you do that? You want to make the simplest curve, the lowest possible polynomial degree, so that it will generalize the information better. When a model is too specific to the training data, this is called overfitting. Your algorithm basically just memorized the training data.
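The three-points idea above can be sketched in a few lines. This is a toy example with invented numbers: four points that roughly follow y = x, fit once with a straight line and once with a degree-3 polynomial that threads through every point exactly.

```python
import numpy as np

# A hypothetical four-point "training set" that roughly follows y = x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 2.1, 2.9])

# The simplest reasonable model: a straight line (degree 1).
line = np.polyfit(x, y, deg=1)

# A degree-3 polynomial passes through every training point exactly:
# it has "memorized" the data, noise and all.
cubic = np.polyfit(x, y, deg=3)

# Both fit the training points, but look just past them:
print(np.polyval(line, 4.0))   # close to 4, the real trend
print(np.polyval(cubic, 4.0))  # noticeably off: the overfit curve swings away
```

The cubic has zero error on the training points, yet the line generalizes better, which is exactly the overfitting story above.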

The other consideration is the bias-variance tradeoff. The bias is the systematic error of your model, and the variance is how much your model’s predictions change from one training set to another. (Remember accuracy vs. precision from middle school science? Bias is like your inaccuracy, and variance is like the imprecision of your model.) So basically, if you have a high bias and low variance, your model may be consistently wrong 15% of the time. If you have a low bias and high variance, your model may be wrong anywhere between 2% and 40% of the time, depending on the data: it might be wrong only 2% of the time when the new data is closest to your training data, which suggests the model was overfit to that data.
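You can watch this tradeoff happen in a small simulation (all numbers invented): fit a very rigid model and a very flexible one to many noisy training sets drawn from the same source, then compare their predictions at a single point.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin           # the "real" pattern behind the data
x_test = 1.0              # where we compare predictions
results = {0: [], 5: []}  # degree-0 (rigid) vs degree-5 (flexible) fits

for _ in range(500):                       # 500 independent training sets
    x = rng.uniform(0, 3, 8)
    y = true_f(x) + rng.normal(0, 0.3, 8)  # noisy labels
    for deg in results:
        results[deg].append(np.polyval(np.polyfit(x, y, deg), x_test))

for deg, preds in results.items():
    preds = np.array(preds)
    bias = preds.mean() - true_f(x_test)   # systematic error
    var = preds.var()                      # spread across training sets
    print(f"degree {deg}: bias {bias:+.2f}, variance {var:.3f}")
```

The rigid model is wrong in the same direction every time (high bias, low variance); the flexible model is right on average but scattered (low bias, high variance).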
If you want to get into the basic math of it, let’s go! If you’re here to just get a preliminary understanding, you can skip this section; it’s basically the mathematical explanation of the consideration above.
~
So you have a labeled dataset, each point with multiple parameters and a known output, right? If x is the vector representation of these parameters and y is your output, we want to find a function g: X → Y given our r labeled data points: {(x₁, y₁), (x₂, y₂), …, (xᵣ, yᵣ)}. One way to represent g is through a scoring function f: X × Y → R: g(x) returns the y that gives the highest value of f(x, y).
Often (though not always) supervised learning models are probabilistic. A conditional probability model scores each output directly as P(y|x), the likelihood of each output given your input. Alternatively, a joint probability model f(x, y) = P(x, y) measures how likely the pair is overall. Either can serve as the scoring function, and the best value of y is then recovered by going back to g.
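As a tiny sketch of that argmax idea (the labels and probabilities here are invented): suppose a conditional model has already scored one photo against every possible label. Then g just picks the label with the highest score.

```python
# Hypothetical conditional probabilities P(y | x) for a single input photo.
p_given_x = {"giraffe": 0.7, "horse": 0.2, "zebra": 0.1}

# g(x) returns the y with the highest score f(x, y).
def g(scores):
    return max(scores, key=scores.get)

print(g(p_given_x))  # giraffe
```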
Both f and g are just one of many possible functions in the spaces of all candidate functions, F and G respectively. To choose them, we use the considerations above. Empirical risk minimization chooses the function that best fits the training data, the lowest-bias option. Structural risk minimization controls the bias-variance tradeoff.
A loss function L measures how well a function fits a single training point, taking as inputs the known y and the y predicted by your function. The risk function R measures how well g does overall by averaging the loss over all the known data points. Empirical risk minimization simply picks the g with the smallest R(g).
However, empirical risk that is too minimized for a limited dataset results in overfitting, so structural risk minimization attempts to prevent that. How does it do that? It incorporates a regularization penalty, which basically adds a preference to simpler functions over more complex ones into your optimization, thus reducing the probability that you will end up with an overfitted model.
Think of C(g) as measuring the complexity of your model. Then the optimal model is the g that minimizes J(g) = R(g) + λC(g), where λ is a parameter you choose to balance the tradeoff. Note how the tradeoff is still determined by human input. And that’s how your algorithm determines the optimal model! Isn’t that cool?
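Here is a toy version of that J(g) selection (all numbers invented): candidate models are polynomials of increasing degree, R is the average squared loss on the training data, and C is simply the degree.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 3, 10)
y = x + rng.normal(0, 0.2, 10)  # noisy samples of the simple trend y = x

lam = 0.05  # the human-chosen tradeoff parameter (the lambda above)

def J(deg):
    coeffs = np.polyfit(x, y, deg)
    R = np.mean((y - np.polyval(coeffs, x)) ** 2)  # empirical risk: average squared loss
    C = deg                                        # crude complexity measure
    return R + lam * C

best = min(range(1, 8), key=J)
print(best)  # the penalty steers the choice toward a simple model
```

Without the λC(g) term, the highest degree always wins because it fits the training noise best; with it, the extra complexity has to pay for itself.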
~
The primary uses of supervised learning are risk assessment, image classification, fraud detection, and visual recognition. The algorithms commonly used are decision trees, logistic regression, support vector machines, naive Bayes, and artificial neural networks. The best algorithm to use for a problem depends on its specific needs.
Now, there are some serious drawbacks to using labeled data for training. First of all, you need high-quality data. Your training set has to be representative of any data you may input. Remember the kids you taught what a ‘b’ looks like? Their process of classification will allow them to correctly identify a ‘b’ versus an ‘a’, but if you suddenly show them a ‘d’ without having taught it, they may misidentify it as a ‘b’.
This shows us the first major problem with supervised learning. With a labeled dataset, there are only a set number of classes the algorithm can discern between. If you introduce something different, it will be incorrectly categorized because the system cannot handle unknowns. The part of the pattern that your chosen model is inherently unable to capture is known as deterministic noise.
You also have the issue of stochastic noise. These are the unaccounted-for random fluctuations or measurement errors in your training set that add irrelevant complexity to your model. The stochastic noise of a dataset will never be 0. Since the output is determined entirely from the training data, the training data must be reliable.
Unsupervised Learning
Unsupervised learning works in a fundamentally different way. Since there’s no feedback mechanism or known classes, the algorithm organizes the data by finding patterns and inferring its natural structure.
Instead of a conditional probability p(y|x), your model intends to infer the a priori probability distribution p(x), the natural distribution of the inputs themselves. So, it’s like if the child does not have the adult saying the names of the letters. If they were then told to organize the letters, they would try to identify commonalities, like maybe color or size. If a new letter is introduced, they would choose its category by those same characteristics.
Since there are no preassigned categories, the organization of data by qualitative variables is called clustering rather than classification. Across the many parameters, points with certain commonalities are grouped into clusters, and the data scientist can control how coarse or fine those groupings are by choosing the number of clusters. The way unsupervised learning makes predictions for continuous distributions is called association, which uses the probability of co-occurrence of variables in the dataset.
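A bare-bones clustering sketch (with invented data) shows the idea: two blobs of unlabeled points and a tiny k-means loop. Note that nobody tells the algorithm what the groups mean; the human only picks k.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two obvious blobs of unlabeled points: no class names anywhere.
blob_a = rng.normal([0.0, 0.0], 0.3, size=(20, 2))
blob_b = rng.normal([5.0, 5.0], 0.3, size=(20, 2))
points = np.vstack([blob_a, blob_b])

k = 2                      # the data scientist chooses the number of clusters
centers = points[[0, -1]]  # deterministic start: one point from each end

for _ in range(10):
    # assign every point to its nearest center...
    labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(1)
    # ...then move each center to the mean of its cluster
    centers = np.array([points[labels == i].mean(0) for i in range(k)])

print(centers)  # the centers settle near the middles of the two blobs
```

The algorithm never hears the words "blob A" or "blob B"; the categories emerge from the geometry of the data alone.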

Unsupervised learning is primarily used for exploratory analysis and dimensionality reduction, both of which are normally starting points for data analysis.
Exploratory analysis automatically identifies structures and data trends that may be impractical, or even impossible, for humans to propose. This analysis is helpful for an initial insight that data analysts can then use to form a hypothesis.
Dimensionality reduction is also helpful for making hypotheses because it represents the data with fewer features, the important ones. So if you had a dataset of plane punctuality with five variables for each datapoint (let’s say inches of rain, month, day of the week, number of flights, and wind speed), it might eliminate inches of rain, month, and day of the week as unimportant variables. Just imagine how helpful this kind of reduction could be if you had a hundred variables for each point!
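In the same spirit as the flight example (with invented data), here is a minimal dimensionality-reduction sketch using principal component analysis via the SVD: three recorded variables, one of which is nearly constant and therefore carries almost no information.

```python
import numpy as np

rng = np.random.default_rng(1)
# 100 data points with 3 recorded variables; the third is just
# tiny noise around a constant, so it is essentially uninformative.
informative = rng.normal(0, 3.0, size=(100, 2))
useless = rng.normal(0, 0.01, size=(100, 1))
data = np.hstack([informative, useless])

# PCA via the SVD: directions with the largest singular values
# explain the most variance in the data.
centered = data - data.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()
print(explained)  # almost all the variance lives in the first two components
```

A data analyst could safely drop the third variable and keep nearly all the signal; with a hundred variables, the same calculation tells you which handful actually matter.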
One difficulty of unsupervised learning is evaluating its performance. Remember how we balanced our supervised learning model? All of those calculations relied on the known output values. One way you can test unsupervised learning is to remove the labels from labeled data and see if the groupings your model produces correlate with the correct general ones.
The use of unsupervised learning you may be most familiar with is market basket analysis. This is how Amazon recommends something else to buy along with what you’ve added to your cart already. Other common uses of unsupervised learning are semantic clustering, delivery store optimization, and identifying accident-prone areas. Some major unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm.
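The core of market basket analysis is co-occurrence counting, which can be sketched in a few lines (the carts and item names here are made up): pairs of items that keep showing up in the same cart become recommendation candidates.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts; nothing here is labeled.
carts = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

# Count how often each pair of items appears in the same cart.
pairs = Counter()
for cart in carts:
    pairs.update(combinations(sorted(cart), 2))

# The most frequent pairs are the "bought together" suggestions.
print(pairs.most_common(2))
```

Real systems like the Apriori algorithm extend this idea to larger item sets and prune pairs that are frequent only by chance, but the principle is the same counting of co-occurrence.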

Pros and Cons

One reason unsupervised learning cannot go very far yet is that people don’t trust the resulting model if they can’t see exactly what it does. I mean, would you? We live in a time when tech is moving quickly, and to keep us safe, there are strict regulations on the use of algorithms to make sure that everything can be examined by humans.
You may think: why don’t we combine the two? That’s essentially what semi-supervised learning is. The computer is trained with some labeled data and some unlabeled data (usually much more unlabeled data). This gains accuracy over unsupervised learning and radically decreases labeling costs compared to supervised learning. However, it’s still unclear what such algorithms really do, so they cannot gain quite as much popular approval. Also, this method is not much better than plain supervised learning unless there is a nontrivial relationship between the distribution of the unlabeled data and the output.
Direct comparison of the two doesn’t really make sense, though. Both are useful for their respective purposes; how do you compare an apple and a monkey? Just like choosing an algorithm, choosing your method depends on what you need to accomplish.
TL;DR
Supervised learning trains on labeled data to classify inputs into known categories or predict continuous outputs (regression); unsupervised learning finds its own structure in unlabeled data through clustering and association. Neither is better in the abstract: choose based on the data you have and what you need to accomplish.