Supervised Learning

Mike Yung
5 min read · Sep 5, 2016


Machine learning — the hot buzzword that many aspiring data scientists are most eager to learn as they pursue this field. AI pioneer Arthur Samuel described it as the “field of study that gives computers the ability to learn without being explicitly programmed”. It wasn’t until this week that the magic really started to happen: theory finally put into practice, and regular code finally given algorithmic structure.

Machine learning is broadly split into supervised and unsupervised learning. Simply put, supervised learning is used when we know what we are trying to predict (a y variable exists). Within supervised learning, there are, generally speaking, two types of models: regressors and classifiers. Regressors aim to predict a continuous outcome (e.g. temperature in degrees Celsius, income in Euros, weight in kg), while classifiers aim to predict a categorical outcome (whether or not it rains, whether or not a customer will churn in the next 30 days, the positive/negative/neutral sentiment of a given tweet, etc.).

I suppose linear regression is still technically machine learning, but our familiarity with the basic idea (with concepts tracing back to the high-school y=mx+b) made it feel more rudimentary than it should have. That is not to discredit linear regression, however: it is one of the models least prone to overfitting, and it has great benefits in its interpretability — an aspect that typically diminishes as the complexity of your model increases.
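To make the y=mx+b connection concrete, here is a minimal sketch, assuming scikit-learn and some made-up data (the variable names and numbers are purely illustrative). The fitted slope and intercept read directly as “each extra unit of x adds m to the prediction”, which is exactly the interpretability benefit mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: income (in thousands) as a rough linear function of years of experience
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(100, 1))            # years of experience
y = 35 + 3.2 * X[:, 0] + rng.normal(0, 5, 100)   # income with some noise

model = LinearRegression().fit(X, y)

# The interpretability win: the fitted parameters map straight back to y = mx + b
print("slope (m):", model.coef_[0])        # ~3.2 -> each extra year adds ~3.2k
print("intercept (b):", model.intercept_)  # ~35  -> baseline income at zero experience
```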

Logistic Regression

Upon learning linear regression, it was only natural to then talk about logistic regression. It’s somewhat misleading to call it logistic regression when in reality it serves the purpose of a classifier. The root idea is similar to linear regression, but after we acquire the beta coefficients for each predictor, a link function maps the resulting linear combination of predictors onto an S-shaped sigmoid curve, squashing it into a probability between 0 and 1. In the graphic below, x is some arbitrary predictor and y is the probability that we will predict a particular outcome (in this case, red or yellow). The default rule sets a threshold at 0.5: if y > 0.5, we predict red; if y <= 0.5, we predict yellow. The red and yellow circles represent the actual data points, while the blue curve represents our model’s probabilistic predictions.

Logistic Regression: S-shaped sigmoid function
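Here is a minimal sketch of that setup, assuming scikit-learn and some made-up data for a single predictor x with noisy red/yellow labels (all names and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one predictor x in [0, 10]; points above ~5 tend to be "red" (1), below "yellow" (0)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = (x[:, 0] + rng.normal(0, 1.5, 200) > 5).astype(int)  # noisy labels, so classes overlap near the boundary

clf = LogisticRegression().fit(x, y)

# The model outputs probabilities via the sigmoid applied to (beta0 + beta1 * x)
probs = clf.predict_proba([[3.0], [5.0], [8.0]])[:, 1]
print("P(red | x = 3, 5, 8):", probs.round(2))

# The default decision rule simply thresholds that probability at 0.5
print("predicted classes:", clf.predict([[3.0], [5.0], [8.0]]))
```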

With such a rule in place, it’s easy to see that our model isn’t perfect 100% of the time. When x is roughly greater than 5 (the point where y > 0.5), we predict red, but there are in fact a few yellow circles in the x ∈ [5, 10] zone, so we would have misclassified those yellow circles as red ones. The same holds true in the opposite direction. This introduces the idea of Type I and Type II errors, a powerful tension that exists in all domains of decision-making (think cancer diagnosis, criminal conviction, portfolio selection, etc.).

Confusion Matrix: the tension between Type I and Type II errors

In a nutshell, if you want to be more confident of catching the ‘positives’, you increase your probability of mislabeling negatives as positives; the logic holds in the opposite direction as well. The confusion matrix is applicable to all types of classifier models.
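Sticking with the same illustrative data, a quick sketch of how the confusion matrix falls out in scikit-learn (treating red as the ‘positive’ class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Same made-up red/yellow data as the sketch above
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = (x[:, 0] + rng.normal(0, 1.5, 200) > 5).astype(int)  # 1 = red, 0 = yellow

clf = LogisticRegression().fit(x, y)
tn, fp, fn, tp = confusion_matrix(y, clf.predict(x)).ravel()

print("true negatives: ", tn)  # yellow correctly called yellow
print("false positives:", fp)  # yellow called red   (Type I error)
print("false negatives:", fn)  # red called yellow   (Type II error)
print("true positives: ", tp)  # red correctly called red
```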

k-Nearest Neighbors (kNN)

To me, this is one of the most intuitive, easy-to-understand machine learning models from a layperson’s perspective. The basic idea is that given an unseen data point, we identify the k data points in our training set that are most ‘similar’ to it, and classify the new point as the majority class of those ‘neighbor’ points. In more technical terms, we define some distance metric — Euclidean distance, Manhattan distance, cosine similarity, etc. — and find the k training points for which that distance is smallest. This model is unique in that the fitting process requires just ‘saving’ the data (essentially zero cost); the bulk of the computational cost is in the prediction process, since it has to go through every single point in the training set to generate a single prediction. k-Nearest Neighbors breaks down if we have too many predictors (roughly more than 10), which can be explained by the Curse of Dimensionality: as the number of dimensions increases, the volume of the space grows so quickly that every data point ends up far away from every other, and comparing ‘neighbors’ with a distance metric loses its meaning.
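Since the whole algorithm fits in a few lines, here is a toy NumPy sketch (made-up data, Euclidean distance, majority vote). Note that the ‘fitting’ step is literally just keeping the training data around; all of the work happens at prediction time:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new as the majority class of its k nearest training points (Euclidean distance)."""
    # "Fitting" a kNN model is just keeping X_train and y_train around;
    # all of the computation happens here, at prediction time.
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()               # majority vote

# Made-up data: two clusters in 2D
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_predict(X_train, y_train, np.array([3.5, 3.5])))  # expected: 1
```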

Decision Trees & Random Forests

Decision Trees also have a high level of interpretability, and their visual form can often be understood even by those averse to quantitative ideas. The data is split at each ‘node’ by a decision rule that sends each observation left or right, depending on whether it satisfies that rule. When there is nothing left to split on, the resulting leaf nodes hold the predictions of the model.

A visual representation of Decision Trees.

When I explain Decision Trees to my friends, they all pretty much claim to know the concept without needing any further explanation, yet almost all of them are quick to add, “but how do you know which variable to split on, and at what threshold/value?” This introduces the idea of information gain. In simple terms, the algorithm goes through every possible predictor and every possible threshold, computes the ‘information gained’ from each potential split, and chooses the split that gives us the greatest information gain (i.e. best separates the classes). One shortcoming of a decision tree is its proneness to overfitting. Enter Random Forests. This model aggregates the predictions of many Decision Trees (typically in the hundreds), each of which is trained on a random subsample of the training set (and typically considers only a random subset of predictors at each split). In the case of a classifier, the predicted outcome is simply the majority vote of those hundreds of trees; if we’re using a regressor, the predicted outcome is the average of the trees’ predictions.
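A minimal scikit-learn sketch of the overfitting point, again on made-up data (the dataset and parameters are purely illustrative): a single fully-grown tree will typically score near-perfectly on the training set but drop off on held-out data, while the forest tends to generalize better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up classification data, just to compare the two models
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single, fully grown tree tends to memorize the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest of trees, each trained on a bootstrap sample, votes on the final prediction
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("single tree, train/test accuracy:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("random forest, train/test accuracy:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```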

A huge benefit of Decision Trees and Random Forests is their ability to model non-linear relationships, and interaction effects are naturally encoded in the way the model is set up. The downside, however, is that the outputs are not naturally probabilistic, i.e. we don’t have as good a grasp of how confident we are in our predictions, an area where a simpler model like Logistic Regression may have the upper hand.
