Part II: Algorithms: The Pillars of Data Science

Arunank Divya Sharan · Published in DataDreamsDragons · 10 min read · Nov 6, 2018

A data scientist without his algorithms is like the Joker without chaos, Bane without darkness, Obi-Wan Kenobi without the Force, Wakanda without vibranium and Thanos without his gauntlet! I could go on but you get the gist!

As Bane says, you do not merely adopt the algorithms! You need to be born in them! Molded by them! Or they will betray you and you will be just a man without a plan, like the Joker!

Enough said!

Let us start with a high level classification of algorithms used:

  1. Supervised Learning
  2. Unsupervised Learning

(There are other types as well which are not mentioned or explained here! Remember, no overfitting!)

Supervised Learning

In supervised learning, there are independent (or input or predictor or feature or X) variables which are used to predict the dependent (or output or predicted or response or target or Y) variable.

The key point here is that for the training set, we also have the corresponding value of target variable i.e. the desired solution or labels. We have the values for Y for every instance of X.

This is different from the Unsupervised learning where the value of Y is not available for the training dataset.

(Optional: For anyone looking to go into more detail, the link below is very useful:)

https://www.statisticshowto.datasciencecentral.com/independent-variable-definition/

Within supervised learning, the output variable can be of two types: Continuous and Categorical

  1. Continuous Output (Y): In this case the output variable can take a value on a continuous range. For such cases, we use Regression (a common example being Linear Regression). e.g. Prediction of house prices (Kaggle)
  2. Categorical Output (Y): The output variable belongs to one of two or more classes/categories. We use Classification (a common example being Logistic Regression). e.g. the most famous dataset, Predicting Survival for Titanic Passengers, where the objective is to predict whether the output belongs to class 0 (Did Not Survive) or class 1 (Survived).
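To make this distinction concrete, here is a tiny Python sketch with made-up numbers, showing what the output variable Y looks like in each case:

    import numpy as np

    # Continuous output -> Regression: Y can take any value on a range
    y_regression = np.array([245000.0, 312500.0, 189900.0])  # hypothetical house prices

    # Categorical output -> Classification: Y belongs to one of a fixed set of classes
    y_classification = np.array([0, 1, 1])  # 0 = Did Not Survive, 1 = Survived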

Regression:

We will start with Linear Regression since it is a simple and intuitive way to understand what is happening. There are other algorithms as well which can perform Regression; we will discuss those in later posts.

Right now, the first objective is to develop an intuition for the commonly used jargon and terms and the basic flow of how any algorithm works. (Prioritize!)

Linear Regression and Flow of ML in general

Before we start, let me make it clear — Linear Regression is just a fancy name for a straight line! See, not so difficult now, is it! We all know the equation of a straight line:

y = mx + c

  • y = Output/Dependent/Response/Predicted variable
  • x = Input/Independent/Feature/Predictor variable
  • m = Slope
  • c = Intercept/Bias (in ML); it is the same as m0 used later.

That is all there is to Linear Regression in a nutshell!

Surprised and disappointed? Don’t be! Let us dive deeper now.

General Structure of Dataset in Supervised Learning

Imagine a case: (Read slowly and carefully! Read it again and then again and then once more!)

  • You are given the ages of 5 children between 0 and 15 years. This is your independent/input variable X. It will be the training set.
  • You are given the corresponding height of these 5 children in centimeters. This is your dependent/output variable Y. It is a continuous variable. This data is already known. This will form the output for the training set.
  • You are asked to predict the height for the 6th child based on age which will be provided. This is the test set i.e. you will need to predict Y for a previously unseen X.
  • In a nutshell, this means you have been asked to build a Regression model to predict the height of a child given the age.

Now, go back and read the case once more!

If you are able to visualize this and separate the different parts in your head:

  • training data, X_training or input in training data, Y_training or known output in training data, Y_predicted or predicted output based on training data
  • test data, X_test or input in testing data, Y_test which is unknown, Y_test_predicted or predicted output based on testing input data,

mark my words — you have won a significant battle!

The above is the basic structure of data that you will receive for any Supervised Learning task.

Since the output is continuous, it is Regression. If the output was categorical, the task would be Classification.
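Here is a minimal Python sketch of that structure. The heights are the same values used later in this post; the ages are made-up numbers, used only to show which pieces are known and which must be predicted:

    import numpy as np

    # Training data: both the input X and the output Y are known
    X_train = np.array([1.5, 4.0, 7.5, 0.5, 13.0])       # ages of 5 children in years (hypothetical)
    Y_train = np.array([82.5, 94.2, 112.9, 47.0, 160.0])  # their known heights in cm

    # Test data: only the input X is known
    X_test = np.array([10.0])   # age of the 6th child (hypothetical)
    # Y_test is unknown -> the model must produce Y_test_predicted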

General Flow of Machine Learning Modelling

1. Discovering the ‘Function’

If you were one of the diligent ones in your 11th standard, you might recall from basic algebra and calculus that we studied something called ‘functions’. Functions expressed how one varying quantity (Y) depended on another (X). Basically, the function f() represented the link between X and Y. It was written as:

Y = f(X); Y is a function of X.

In ML, as you go deeper, you will see a close parallel happening in the algorithms.

In the above dataset and in fact, in any dataset for supervised learning, there will be X_training and Y_training which is known. This X_training and Y_training will have a relationship between themselves.

Our objective in ML is to discover a function which closely resembles this relationship between X_training and Y_training. We are not looking for an exact and perfect function but a reasonable estimation.

(Note: This estimation is not the end in itself. There is another part to it which is generalization. The generalization performance of a model indicates its prediction capability on independent and previously unseen test data. This is our ultimate goal — to predict the output for previously unseen data correctly and efficiently! This aspect is discussed briefly in a later post on Bias-Variance Tradeoff and separately with respect to every model. :))

This process of estimating the function is called modelling an ML algorithm. This function may be linear, exponential, polynomial etc.

In Linear Regression, we assume that this relationship is linear and can be represented by a straight line. So, we take a straight line and pass it through the training data, trying to find the best slope (m) and intercept or bias (c), so that the line resembles the actual relationship as closely as possible.

Essence of linear regression: Fitting a straight line

Thus, Linear Regression makes predictions by using a weighted sum of the input features (mx) and a constant term or intercept or bias (c). To keep things simple, I have assumed only a single input (Age) for this particular example.
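As a rough sketch (using the same hypothetical age/height numbers from earlier), fitting this straight line with scikit-learn looks something like this; the library learns m and c for us:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X_train = np.array([[1.5], [4.0], [7.5], [0.5], [13.0]])  # ages in years, one row per child
    Y_train = np.array([82.5, 94.2, 112.9, 47.0, 160.0])      # heights in cm

    model = LinearRegression()
    model.fit(X_train, Y_train)

    print("slope m:", model.coef_[0])         # learned slope
    print("intercept c:", model.intercept_)   # learned intercept/bias
    print("height at age 10:", model.predict([[10.0]])[0])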

We will venture into the territory of multivariate regression now!

The name appears scary but in reality, it is just another fancy name for a case where there is more than one input variable. e.g. Weight is added as another input variable in addition to Age in the above example. Thus, for every single point, the equation will become:

Y_predicted = m0 + m1 * Age + m2 * Weight

  • m0 is the intercept or bias here (same as c shown in the image).

We can generalize this in matrix form as given below. Please observe the image given below very carefully and try to understand what is happening.

In matrix notation: Y_predicted = X.M

Matrix Form

(In some cases, you will see an additional irreducible error term after X.M. We omit that here to keep it simple. I will explain that in bias-variance tradeoff discussion).
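Here is a small NumPy sketch of that matrix form, with made-up Age and Weight values and an M chosen only for illustration. The first column of X is all ones so that the bias m0 gets folded into the matrix product:

    import numpy as np

    # Hypothetical inputs: Age (years) and Weight (kg) for 3 children
    age = np.array([1.5, 4.0, 7.5])
    weight = np.array([10.0, 16.0, 24.0])

    # X has a column of ones (for the bias m0), then one column per feature
    X = np.column_stack([np.ones_like(age), age, weight])

    # Parameter vector M = [m0, m1, m2] (illustrative values only)
    M = np.array([45.0, 6.0, 0.8])

    # Y_predicted = X.M
    Y_predicted = X @ M
    print(Y_predicted)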

Let us select a particular instance, say y2. This can be represented in vectorised form as y2_predicted = x2.m, where x2 is the row of input values for the 2nd instance (with a leading 1 for the bias) and m is the parameter vector, as shown in the image below. For the vectorised form, we will use m as the notation.

Our goal is to calculate the values in the M matrix or m vector, which are the ‘parameters’ of this algorithm. We need to find the value of M which minimizes the Cost Function. What the Cost Function is and how this minimisation is achieved are discussed in the points below.

2. The ‘Cost Function’

Remember, as we discussed in the previous point, we are trying to estimate a function that represents the actual relationship between X_training and Y_training.

This function gives an output Y_predicted based on X_training.

Please be very clear on this point: Y_predicted on X_training is different from Y_training which is already known.

Just to be clear, I am repeating myself for the nth time — there are three different components:

  • X_training
  • Y_training which is known for training set
  • Y_predicted which is predicted by the model during training based on X_training as input

Now, we are estimating a function which can output Y_predicted as close as possible to Y_training, which is already known. The objective is to minimize the difference between Y_predicted and Y_training! The smaller the gap, the better our estimated function resembles the actual relationship between X_training and Y_training. This is what is meant by the statement: ‘how well does a model fit the data’!

First, we need a metric, a measure or number of some sort which measures this difference between Y_training and Y_predicted. Then, we need a way to somehow minimize that difference between Y_training and Y_predicted for the training set.

This is where the ‘Cost Function’ comes into the picture. And make no mistake, the picture is not over yet, my friend! Also, keep in mind that different algorithms have different cost functions.

Mean Squared Error:

For Linear Regression, we use a metric called Mean Squared Error (MSE) as the cost function.

Please do not be afraid of the name! It is what it says. Let us break it down.

Mean + Squared + Error

a. Error: This is the difference between Y_training and Y_predicted for every point.

  • Point 1: (82.5–79.6) = 2.9 cm
  • Point 2: (94.2–97.3) = -3.1 cm
  • Point 3: (112.9–115) = -2.1 cm
  • Point 4: (47–45.7) = 1.3 cm
  • Point 5: (160–159.2) = 0.8 cm

b. Squared:

  • Point 1 Squared: (2.9)² = 8.41
  • Point 2 Squared: (-3.1)² = 9.61
  • Point 3 Squared: (-2.1)² = 4.41
  • Point 4 Squared: (1.3)² = 1.69
  • Point 5 Squared: (0.8)² = 0.64

c. Mean: (Point 1 Squared + Point 2 Squared + Point 3 Squared + Point 4 Squared + Point 5 Squared)/(Total Number of Points or Instances i.e. 5) = (8.41+9.61+4.41+1.69+0.64)/5 = 4.952

So, you calculate the error for every point/instance in training, square the error and take its mean. And, voila, you get the Mean Squared Error.
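The same calculation in a few lines of NumPy, using the five Y_training and Y_predicted values from above:

    import numpy as np

    Y_training = np.array([82.5, 94.2, 112.9, 47.0, 160.0])   # known heights in cm
    Y_predicted = np.array([79.6, 97.3, 115.0, 45.7, 159.2])  # model predictions in cm

    errors = Y_training - Y_predicted   # a. Error for every point
    squared_errors = errors ** 2        # b. Squared
    mse = squared_errors.mean()         # c. Mean
    print(mse)                          # ~4.952, matching the calculation above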

This cost function, as we discussed, measures how far our estimated function or model is from the actual relationship and the goal is to minimize this value of MSE.

3. Minimising Cost Function: Training the Model

We discussed earlier that the ultimate objective is to estimate the M matrix from the equation Y = X.M in a way that it minimizes the Cost Function i.e. MSE for Linear Regression.

This is called training the model. There are different approaches to achieve this.

  • Normal Equation Approach: For Linear Regression, this can be done mathematically. It is clear that the equation is Y = X.M in matrix form. Also, we saw another representation for the ith instance in vectorized form. Here, the values of X_training (shown as X in the image) and Y_training are already known to us. Thus, all we need to do is perform a few linear algebra operations.

In the end, we get the solution shown in the image, the well-known Normal Equation: M = (Xᵀ.X)⁻¹.Xᵀ.Y.

This M is the value that will minimize the cost function and will be used for predictions.

There is an important catch here, however. This method becomes very slow when the number of input features is very large, as it involves computing the transpose and inverse of large matrices built from X_training.
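A minimal NumPy sketch of the Normal Equation, using the same hypothetical age/height data from earlier in the post:

    import numpy as np

    # Training data: a bias column of ones plus Age, and the known heights
    X = np.column_stack([np.ones(5), np.array([1.5, 4.0, 7.5, 0.5, 13.0])])
    Y = np.array([82.5, 94.2, 112.9, 47.0, 160.0])

    # Normal Equation: M = (X^T . X)^-1 . X^T . Y
    M = np.linalg.inv(X.T @ X) @ X.T @ Y
    print("bias m0:", M[0], "slope m1:", M[1])

    # Predictions from the fitted model
    Y_predicted = X @ M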

  • Gradient Descent Approach: Here, the values in the M matrix are randomly initialized i.e. we randomly assign values to M. Then, predictions are made. After that, the values in the M matrix are changed bit-by-bit (iteratively) in a way that the Cost Function decreases and ultimately reaches a minimum.

The details of how this approach works will be discussed separately. I have kept it brief here so as to keep this post to the point with respect to the general workings of an ML algorithm.
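Just to make the iterative idea concrete, here is a bare-bones sketch of gradient descent for the same hypothetical data. The learning rate and number of iterations are arbitrary choices, and the reasoning behind the update rule is left for the dedicated post:

    import numpy as np

    # Same hypothetical data: bias column + Age, and the known heights
    X = np.column_stack([np.ones(5), np.array([1.5, 4.0, 7.5, 0.5, 13.0])])
    Y = np.array([82.5, 94.2, 112.9, 47.0, 160.0])
    n = len(Y)

    M = np.random.randn(2)        # random initialisation of [m0, m1]
    learning_rate = 0.01
    for _ in range(10000):        # nudge M bit-by-bit so that the MSE decreases
        Y_predicted = X @ M
        gradient = (2 / n) * (X.T @ (Y_predicted - Y))   # gradient of the MSE with respect to M
        M = M - learning_rate * gradient

    print("bias m0:", M[0], "slope m1:", M[1])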

So, concluding this discussion on general flow of ML algorithms and Linear regression in particular, we need to keep in mind the following points:

  • What is the structure of dataset?
  • Are the values for Y_training, i.e. the output variable, available for the training dataset? In other words, is it a Supervised or an Unsupervised learning problem?
  • Is Y_training, i.e. the output variable, continuous or categorical? In other words, is it a Regression or a Classification problem?
  • The goal is to discover the function f() which closely represents the actual relationship between X_training and Y_training. This function f() outputs Y_predicted based on X_training data.
  • Cost Function is used to determine how closely the function f() estimated above resembles the actual relationship between X_training and Y_training. The goal is to minimize the Cost Function.
  • Mean Squared Error is used as the Cost Function for Linear Regression. Different algorithms have different Cost Functions.
  • For Linear Regression, a direct mathematical solution (the Normal Equation) can be used to calculate the value of M which minimizes the Cost Function and gives a good model. This mathematical solution is not available for other algorithms such as Logistic Regression.
  • Gradient Descent is a generic approach for finding the optimal value of the M matrix.

The concept of Bias-Variance Tradeoff and Regularisation will be discussed in separate posts.

Phew! That was a lot! If you have made it this far, you truly are a ‘Special One’, my friend.

In my next post, I will discuss Logistic Regression in detail.

See you there soon! Sayonara!
