FUNDAMENTALS OF SUPERVISED LEARNING FROM SCHOOL MATHEMATICS!!

P Karthik · Published in DataSeries · Apr 28, 2020 · 7 min read

In my previous blog, I gave an Introduction to Machine Learning. If you haven’t read it yet, you can read it here. This post will briefly brush up on what was introduced earlier.

Ready for some School Maths? 👨‍🎓

This blog covers Linear Regression in Supervised Machine Learning. By the end of it, you will have learned how a model learns to predict from labeled data using concepts like Gradient Descent and the Squared Error Function.

Let's start with an analogy.

  1. When we started to learn the alphabet as children, a tutor would first hold our hand while we wrote and spell it out: a for apple, b for bat, and so on. This is the initial step in training the human brain. The next time, we were asked to write on our own, and in the beginning, mistakes were common. To get rid of these mistakes, we were evaluated on how large each mistake was and then corrected. The human brain gradually improved at recollecting, writing, and spelling the alphabet. This is Supervised Learning. In this type of learning, the machine learning model is trained on labeled data, i.e. it knows the correct expected output, and it corrects its understanding based on the error in its previous output.
  2. Now, imagine you are given a diagram like the one below. Putting aside what this data represents, when we look at it, we can make out 3 clusters. Such clusters could represent data like astronomical groupings, business sectors, etc. This method of learning, where the data has no labeled responses, is known as Unsupervised Learning.
Clustering of data

SUPERVISED LEARNING

In supervised learning, there are two types of problems based on the output. When the predicted value is continuous (e.g. house price prediction), it is a Regression problem; when the output is a set of discrete values, say, whether a person has diabetes (1) or doesn’t (0), it is a Classification problem.

Univariate Linear Regression

When a model takes in a single variable x to predict y, it is known as Univariate Linear Regression.

We will talk in detail about:

  • Hypothesis
  • Cost Function
  • Gradient Descent

1. Hypothesis

Do you remember the equation of a straight line? Well, now there's a purpose behind learning it.

y = m*x + c   (m = slope of the line, c = y-intercept)

In Machine Learning, we define the hypothesis as:

hϴ(x) = ϴ0 + ϴ1*x

Doesn’t this look similar to the equation of the line? Indeed it is.

Time for some brain work…

From the table below, guess the values of the parameters (ϴ0 and ϴ1).

Hope you figured it out: for x = 4, y = 11. Initially, your brain made a guess from the first (x, y) pair, thinking ϴ0 = 0 and ϴ1 = 3. By checking the next pair, you noticed the error and finally arrived at ϴ0 = 3 and ϴ1 = 2, which gives y = 3 + 2*4 = 11.
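To make this concrete, here is a minimal Python sketch of the univariate hypothesis, using the parameter values we just guessed.

```python
# A minimal sketch of the univariate hypothesis h(x) = Theta0 + Theta1 * x.
def hypothesis(x, theta0, theta1):
    """Predict y for a single input x with a straight line."""
    return theta0 + theta1 * x

# With the parameters guessed above (theta0 = 3, theta1 = 2), x = 4 gives 11.
print(hypothesis(4, theta0=3, theta1=2))  # -> 11
```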

Below is the plot of the hypothesis for a few values of ϴ1.

From the graph, it is clear that ϴ1 = 1 (assuming ϴ0 = 0) fits the actual data perfectly. Congrats, you have made your first prediction! Extend this problem to House_Price (y) vs Area_of_house (x). If both parameters ϴ0 and ϴ1 are considered, then three dimensions exist, i.e. X, Y, Z.

Also, the price of a house doesn't depend only on Area; it also depends on No_of_bedrooms, Size_of_bedroom, etc., which add up as the features. For 'N' features we have an N-dimensional input space, and the hypothesis becomes:

hϴ(x) = ϴ0 + ϴ1*x1 + ϴ2*x2 + … + ϴN*xN

Therefore, the hypothesis is a function that maps the input data (x’s) to the output data (y’s). ϴ0 is also called the bias.
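For several features, the same idea extends naturally. Below is a small sketch using a parameter vector; the example feature values and weights are made up purely for illustration.

```python
import numpy as np

# A sketch of the N-feature hypothesis h(x) = Theta0 + Theta1*x1 + ... + ThetaN*xN.
def hypothesis(x, theta):
    """x: feature vector of length N; theta: vector of length N + 1 (theta[0] is the bias)."""
    return theta[0] + np.dot(theta[1:], x)

x = np.array([120.0, 3.0, 15.0])        # e.g. Area, No_of_bedrooms, Size_of_bedroom
theta = np.array([5.0, 0.8, 2.0, 0.5])  # bias followed by one weight per feature
print(hypothesis(x, theta))
```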

2. Cost Function/Mean Squared Error

We have seen how to make a guess and fit a line, but we don’t know how accurate that fit is. To solve this, we use a Cost Function to measure the accuracy of the hypothesis function.

The Cost Function takes an average (actually a slightly fancier version of an average) of the squared differences between the hypothesis’s predictions for the x’s and the actual outputs y’s.

J(ϴ0, ϴ1) = (1/(2m)) * Σ (hϴ(xi) − yi)²   (Cost Function, summed over all m training examples)

The 1/2 is there for simplicity’s sake and will be helpful when we get to Gradient Descent. This function is also called the Mean Squared Error or Squared Error Function, as the error for each example is squared and then summed up. Since we assume that ϴ0 = 0, we can write the cost function here simply as J(ϴ1).
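Here is a small sketch of this cost function in Python. The training points are assumed and chosen to lie exactly on y = x, so ϴ1 = 1 with ϴ0 = 0 gives zero cost.

```python
import numpy as np

# A sketch of the cost function J(Theta0, Theta1) = (1/(2m)) * sum((h(xi) - yi)^2).
def cost(x, y, theta0, theta1):
    m = len(x)
    predictions = theta0 + theta1 * x        # h(x) for every training example
    return ((predictions - y) ** 2).sum() / (2 * m)

x = np.array([1.0, 2.0, 3.0])   # assumed data, lying exactly on y = x
y = np.array([1.0, 2.0, 3.0])
print(cost(x, y, theta0=0.0, theta1=1.0))    # 0.0 -> a perfect fit
```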

By plotting the hypothesis for different values of ϴ1 and calculating the respective Cost Function, we get:

Here we have seen three values. Try solving for ϴ1=0 and ϴ1=2. Below is the tabular representation of the predicted parameter and its respective cost function.

This can be extended to n number of predictions. By plotting these values, we get a parabolic curve :

From the graph, it is clear that when ϴ1 = 1, J(ϴ1) is at its global minimum, i.e. its least value. This makes sense: just look at the plot of hϴ(x) vs x for ϴ1 = 1 (with ϴ0 = 0). The predicted line passes through all the points, so it is the best fit.
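You can reproduce the parabola numerically. The sketch below evaluates J(ϴ1) for a few values of ϴ1 on the same assumed points and shows that the cost is smallest at ϴ1 = 1.

```python
import numpy as np

# Evaluate J(Theta1) for several Theta1 values, assuming Theta0 = 0
# and the same illustrative points on y = x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    J = (((theta1 * x) - y) ** 2).sum() / (2 * len(x))
    print(f"theta1 = {theta1:3.1f}  ->  J = {J:.3f}")   # smallest at theta1 = 1.0
```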

3. Gradient Descent

While computing the Cost Function, we did trial and error and arrived at the global minimum by trying different values. There must be a better way, right?

Gradient Descent solves this using the concept of the slope, which you would have come across during your schooling. It is a minimization procedure; in this case, it minimizes the Cost Function. Here’s how it works.

Case 1: At point ‘a’, the curve is decreasing, i.e. the slope is negative. In order to reach the minimum, the value of ϴ should be shifted to the right (increase ϴ).

Case 2: At point ‘b’, the curve is increasing, i.e. the slope is positive. In order to reach the minimum, the value of ϴ should be shifted to the left as indicated (decrease ϴ).

From the above cases, if the value of ϴ is adjusted according to the slope at that point, the global minimum can be reached.
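The sign of the slope is easy to check numerically. Below is a minimal sketch, on the same assumed data points lying on y = x with ϴ0 fixed at 0, that estimates the slope of J at two values of ϴ1 matching the two cases above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # assumed data, lying exactly on y = x
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    return (((theta1 * x) - y) ** 2).sum() / (2 * len(x))

def slope(theta1, eps=1e-6):
    # central-difference estimate of dJ/dTheta1
    return (J(theta1 + eps) - J(theta1 - eps)) / (2 * eps)

print(slope(0.5))  # negative slope (case 1): move Theta1 to the right
print(slope(1.5))  # positive slope (case 2): move Theta1 to the left
```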

Initially, we set ϴ0 and ϴ1 to zero, and Gradient Descent then takes repeated steps until it reaches the values that minimize the Cost Function.

α, alpha, is the learning rate, or how quickly we want to move towards the minimum.

  • If α is too large, gradient descent can overshoot and may never settle at the global minimum.
  • If α is too small, it may take a very large number of iterations to reach the global minimum.

The learning rate can be related to how we learned the alphabet. If we try to learn too fast, we might miss a few details and make mistakes. On the contrary, if we learn too slowly, we might spend a lifetime on it. Therefore, an optimal value of the learning rate (α) is necessary to reach the global minimum in a feasible number of steps.

Finally, let’s see the complete gradient descent update. Using partial derivatives of the Cost Function, the general rule (repeated until convergence, updating ϴ0 and ϴ1 simultaneously) is:

ϴj := ϴj − α * ∂J(ϴ0, ϴ1)/∂ϴj   for j = 0, 1

and after working out the partial derivatives, the final equations are:

ϴ0 := ϴ0 − α * (1/m) * Σ (hϴ(xi) − yi)
ϴ1 := ϴ1 − α * (1/m) * Σ (hϴ(xi) − yi) * xi
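Putting the update equations into code, here is a minimal batch gradient descent sketch in Python. The data (generated from y = 3 + 2x, like the earlier table exercise), the learning rate, and the number of iterations are assumptions chosen only for illustration.

```python
import numpy as np

# A minimal batch gradient descent sketch implementing the two update equations above.
def gradient_descent(x, y, alpha=0.1, iterations=1000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0                 # initialize both parameters to zero
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y     # h(xi) - yi for every example
        grad0 = error.sum() / m               # partial derivative w.r.t. Theta0
        grad1 = (error * x).sum() / m         # partial derivative w.r.t. Theta1
        theta0 -= alpha * grad0               # simultaneous update:
        theta1 -= alpha * grad1               # both gradients use the old parameters
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 7.0, 9.0, 11.0])           # assumed data generated from y = 3 + 2x
print(gradient_descent(x, y))                 # should approach (3.0, 2.0)
```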

Congrats, you have learned Linear Regression!

*******************************************************************

Bringing it all together — Linear Regression

  • A Hypothesis function that fits the training data to the output.
  • A Cost Function (Mean Squared Error) that measures how far those predictions are from the actual outputs,

and to minimize the cost function, we use gradient descent.

========================================

Hope this was a wonderful journey. Stay tuned for a coding implementation of Linear Regression in Python.

If you have any queries regarding what has been explained, contact me at pullarevuvit1145@gmail.com

Linkedin: https://www.linkedin.com/in/karthikpullarevu

Resources used: Microsoft One Note, Microsoft Whiteboard
