Ordinary Least Squares Method (p5.js code) and Intro to Gradient Descent

shaistha fathima
10 min read · Aug 29, 2019

Summary: I learn best by playing around with code and believe in sharing knowledge. This is the second post of the new beginner-friendly series — Basic Concepts You Should Know Before Starting with the “Neural Networks” (NN) — with a code example for the Ordinary Least Squares Method and an introduction to Gradient Descent. For more posts like these on machine learning basics, follow me here or on Twitter — Shaistha Fathima.

Basic Concepts You Should Know Before Starting with the “Neural Networks” (NN) Series

Hello everyone! Hope you have already read the first post of this series! If you have not read it yet, you can read it here and come back. I strongly suggest you have a look at it.

By now you know how the Ordinary Least Squares Method works! To put it simply, a line is created and moved to the position best suited to separate most of the red points from the blue points, like a partition, which helps predict the position of the next point! And don’t forget: it “works on linear data only”.

I think it’s better to show you; after all, practical experience is the best experience!

Let’s begin…

Linear Regression Example:

I am going to use p5.js to show how the line moves as points are added. p5.js is a JavaScript library used for creative coding. It is based on Processing, a creative coding environment. The main focus of Processing is to make it possible for beginners to learn how to program interactive, graphical applications, making programming more approachable by visualizing it. The interesting part is that Processing is not only for developers but also for artists, designers, researchers, and anyone who wants to enjoy making art.

Note: To know more about p5.js, you may refer to the official website or this Medium post by mhiratsuka: What is p5.js and How to use it?

1. Setup: Creating the canvas and giving it a background

// data: array that holds all the plotted points
var data = [];

// setup runs once at the start of the sketch
function setup() {
  createCanvas(400, 400); // width and height = 400
  background(51); // dark gray background
}

2. Create a point and store the data

function mousePressed() {
  // map() rescales the mouse click from pixel coordinates to the range 0..1
  // map(value, start1, stop1, start2, stop2)
  var x = map(mouseX, 0, width, 0, 1);
  // y is flipped (height -> 0) because screen y grows downward,
  // while we want the usual math convention of y growing upward
  var y = map(mouseY, 0, height, 1, 0);
  // create a point at the click position
  var point = createVector(x, y);
  // store it in data
  data.push(point);
}

3. Display the point

function draw() {
  background(51);
  // map each stored point back to pixel coordinates and draw it
  for (var i = 0; i < data.length; i++) {
    var x = map(data[i].x, 0, 1, 0, width);
    var y = map(data[i].y, 0, 1, height, 0);
    fill(255);   // white
    stroke(255);
    ellipse(x, y, 8, 8); // small circle
  }
  // fit and draw the line once there are at least two points
  if (data.length > 1) {
    linearRegression();
    drawLine();
  }
}

4. Draw Line

var m = 1;
var b = 0;

function drawLine() {
  // pick the two endpoints of the line at the edges of the 0..1 range
  var x1 = 0;
  var y1 = m * x1 + b; // equation of a straight line
  var x2 = 1;
  var y2 = m * x2 + b;

  // map the endpoints to pixel coordinates
  x1 = map(x1, 0, 1, 0, width);
  y1 = map(y1, 0, 1, height, 0);
  x2 = map(x2, 0, 1, 0, width);
  y2 = map(y2, 0, 1, height, 0);

  stroke(255, 0, 255); // magenta
  line(x1, y1, x2, y2); // draw the line between the two endpoints
}

5. Linear Regression

// calculating the slope (m) and the y-intercept (b)
function linearRegression() {
  var xsum = 0;
  var ysum = 0;

  for (var i = 0; i < data.length; i++) {
    xsum += data[i].x;
    ysum += data[i].y;
  }

  // mean of the x values and mean of the y values
  var xmean = xsum / data.length;
  var ymean = ysum / data.length;

  var num = 0;
  var den = 0;

  for (var i = 0; i < data.length; i++) {
    var x = data[i].x;
    var y = data[i].y;
    num += (x - xmean) * (y - ymean);
    den += (x - xmean) * (x - xmean);
  }

  /* slope = sum of (x - xmean) * (y - ymean)
     divided by the sum of (x - xmean) * (x - xmean) */
  m = num / den;
  b = ymean - m * xmean;
}

Result:

You may run this code here, in the p5.js code editor, and experience it for yourself.

Initially, with just two points, the line passes through them; when more points are added, the line moves closer to the dense cluster of points!

Quick Recap:

Now you know that linear regression is used to find the relationship between two variables by fitting a line (linear equation) to observed data, where one variable is the predictor (independent variable) and the other is the response (dependent variable).

Example: a modeler might want to relate the weights of individuals to their heights using a linear regression model.

We also know that the simplest method for fitting a regression line is the ordinary least-squares method, which finds the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations (the distances of the points from the line, d1, d2, … remember!) from each data point to the line (if a point lies exactly on the fitted line, its vertical deviation is 0). Because the deviations are first squared and then summed, there are no cancellations between positive and negative values.
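To make “minimizing the sum of the squared vertical deviations” concrete, here is a minimal sketch (a hypothetical helper, not part of the p5.js sketch above) that computes that sum for a candidate line:

// Sum of squared vertical deviations of the points from the line y = m*x + b.
// Ordinary least squares picks the m and b that make this value smallest.
function sumSquaredError(points, m, b) {
  var total = 0;
  for (var i = 0; i < points.length; i++) {
    var d = points[i].y - (m * points[i].x + b); // vertical deviation (d1, d2, …)
    total += d * d; // squaring removes the sign, so deviations never cancel
  }
  return total;
}

// Example: three points and a candidate line y = 0.5x + 0.1
var pts = [{ x: 0.1, y: 0.2 }, { x: 0.5, y: 0.4 }, { x: 0.9, y: 0.5 }];
console.log(sumSquaredError(pts, 0.5, 0.1)); // smaller is better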

Gradient Descent

If I were to define it: gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).

When do we use it?

Gradient descent is best used when the parameters cannot be calculated analytically (e.g., using linear algebra) and must be searched for by an optimization algorithm, i.e., when the problem is more complex than the simple 2D data we have discussed above.

What is a cost function?

This will get a bit technical, so please bear with me.

To put it simply, a cost function is a measure of “how good” a neural network did with respect to its given training sample and the expected output.

Example: In a Machine Learning Model

  • Consider a bunch of data points in a 2D space. Assume that the data is related to the heights and weights of a group of students. We are trying to find some kind of relationship between these quantities so that we can predict the weight of a new student afterwards. This is essentially a simple example of a supervised machine learning technique.
  • Let us now draw an arbitrary line in space that passes through some of these data points. The equation of this straight line would be Y = mX + b, where m is the slope and b is its intercept on the Y-axis.

Predictions

Given a known set of inputs and their corresponding outputs, a machine learning model tries to make predictions for a new set of inputs (just like predicting whether Student 3 will be accepted or not!).
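For the straight line above, making a prediction is just evaluating the equation for a new input; a one-line sketch (hypothetical helper name):

// Predicted weight (y) for a new student's height (x), given learned m and b
function predict(x, m, b) {
  return m * x + b;
}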

The error would be the difference between the predicted value and the actual value.

This relates to the idea of a Cost function or Loss function.

Cost Function

A Cost Function/Loss Function evaluates the performance of our machine learning algorithm. The Loss function computes the error for a single training example, while the Cost function is the average of the loss functions over all the training examples. Henceforth, I shall be using both terms interchangeably.

A Cost function basically tells us “how good” our model is at making predictions for a given value of m and b.

Let’s say there are a total of ‘N’ points in the dataset, and for all those ‘N’ data points we want to minimize the error. So the cost function would be the mean squared error (the total squared error averaged over the N points), i.e.

Cost = (1/N) × Σᵢ (yᵢ − ŷᵢ)², where ŷᵢ is the prediction for the i-th point

Cost function for N data points
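To mirror the loss/cost distinction above in code, a minimal sketch (hypothetical helper names) might look like this:

// Squared loss for a single training example (x, y) on the line y = m*x + b
function loss(x, y, m, b) {
  var err = y - (m * x + b);
  return err * err;
}

// Cost: the average of the per-example losses over all N data points
function cost(points, m, b) {
  var total = 0;
  for (var i = 0; i < points.length; i++) {
    total += loss(points[i].x, points[i].y, m, b);
  }
  return total / points.length;
}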

Why do we take the squared differences and not simply the absolute differences? Because the squared differences make it easier to derive a regression line: to find that line we need to compute the first derivative of the cost function, and it is much harder to differentiate absolute values than squared values. Also, squaring increases the error distance, making the bad predictions more pronounced than the good ones.

A derivative is a concept from calculus: it is the slope of the graph at a particular point.

Minimizing the Cost Function

The goal of any Machine Learning Algorithm is to minimize the Cost Function.

Why, you ask? That’s because a lower error between the actual and the predicted values signifies that the algorithm has done a good job of learning. Since we want the lowest error value, we want the ‘m’ and ‘b’ values which give the smallest possible error.

Gradient Descent Algorithm helps us to make these decisions efficiently and effectively with the use of derivatives.

In the equation y = mX + b, ‘m’ and ‘b’ are its parameters. During the training process there will be a small change in their values; let that small change be denoted by δ. The values of the parameters will be updated as m = m − δm and b = b − δb, respectively. Our aim here is to find those values of m and b in y = mx + b for which the error is minimum, i.e., the values which minimize the cost function.

Rewriting the cost function by substituting the prediction ŷᵢ = m·xᵢ + b:

Cost(m, b) = (1/N) × Σᵢ (yᵢ − (m·xᵢ + b))²

The idea is that, by being able to compute the derivative/slope of the function, we can find its minimum.
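As a rough sketch of how this applies to our line (assuming the global data, m, and b from the p5.js example, and an arbitrarily chosen learning rate), one gradient-descent update could look like this:

// One gradient-descent update of m and b using the derivatives of the cost:
//   dCost/dm = -(2/N) * sum( x * (y - (m*x + b)) )
//   dCost/db = -(2/N) * sum(     (y - (m*x + b)) )
var learningRate = 0.05; // step size: too large overshoots, too small is slow

function gradientDescentStep() {
  var N = data.length;
  var dm = 0;
  var db = 0;
  for (var i = 0; i < N; i++) {
    var err = data[i].y - (m * data[i].x + b); // prediction error for one point
    dm += (-2 / N) * data[i].x * err;
    db += (-2 / N) * err;
  }
  m = m - learningRate * dm; // step m a little downhill
  b = b - learningRate * db; // step b a little downhill
}

Calling gradientDescentStep() inside draw() instead of linearRegression() would nudge the line toward the best fit a little on every frame, instead of jumping straight to the closed-form answer.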

Gradient descent with a simple example

Imagine you’re standing somewhere on a mountain. You want to get as low as possible and as fast as possible, so you decide to follow these steps:

  • You check your current altitude, your altitude a step north, a step south, a step east, and a step west. Using this, you figure out which direction you should step to reduce your altitude as much as possible in this step.
  • Repeat until stepping in any direction will cause you to go up again.

Now, this is a wonderfully simple algorithm. You just have to figure out which direction the mountain is sloping where you are (this is the gradient) and take a step (this is your descent). This algorithm is great because it will quickly get you to a low point.
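A toy version of this hiker in code (the altitude function and step size here are made up for illustration):

// Naive "hiker" descent: probe a step in each direction and move wherever
// the altitude drops the most; stop when every direction goes up.
function altitude(x, y) {
  return x * x + y * y; // a bowl-shaped surface with its lowest point at (0, 0)
}

function hikeDownhill(x, y, step) {
  while (true) {
    var moves = [[x + step, y], [x - step, y], [x, y + step], [x, y - step]];
    var best = [x, y];
    for (var i = 0; i < moves.length; i++) {
      if (altitude(moves[i][0], moves[i][1]) < altitude(best[0], best[1])) {
        best = moves[i]; // this direction goes further downhill
      }
    }
    if (best[0] === x && best[1] === y) break; // every step would go up: stop
    x = best[0];
    y = best[1];
  }
  return [x, y];
}

console.log(hikeDownhill(3, -2, 0.5)); // ends at the bottom, [0, 0]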

From our strategy, we would always walk in the direction that points the most downhill, and keep walking until it’s impossible to walk downhill any further. It’s not a bad strategy — it should work pretty well most of the time.

However, it has the weakness that it does not guarantee that it will find the lowest point out of the entire search area.

Example: The main downside to this strategy is that you might end up in a small hole where, if you could only walk uphill just a little bit, you could reach a much lower point.

Example: If you found a mountain lake, then you could cause your gradient descent to stop while you’re still high on the mountain. There are much lower areas that you could get to in relatively few steps, but your method will not lead you there. (There are extra things you can do that modify a gradient descent to make it less likely to get stuck in a small local minimum).

So, why or when do we use it?…

This algorithm is great for anything that requires finding a local minimum, especially if it is really difficult to find the solution analytically (i.e. by solving equations for exact solutions; for example, by taking derivatives and setting them equal to zero).

Here is one real-world example I could find online!

“Inverse kinematics” is the process by which you figure out what angle each of the joints on an arm needs to be at in order to get the hand to wind up in the right place (sorry if this is too complex an example; I’m trying to make it as layman-friendly as possible). It turns out that it’s quite easy to take the angles of the joints and figure out where that will put the hand (this is “forward kinematics”), but it’s fairly challenging to do the reverse. This is an important thing to be able to do when working with robotics.

One of the standard solutions here is to use gradient descent. You can “check your current altitude” by using forward kinematics to find the hand’s position for the current joint angles. You can then “take a step north/south, etc.” by adding or subtracting a little bit to each of the joint angles and recalculating the position. If the position is closer to your goal, then that is “downhill”; farther is “uphill.” From there, you increment the joints in the direction that takes the hand the biggest step toward the goal, and you repeat until the hand gets where it needs to go. The cool thing about this approach is that it will make the hand take a nice straight path through space, even if this means the joints are doing weird things to get it there.
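Here is a minimal sketch of that loop for a two-joint arm with unit-length links (all names and numbers are hypothetical, just to illustrate the idea described above):

// Forward kinematics: where the hand ends up for the given joint angles
function handPosition(a1, a2) {
  return {
    x: Math.cos(a1) + Math.cos(a1 + a2),
    y: Math.sin(a1) + Math.sin(a1 + a2)
  };
}

// The "altitude" here is the distance from the hand to the goal
function distanceToGoal(a1, a2, goal) {
  var p = handPosition(a1, a2);
  return Math.hypot(p.x - goal.x, p.y - goal.y);
}

// Nudge each joint angle a little in whichever direction is "downhill"
function solveIK(a1, a2, goal, step, iterations) {
  for (var i = 0; i < iterations; i++) {
    if (distanceToGoal(a1 + step, a2, goal) < distanceToGoal(a1, a2, goal)) {
      a1 += step;
    } else {
      a1 -= step;
    }
    if (distanceToGoal(a1, a2 + step, goal) < distanceToGoal(a1, a2, goal)) {
      a2 += step;
    } else {
      a2 -= step;
    }
  }
  return { a1: a1, a2: a2 };
}

console.log(solveIK(0.5, 0.5, { x: 1.0, y: 1.0 }, 0.01, 500)); // angles that put the hand near (1, 1)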

Conclusion

In the next post of this series, we will look into what a neural network is by answering a few simple questions (just an intro, not too deep) and learn about the building block of neural networks called the perceptron. As always, you can ping me with any doubts or reach out to me on Twitter.

You may also take a look at my other series on “Introduction to Tensors” if interested!

Till then, stay tuned and happy learning!
