Linear Regression - Deep View (Part 1)

Midhilesh elavazhagan
3 min read · Feb 17, 2019


Linear Regression is a supervised learning algorithm used to predict continuous values.

What is Supervised Learning?

In supervised learning, we are given a training set containing both features x and labels y, and the algorithm learns a function h: x → y so that h(x) is a good predictor of the corresponding value y. Here, h is called the hypothesis.

What is the hypothesis in linear regression?

To perform supervised learning, we need to decide how we are going to represent the function h. Let’s approximate y as a linear function of x:

h(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ = θᵀx

where the θᵢ’s are the parameters (or weights) parameterizing the space of linear functions mapping x to y, with the convention x₀ = 1 so the sum can be written compactly as θᵀx.
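As a minimal sketch (assuming NumPy, and a feature vector x whose first entry is the intercept term x₀ = 1), the hypothesis is just a dot product between the parameter vector and the feature vector; the function name hypothesis is only illustrative:

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n = theta . x,
    where x_0 = 1 is the intercept term."""
    return np.dot(theta, x)
```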

Given a training set, how are we going to learn the parameters θ?

One reasonable method is to make h(x) close to y, at least for the training examples. To formalize this, let’s define a function J(θ) that measures, for each value of θ, how close the h(x⁽ⁱ⁾)’s are to the corresponding y⁽ⁱ⁾’s. We call J(θ) the cost function:

J(θ) = ½ ∑ᵢ₌₁ᵐ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
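As a sketch of this cost function in NumPy (assuming X is an m × (n+1) design matrix whose first column is all ones; the names X, y, and theta are illustrative conventions, not from the article):

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2.

    X: (m, n+1) design matrix with a leading column of ones (intercept),
    y: (m,) target vector, theta: (n+1,) parameter vector.
    """
    residuals = X @ theta - y          # h(x_i) - y_i for every example
    return 0.5 * np.sum(residuals ** 2)
```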

How to update the parameters?

We want to choose θ so as to minimize J(θ). To do this, let’s use a search algorithm called gradient descent, which starts with some initial guess for θ and repeatedly performs the update

θⱼ := θⱼ − α ∂J(θ)/∂θⱼ

where α is called the learning rate. Taking the partial derivative of J(θ) for a single training example (x, y) gives

∂J(θ)/∂θⱼ = ( h(x) − y ) xⱼ

So, for a single training example, the update rule is

θⱼ := θⱼ + α ( y⁽ⁱ⁾ − h(x⁽ⁱ⁾) ) xⱼ⁽ⁱ⁾

This rule is called the LMS update rule (“LMS” stands for “least mean squares”) and is also known as the Widrow-Hoff learning rule.
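A single LMS step can be sketched as follows (same NumPy conventions as above; alpha is the learning rate, and the function name is hypothetical):

```python
import numpy as np

def lms_update(theta, x_i, y_i, alpha):
    """One LMS (Widrow-Hoff) step for a single example:
    theta_j := theta_j + alpha * (y_i - h(x_i)) * x_ij, for every j."""
    error = y_i - np.dot(theta, x_i)    # y_i - h(x_i)
    return theta + alpha * error * x_i  # vectorized over all j at once
```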

How to train the model using the LMS rule?

Batch gradient descent: (θ is updated only after a full pass over the whole dataset)

Repeat until convergence {

    θⱼ := θⱼ + α ∑ᵢ₌₁ᵐ ( y⁽ⁱ⁾ − h(x⁽ⁱ⁾) ) xⱼ⁽ⁱ⁾    (for every j)

}
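A minimal NumPy sketch of batch gradient descent with the LMS rule (the learning rate and the fixed iteration count are illustrative choices; a real implementation would use a convergence check):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Update theta using the gradient summed over the whole dataset."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = y - X @ theta                  # y_i - h(x_i) for all m examples
        theta = theta + alpha * (X.T @ errors)  # one update per full pass, for every j
    return theta
```

A quick sanity check for any of these sketches is to generate synthetic data with a known θ and verify that the learned parameters roughly recover it.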

Stochastic gradient descent: (θ is updated for every data point)

Loop {

    for i = 1 to m {

        θⱼ := θⱼ + α ( y⁽ⁱ⁾ − h(x⁽ⁱ⁾) ) xⱼ⁽ⁱ⁾    (for every j)

    }

}
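The corresponding sketch for stochastic gradient descent, updating θ after every single example (shuffling each pass is a common practical choice, not part of the rule itself):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """Update theta once per training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):       # visit examples in a random order
            error = y[i] - X[i] @ theta          # y_i - h(x_i) for a single example
            theta = theta + alpha * error * X[i]
    return theta
```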

Mini-batch gradient descent: (a trade-off between batch gradient descent and stochastic gradient descent. We split the dataset into mini-batches and update θ once for every mini-batch, with the gradient averaged over the examples in that batch.)
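And a sketch of mini-batch gradient descent, where each update uses the gradient averaged over one mini-batch (the batch size of 32 is an arbitrary illustrative value):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, n_epochs=10, batch_size=32):
    """Update theta once per mini-batch, averaging the gradient over the batch."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            errors = y[batch] - X[batch] @ theta                    # y_i - h(x_i) over the batch
            theta = theta + alpha * (X[batch].T @ errors) / len(batch)
    return theta
```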

Here J is a convex quadratic function. The ellipses shown below are the contours of a quadratic function, along with sample trajectories taken by gradient descent for the three methods above.

From the above image, stochastic gradient descent follows a noisy, zig-zag path and tends to oscillate around the optimum rather than settling exactly on it, although each individual update is cheap. Batch gradient descent follows a smooth path toward the optimum, but every step requires a full pass over the dataset, which is slow when the dataset is large. Mini-batch gradient descent is the compromise: its path is less zig-zag than stochastic gradient descent, and it typically gets close to the optimum with far less computation per step than batch gradient descent.

“Why might the least-squares cost function J be a reasonable choice?” Let’s see the answer to this question in Part 2.
