Linear Regression — Deep View (Part 2)

Midhilesh elavazhagan
4 min read · Feb 20, 2019


I discussed the basic concepts of linear regression in Part 1. Let's start answering the questions raised there. Before getting into them, let's first understand what curve fitting is.

Curve fitting is the task of finding a straight line or a non-linear curve that best fits the data points.

Curve fitting, a probabilistic view

The goal in a curve fitting problem is to predict the target variable Y for some new value of the input variable X, on the basis of training data comprising n input values X = (x₁, …, xₙ), where the xᵢ are drawn from a known probability density family p(X|θ), and the corresponding target values Y = (y₁, …, yₙ). We can express our uncertainty over the value of the target variable using a probability distribution. For example, we can assume that, given the value of x, the corresponding value of y has a Gaussian distribution. Finding the optimal value of the parameter θ is essentially what curve fitting is.
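For example, under the Gaussian assumption the predictive distribution can be written as follows, where f(x, θ) denotes the fitted curve and σ² the noise variance (symbols used here purely for illustration):

$$p(y \mid x, \theta) = \mathcal{N}\!\left(y \mid f(x, \theta), \sigma^{2}\right)$$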

How to find optimal values for the parameter?

We cannot find the exact values of the parameters, but we can estimate values close to the actual parameters using a sample drawn from the population.

Methods to estimate the parameter values:

  1. Maximum likelihood estimation

l(θ|X) is the likelihood of the parameter θ given a sample X. Since the xᵢ are i.i.d., the likelihood is the product of the densities of the individual xᵢ. In maximum likelihood estimation, we are interested in the value of θ that makes the observed sample X most likely to have been drawn, so we search for the θ that maximizes l(θ|X). In practice we take the log of the likelihood and maximize this log-likelihood by setting its derivative to zero. Thus,
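$$l(\theta \mid X) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

$$\hat{\theta}_{ML} = \arg\max_{\theta} \log l(\theta \mid X), \qquad \text{obtained by solving } \frac{\partial}{\partial \theta} \log l(\theta \mid X) = 0$$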

Why do we need to take a log transformation?

Taking the log is a monotonic transformation, so it does not change which θ maximizes the likelihood, and it converts the product into a sum, which is much easier to differentiate: the derivative of a sum is the sum of the derivatives, whereas the derivative of a product of many terms is not.
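In symbols:

$$\arg\max_{\theta} l(\theta \mid X) = \arg\max_{\theta} \log l(\theta \mid X), \qquad \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$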

Consider the distribution p(y|x) to be Gaussian; its mean is θᵀx. As I mentioned in Part 1, θᵀx is our hypothesis, i.e. the predicted value for a given xᵢ. Equivalently, each target is the hypothesis plus zero-mean Gaussian noise, as shown below.
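Writing εᵢ for the noise term and σ² for its variance (symbols introduced here for readability):

$$y_i = \theta^{T}x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^{2})$$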

Since the probability distribution of y given x is Gaussian, the density of each target value can be written down explicitly.
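With mean θᵀxᵢ and variance σ², the density is:

$$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp\!\left(-\frac{(y_i - \theta^{T}x_i)^{2}}{2\sigma^{2}}\right)$$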

We need to find the optimal value for θ by using maximum likelihood.

Since all the samples are i.i.d., the likelihood of θ over the whole training set is the product of these individual densities:
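$$l(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp\!\left(-\frac{(y_i - \theta^{T}x_i)^{2}}{2\sigma^{2}}\right)$$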

By applying the log, the product is transformed into a sum:
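$$\log l(\theta) = n \log \frac{1}{\sqrt{2\pi\sigma^{2}}} - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left(y_i - \theta^{T}x_i\right)^{2}$$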

Hence, because the first term and the factor 1/σ² do not depend on θ, maximizing l(θ) is the same as minimizing
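$$\frac{1}{2} \sum_{i=1}^{n} \left(y_i - \theta^{T}x_i\right)^{2}$$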

which we recognize to be J(θ), our original least-squares cost function. Thus, under the Gaussian assumption about the data, least-squares regression is just maximum likelihood estimation of θ.
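As a small numerical sketch of this equivalence (my own illustration, not part of the original derivation; the synthetic data, variable names, and the NumPy/SciPy calls below are assumptions of the sketch), the θ obtained by ordinary least squares matches the θ obtained by maximizing the Gaussian log-likelihood directly:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: y = theta^T x + Gaussian noise (bias term + one feature)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, n)])
true_theta = np.array([1.5, -2.0])
sigma = 0.5
y = X @ true_theta + rng.normal(0.0, sigma, n)

# 1) Least squares: minimize J(theta) = 1/2 * sum_i (y_i - theta^T x_i)^2
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximum likelihood under Gaussian noise: minimize the negative
#    log-likelihood (sigma treated as known, so it only shifts and scales the objective)
def neg_log_likelihood(theta):
    resid = y - X @ theta
    return n * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

theta_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print("least squares estimate:     ", theta_ls)
print("maximum likelihood estimate:", theta_ml)  # agrees with least squares up to solver tolerance
```

The two estimates coincide because the negative log-likelihood differs from J(θ) only by an additive constant and a positive scale factor, neither of which changes the minimizer.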

There are two other ways to estimate the parameter values:

2. Maximum a posteriori (MAP) estimation

3. Bayes’ estimation

You can read more about these concepts in the reference books listed below.

My next blog deals with how to use matrix derivatives to find the minimum of the cost function J(θ).

References

  1. Introduction to Machine Learning by Ethem Alpaydın
  2. CS229: Machine Learning course by Andrew Ng (Stanford)
  3. Pattern Recognition and Machine Learning by Christopher M. Bishop
