Understanding the Cost Function for Linear Regression

Quantify the error between predicted values and expected values

Daketi Yatin
SRM MIC
Mar 22, 2021


Introduction

This article gives a mathematical explanation of the cost function for linear regression and shows how it works.

In the field of machine learning, linear regression is an important and frequently used concept. Linear regression is, at its core, an algorithm that predicts an output over a continuous range of values, given a training set.

The algorithm used for predicting the output is known as the hypothesis. It is a function that maps an input to an output, and we choose it on the basis of the given training set.

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Here, x₁, x₂, …, xₙ are the input features taken from the training set, y is the corresponding output, and θ₀, θ₁, …, θₙ are known as the parameters. The hypothesis is chosen so that its predictions are close to, or coincide with, the true outputs.

The above is the hypothesis for multiple linear regression. The hypothesis for a single variable linear regression is given by

hθ(x) = θ₀ + θ₁x₁

For different values of the parameters, the hypothesis makes different predictions. So, how do we choose a proper set of values for θ? One way is through the cost function.
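As a quick illustration, here is a minimal Python/NumPy sketch of the hypothesis; the function name and the parameter values are my own choices for demonstration, not code from the original article.

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis: h(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    return theta[0] + np.dot(theta[1:], x)

# Single-variable example with made-up values: theta_0 = 1, theta_1 = 0.5, x_1 = 4
print(hypothesis(np.array([1.0, 0.5]), np.array([4.0])))  # prints 3.0
```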

Cost function

The cost function can be defined as a measure of how accurate our hypothesis is. For linear regression it is the mean squared error between the predicted values and the true values (scaled by one half, as shown below).


We cannot simply keep assigning random values to the parameters and hope to arrive at an appropriate solution. The appropriate parameter values are the ones that minimise the cost function.

The cost function for regression is given by

J(θ₀, θ₁, …, θₙ) = (1/2m) · Σᵢ₌₁ᵐ ( hθ(xᵢ) − yᵢ )²

where m is the number of training examples, xᵢ is the i-th input, yᵢ is the corresponding true output, and hθ(xᵢ) is the value predicted by the hypothesis.

Interpretation of Cost function

If we look closely at the cost function above, the term inside the summation is the squared error: the difference between the hypothesis's prediction and the true output, squared. This error tells us how accurate the hypothesis is. If the error is large, the hypothesis fits the data poorly; if the error is small, it fits well. So, the smaller we can make the error, the more accurate the hypothesis we obtain for further predictions.

The (1/m) factor before the summation takes the mean over the m training examples. The extra (1/2) factor is not essential and may or may not be included; it is usually kept because it cancels the factor of 2 that appears when the derivative is taken, and it does not change which parameter values minimise the function.
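To make the formula concrete, here is a minimal single-variable sketch in Python with NumPy. The function name and the array-based interface are my own assumptions, not code from the original article.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta_0, theta_1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)                            # number of training examples
    predictions = theta0 + theta1 * x     # h_theta(x) for every example
    errors = predictions - y              # difference between prediction and true output
    return np.sum(errors ** 2) / (2 * m)  # mean of squared errors, halved
```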

Let us look into a few examples for better understanding.

Example

Say we are given a training set as follows

Example training set

Say, for a single-variable linear regression problem, the input values (xᵢ) and output values (yᵢ) are given. If we plot these values on a graph (input on the x-axis and output on the y-axis), we obtain the following graph.

i) The hypothesis for single-variable linear regression is a straight line. Let us first pick the parameter values at random.

Let θ₀ = 1 and θ₁ = (1/3)

The hypothesis can then be written as

hθ(x) = 1 + (1/3)x₁

The graph below shows this hypothesis together with the training set.

Showing hypothesis and training set for i)

The black line denotes the hypothesis, and the red lines denote the errors between the hypothesis and the output values.

Here, if we simply summed the differences between the hypothesis and the outputs instead of squaring them, the positive and negative errors would cancel and the total would come out to zero. But the error is clearly not zero: for the error to truly be zero, the hypothesis line would have to pass through every point of the training set. That is why we use the squared error rather than the raw differences.

Plugging these errors into the formula above gives the value of the cost function for this hypothesis.
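The original training-set figure is not reproduced here, so as an illustration let us assume three hypothetical points that lie on the line y = x (consistent with example ii) below, where the cost turns out to be zero). With θ₀ = 1 and θ₁ = 1/3, the signed errors cancel out but the squared errors do not:

```python
import numpy as np

# Hypothetical training set: three points lying exactly on the line y = x
x = np.array([0.0, 1.5, 3.0])
y = np.array([0.0, 1.5, 3.0])

theta0, theta1 = 1.0, 1.0 / 3.0
errors = (theta0 + theta1 * x) - y          # [1.0, 0.0, -1.0]

print(errors.sum())                         # 0.0  -> signed errors cancel
print(np.sum(errors ** 2) / (2 * len(x)))   # ~0.33 -> squared-error cost is not zero
```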

ii) Now let us consider another hypothesis for the same training set.

Let θ₀ = 0 and θ₁ = 1.

Then the hypothesis is

hθ(x) = x₁

Applying this hypothesis to the training set, we get

J(θ₀, θ₁) = 0.

As the cost function is a sum of squares, its minimum possible value is 0, so this hypothesis is more accurate than the previous one and, indeed, any other. But this does not mean that the minimum of the cost function is 0 for every training set; that only happens when the training examples lie exactly on a straight line. Otherwise, the minimum value is greater than zero.
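Continuing with the same hypothetical points, θ₀ = 0 and θ₁ = 1 reproduces the zero cost described above:

```python
import numpy as np

x = np.array([0.0, 1.5, 3.0])               # same hypothetical points as before
y = np.array([0.0, 1.5, 3.0])

theta0, theta1 = 0.0, 1.0
errors = (theta0 + theta1 * x) - y          # every error is 0: the line y = x passes through all points
print(np.sum(errors ** 2) / (2 * len(x)))   # 0.0
```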

The same concept applies when we move to multivariable linear regression and much more complex data sets. In practice, however, the parameters themselves are found with algorithms such as gradient descent or the normal equation.
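For completeness, here is a hedged sketch of batch gradient descent for the single-variable case; the learning rate, iteration count, and data are illustrative assumptions rather than values from the article.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Minimise J(theta_0, theta_1) with batch gradient descent (single variable)."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0               # arbitrary starting guess
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y
        grad0 = errors.sum() / m            # dJ/d(theta_0)
        grad1 = (errors * x).sum() / m      # dJ/d(theta_1)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# On points lying on y = x this converges towards theta_0 ≈ 0, theta_1 ≈ 1
x = np.array([0.0, 1.5, 3.0])
y = np.array([0.0, 1.5, 3.0])
print(gradient_descent(x, y))
```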

Conclusion

The hypothesis is determined by its parameters. They should be chosen so that the hypothesis is as close as possible to the output values, or even coincides with them. In practice, coinciding with every output is rarely possible, so we make sure the error is as small as we can get it.

Some real-world examples are as follows.

Real-world example

So, the cost function shows its significance in measuring the performance of a Machine Learning model on given data: it quantifies the error between predicted and expected values and presents it as a single real number, and that is why it holds an important place in the field of Machine Learning.

This article is written by Daketi Yatin, inspired by Prof. Andrew Ng's Machine Learning course.
