COST FUNCTION

Akshita Guru
5 min read · Feb 18, 2024


Welcome back! As we have already covered, a linear regression model is a form of supervised learning. Check out that article here if you haven’t already. In this article, we are going to talk about the cost function.

The cost function will tell us how well the model is doing so that we can try to get it to do better.

Suppose you have a training set that contains input features x and output targets y. The model you’re going to use to fit this training set is the linear function f_w,b of x equals w times x plus b. To introduce a little more terminology, w and b are called the parameters of the model. Sometimes you will also hear the parameters w and b referred to as coefficients or as weights.

Depending on the values you’ve chosen for w and b, you get a different function f of x, which generates a different line on the graph.

We’re going to take a look at what f of x looks like for a few choices of w and b; a small code sketch after the list reproduces the same numbers.

  • When w is equal to 0 and b is equal to 1.5, f is a horizontal line. In this case, the function f of x is 0 times x plus 1.5, so f is always a constant value: it always predicts 1.5 for the estimated value of y. Y hat is always equal to b, and here b is also called the y-intercept.
  • As a second example, if w is 0.5 and b is equal to 0, then f of x is 0.5 times x. When x is 0, the prediction is also 0, and when x is 2, the prediction is 0.5 times 2, which is 1. The value of w gives you the slope of the line, which is 0.5.
  • Finally, if w equals 0.5 and b equals 1, then f of x is 0.5 times x plus 1. When x is 0, f of x equals b, which is 1, so the line intersects the vertical axis at b, the y-intercept. Also, when x is 2, f of x is 2. Again, the line rises by 0.5 for every unit increase in x, so the value of w gives you the slope, which is 0.5.
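To make the three examples above concrete, here is a minimal sketch in plain Python (using only the toy values already stated) that evaluates f_w,b(x) = w*x + b for each choice of parameters:

```python
# Evaluate the linear model f_w,b(x) = w*x + b for the three (w, b) choices above.

def f(x, w, b):
    """Return the model's prediction w*x + b."""
    return w * x + b

examples = [
    (0.0, 1.5),  # horizontal line: always predicts b = 1.5
    (0.5, 0.0),  # slope 0.5, passes through the origin
    (0.5, 1.0),  # slope 0.5, y-intercept 1
]

for w, b in examples:
    print(f"w = {w}, b = {b}:  f(0) = {f(0, w, b)},  f(2) = {f(2, w, b)}")
```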

Just to remind you of some notation, a training example is written as x superscript i, y superscript i, where y^i is the target. For a given input x^i, the function f also produces a predicted value for y, and the value it predicts is called y hat i. For our choice of model, f of x^i is w times x^i plus b. Stated differently, the prediction y hat i is f_w,b of x^i, where for the model we’re using, f of x^i is equal to w times x^i plus b.
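To make this notation concrete, here is a minimal sketch (with made-up x and y values, not the dataset from the article) that computes the prediction y hat i for each training example under one candidate choice of w and b and compares it to the target:

```python
import numpy as np

# Made-up toy training set for illustration (not the 47-example dataset mentioned later).
x = np.array([1.0, 2.0, 3.0, 4.0])   # input features  x^(i)
y = np.array([2.0, 2.5, 3.5, 4.5])   # output targets   y^(i)

w, b = 0.5, 1.0                       # one candidate choice of parameters

y_hat = w * x + b                     # predictions y_hat^(i) = f_w,b(x^(i))

for i in range(len(x)):
    print(f"x^({i + 1}) = {x[i]},  target y^({i + 1}) = {y[i]},  prediction y_hat^({i + 1}) = {y_hat[i]}")
```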

Now the question is: how do you find values for w and b so that the prediction y hat i is close to the true target y^i for many, or maybe all, training examples x^i, y^i? To answer that question, let’s first take a look at how to measure how well a line fits the training data. To do that, we’re going to construct a cost function.

The cost function takes the prediction y hat and compares it to the target y by taking y hat minus y. This difference is called the error; it measures how far off the prediction is from the target.

Finally, we want to measure the error across the entire training set. In particular, let’s sum up the squared errors, from i equals 1 all the way up to m, and remember that m is the number of training examples, which is 47 for this dataset.

To build a cost function that doesn’t automatically get bigger as the training set gets larger, by convention we compute the average squared error instead of the total squared error, and we do that by dividing by m. We’re nearly there; just one last thing: by convention, the cost function that machine learning people use actually divides by 2 times m.

The extra division by 2 is just meant to make some of our later calculations look neater, but the cost function still works whether you include this division by 2 or not. This expression is the cost function, and we’re going to write J of w, b to refer to it. It is also called the squared error cost function, because you’re taking the square of these error terms.
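Putting this together, here is a minimal sketch of the squared error cost function as just described (the function name compute_cost and the toy data are illustrative choices, not from the article):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) = (1 / (2m)) * sum over i of (f_w,b(x^(i)) - y^(i))^2."""
    m = len(x)                 # number of training examples
    y_hat = w * x + b          # prediction for every example
    errors = y_hat - y         # error for each example
    return np.sum(errors ** 2) / (2 * m)

# Toy example with made-up data:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
print(compute_cost(x, y, w=0.5, b=1.0))   # prints 0.46875
```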

In machine learning, different people use different cost functions for different applications, but the squared error cost function is by far the most commonly used one for linear regression and, for that matter, for most regression problems, where it tends to give good results.

We can rewrite the cost function J of w, b as 1 over 2m times the sum from i equals 1 to m of the quantity f_w,b of x^i minus y^i, squared. Eventually, we’re going to want to find values of w and b that make the cost function small.
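To see what “small” means in practice, here is a quick sketch (using the same made-up data as above) that evaluates the cost for a few values of w with b held fixed; the value of w whose line best fits the points gives the smallest cost:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up inputs
y = np.array([2.0, 2.5, 3.5, 4.5])   # made-up targets

b = 1.0                              # hold b fixed and vary w
for w in [0.0, 0.5, 0.85, 1.5]:
    cost = np.sum((w * x + b - y) ** 2) / (2 * len(x))
    print(f"w = {w}: J(w, b) = {cost:.5f}")
```

For this made-up data, the cost is smallest at w = 0.85, the line that passes closest to all four points; noticeably larger or smaller values of w give a worse fit and a larger J.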

To recap: the cost function J measures the difference between the model’s predictions and the actual true values for y.

I hope you enjoyed learning about the cost function. Keep an eye out for further articles on related topics in future editions.

You can connect with me on the following:

Linkedin | GitHub | Medium | Email : akshitaguru16@gmail.com
