## Understanding and Calculating the Cost Function for Linear Regression

This post will focus on the properties and application of cost functions, how to solve it them by hand. Then we will implement the calculations twice in Python, once with `for`

loops, and once with `vectors`

using numpy. This goes into more detail than my previous article about linear regression, which was more a high level summary of the concepts.

When learning about linear regression in Andrew Ng’s Coursera course, two functions are introduced:

- the cost function
- gradient descent

At first I had trouble understanding what each was for. Together they form linear regression, probably the most used learning algorithm in machine learning.

## What is a Cost Function?

In the case of gradient descent, the objective is to find a line of best fit for some given *inputs*, or X values, and any number of Y values, or *outputs*. A cost function is defined as:

…a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.

from Wikipedia

In this situation, the event we are finding the cost of is the difference between estimated values, or the *hypothesis* and the real values — the actual data we are trying to fit a line to.

To state this more concretely, here is some data and a graph.

`╔═══════╦═══════╗`

║ X ║ y ║

╠═══════╬═══════╣

║ 1.00 ║ 1.00 ║

║ 2.00 ║ 2.50 ║

║ 3.00 ║ 3.50 ║

╚═══════╩═══════╝

Pretty boring graph. The goal here is to find a line of best fit — a line that approximates the values most accurately. Here are some random guesses:

`╔═══════╦═══════╦═════════════╦════════════╦════════════╗`

║ X ║ y ║ best_fit_1 ║ best_fit_2 ║ best_fit_3 ║

╠═══════╬═══════╬═════════════╬════════════╬════════════╣

║ 1.00 ║ 1.00 ║ 0.50 ║ 1.00 ║ 1.50 ║

║ 2.00 ║ 2.50 ║ 1.00 ║ 2.00 ║ 3.00 ║

║ 3.00 ║ 3.50 ║ 1.50 ║ 3.00 ║ 4.00 ║

╚═══════╩═══════╩═════════════╩════════════╩════════════╝

Making that beautiful table was really hard, I wish Medium supported tables. Anyway, we have three hypothesis — three potential sets of data that might represent a line of best fit. The slope for each line is as follows:

`best_fit_1`

: 0.5`best_fit_2`

: 1.0`best_fit_3`

: 1.5

`best_fit_2`

looks pretty good , I guess. But we are data scientists, we don’t guess, we conduct analysis and make well founded statements using mathematics.

## Understanding the Cost Function

Let’s do an analysis using the squared error cost function.

Remember a cost function *maps* *event or values of one or more variables onto a real number. *In this case, the event we are finding the cost of is the **difference between estimated values**, or the difference between the *hypothesis* and the real values — the actual data we are trying to fit a line to.

Let’s unwrap the mess of greek symbols above. On the far left, we have 1/2*m. m is the number of samples — in this case, we have three samples for X. Those are `1`

, `2`

and `3`

. So `1/2*m`

is a constant. It turns out to be `1/6`

, or `0.1667`

.

Next we have a sigma.

This means the sum. In this case, the sum from `i`

to `m`

, or `1`

to `3`

. We repeat the calculation to the right of the sigma, that is:

for each sample.

The actual calculation is just the hypothesis value for `h(x)`

, minus the actual value of `y`

. Then you square whatever you get.

The final result will be a single number. We repeat this process for all the hypothesis, in this case `best_fit_1`

, `best_fit_2`

and `best_fit_3`

. Whichever has the lowest result, or the lowest “cost” is the best fit of the three hypothesis.

Let’s go ahead and see this in action to get a better intuition for what’s happening.

## Calculating the Cost Function by Hand

Let’s run through the calculation for `best_fit_1`

.

`╔═══════╦═══════╦═════════════╗`

║ X ║ y ║ best_fit_1 ║

╠═══════╬═══════╬═════════════╣

║ 1.00 ║ 1.00 ║ 0.50 ║

║ 2.00 ║ 2.50 ║ 1.00 ║

║ 3.00 ║ 3.50 ║ 1.50 ║

╚═══════╩═══════╩═════════════╝

We know the 1/2m is simply 1/6, so we will focus on the summing the result of:

for `best_fit_1`

, where `i = 1`

, or the first sample, the hypothesis is `0.50`

. This is the `h_theha(x(i))`

part, or what we *think* is the correct value. The actual value for the sample data is `1.00`

. So we are left with `(0.50 — 1.00)^2`

, which is `0.25`

. Let’s add this result to an array called `results`

.

`results = [0.25]`

The next sample is `X = 2`

. The hypothesis value is `1.00`

, and the actual `y`

value is `2.50`

. So we get `(1.00 — 2.50)^2`

, which is `2.25`

. Add it to `results`

.

`results = [0.25, 2.25]`

Lastly, for `X = 3`

, we get `(1.50 — 3.50)^2`

, which is `4.00`

.

`results = [0.25, 2.25, 4.00]`

Finally, we add them all up and multiply by `1/6`

.

`6.5 * (1/6) = 1.083`

. The cost is `1.083`

. That’s nice to know, but we need some more costs to compare it to. Go ahead and repeat the same process for `best_fit_2`

and `best_fit_3`

. I already did it, and I got:

`best_fit_1: 1.083`

best_fit_2: 0.083

best_fit_3: 0.25

A lowest cost is desirable. A low costs represents a smaller difference. By minimizing the cost, we are finding the best fit. Out of the three hypothesis presented, `best_fit_2`

has the lowest cost. Reviewing the graph again:

The orange line, `best_fit_2`

, is the best fit of the three. We can see this is likely the case by visual inspection, but now we have a more defined process for confirming our observations.

## Use with Linear Regression

We just tried three random hypothesis — it is entirely possible another one that we did not try has a lowest cost than `best_fit_2`

. This is where Gradient Descent (henceforce GD) comes in useful. We can use GD to find the minimized value automatically, without trying a bunch of hypothesis one by one. Using the cost function in in conjunction with GD is called linear regression.

This will be the topic of a future post. For now, I want to focus on implementing the above calculations using Python.

## Calculating the cost function using Python (#1)

As promised, we perform the above calculations twice with Python. Once using `for`

loops, and once using vectors. Firstly, with `for`

loops. The focus of this article is the cost function, not how to program Python, so the code is intentionally verbose and has lots of comments to explain what’s going on.

Personally, the biggest challenge I am facing is how to take the theoretically knowledge and algorithms I learned in my undergraduate calculus classes (I studied electrical engineering) and turn them into working code.

The way I am breaking this barrier down is by really understanding what is going on when I see a equation on paper, and once I understand it (usually after doing several iterations by hand), it’s lot easier to turn into code. Hopefully this helps other people, too.

It is more common to perform the calculations “all at once” by turning the data set and hypothesis into matrices. This process is called *vectorization*.

## Calculating the cost function using Python (#2)

It’s a little unintuitive at first, but once you get used to performing calculations with vectors and matrices instead of for loops, your code will be much more concise and efficient. Here is the same calculation implemented with matrices using numpy.

Even without reading the code, it’s a lot more concise and clean.

We are using numpy, and defining `X`

and `y`

as `np.array`

. They provide many more properties for doing vector and matrices multiplication. There are two things to note:

- We prepend the
`X`

vector with a vector`1`

s. This is explained well by Andrew Ng here, in week 1’s “matrix matrix multiplication” lecture. Basically, when you multiply matrices, you need to have the correct dimensions. The vector of 1s allows for this. This is best explained using video, I recommend you watch it on Coursera. - The calculation we did in
`for`

loops previously is now expressed in two lines:

`inner = np.power(((X @ theta.T) - y, 2)`

Which is doing this:

and

`np.sum(inner) / 2 * len(X)`

which is doing

Again, I encourage you to sign up for the course (it’s free) and watch the lectures under week 1’s “linear algebra review”. It takes a while to really get a feel for this style of calculation. I went through and put a ton of `print`

statements, and inspected the contents of the array, as well as the `array.shape`

property to really understand what was happening.

## Conclusion

Researching and writing this really solidified by understanding of cost functions. I hope to write a follow up post explaining how to implement gradient descent, first by hand then using Python.