Machine Learning Path (III)
Linear Regression — Cost Function
In the last article, I introduced the basic concept of how Linear Regression works. Taking the prediction of students’ final exam scores from their midterm scores as an example, we can use something called a hypothesis function to predict the final result (the output feature) from the input features. The basic formula is shown below —
Additionally, we concluded that our objective is to find suitable parameters θ for our hypothesis function so that it matches the dataset well. In this article, we are going to explore something called the Cost Function (also referred to as the Loss Function or Error Function), which helps us develop our machine learning algorithm.
By the way, this series of articles is based on my notes from the course by Andrew Ng on Coursera. If you are interested, take the course online; I believe it can give you a more detailed explanation of Machine Learning algorithms.
The goal of the cost function is to help us measure the accuracy of the prediction. The formula of the cost function is presented below —
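The formula appears as a figure in the original article; in standard notation (as used in Andrew Ng’s course), the squared-error cost function for the hypothesis h_θ(x) = θ_0 + θ_1 x can be written as:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2$$

where m is the number of samples, x^{(i)} is the i-th input feature, and y^{(i)} is the i-th actual output.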
This function may seem complicated, so let me break it down into simple units to explain exactly what the cost function is calculating. This cost function is also called the squared error function, and you may recognize the mean squared error from the field of Statistics. Never heard of this term? That’s not a big deal; the following figure shows the meaning of the formula inside the summation —
Step 1. Calculate the Error of the Sample
From the illustration above, we know that inside the summation we are calculating the difference between each sample’s prediction and its actual result. This is called the cost, or the error, of the hypothesis function.
For instance, suppose a student’s midterm score is 70 (the input feature x), and we have a trained hypothesis function that predicts 80 (the predicted output h_θ(x)) as his or her final exam score. However, if the actual final exam score is 85 (the output feature y), then the error of the hypothesis function is just 80 - 85 = -5 points.
Step 2. Square the Error
However, the cost function also squares the error of the hypothesis. This yields a non-negative value, which ensures that minimizing the cost means shrinking the error as much as possible. The best possible result is an error of 0, which is a rare and ideal condition. In other words, we want the predicted result to be as close as possible to the actual result.
Step 3. Sum Up All Squared Errors
Recall from the previous article that the notation m represents the number of samples. The big sigma notation sums up the squared errors of all m samples.
Step 4. Take the Average of the Squared Errors
Lastly, the sum of the squared errors is divided by 2 * m, which gives us the average (or the mean) of the errors among the m samples. (The reason for dividing by 2 is for calculus purposes: it simplifies some of the calculations when we develop the machine learning algorithm later.) So that is where the concept of the cost function comes from: calculating the mean squared error of the samples.
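The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course; the θ values in the example are hypothetical, chosen so that a midterm score of 70 is predicted as 80, matching the numbers from Step 1.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta) with the 1/(2m) factor."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta0 + theta1 * x   # hypothesis prediction h_theta(x)
        error = h - y             # Step 1: error of the sample
        total += error ** 2       # Steps 2-3: square and sum
    return total / (2 * m)       # Step 4: divide by 2m

# Single-sample check with the numbers from Step 1:
# hypothetical theta0=10, theta1=1 predicts 70 -> 80; actual is 85.
# Error -5, squared 25, J = 25 / (2 * 1) = 12.5
print(cost(10, 1, [70], [85]))  # -> 12.5
```

Note that for a single sample, the “average” step just divides the one squared error by 2.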
Illustrate Cost Function J(θ)
First, in order to plot the cost function, let’s take the example from the previous article, where we had three samples of data mapping midterm scores to final exam scores (m = 3). You can try calculating the cost for different hypothesis functions yourself; in this particular example, we simplify by setting θ_0 equal to zero. The calculation is illustrated below —
From the diagram above, it is very clear that the perfect fit of the hypothesis gets zero mean squared error. But that’s not all: we can plot a graph whose x-axis is θ_1 and whose y-axis represents the value of the cost function J(θ), and that gives us an interesting result! (The abbreviation W.R.T. stands for “with respect to”.)
If you try to plot this graph, the result looks something like a parabola. In the field of machine learning, we often call this a convex function. A convex function comes with a bowl-like shape, and it contains a global optimum.
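You can trace this bowl shape numerically with a small sketch that sweeps θ_1 and prints J(θ_1) while keeping θ_0 fixed at zero. The three samples here are hypothetical stand-ins (the article’s actual data lives in the figure above), chosen so that the perfect fit is θ_1 = 1.

```python
# Hypothetical samples, chosen so the best fit is theta1 = 1 (J = 0 there).
xs = [60.0, 70.0, 80.0]
ys = [60.0, 70.0, 80.0]
m = len(xs)

def J(theta1):
    # Cost with theta0 simplified to zero, as in the article's example.
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Sweeping theta1 traces the bowl-shaped (convex) curve of J(theta1):
for theta1 in [0.5, 0.75, 1.0, 1.25, 1.5]:
    print(f"theta1={theta1:<5} J={J(theta1):.2f}")
```

The printed values fall toward θ_1 = 1 and rise again symmetrically on the other side, which is exactly the parabolic shape described above.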
So, we have come to another point —
Cost function is a kind of convex function.
However, be careful: not every cost function is this well behaved. A general, non-convex function may contain multiple local optima in addition to a single absolute (global) optimum. (The plural of optimum is optima.) If our algorithm gets stuck at a local optimum, it will not find the best fit of our data, resulting in a poor hypothesis. This problem is rather nasty. Thus, we have another important point that needs to be addressed —
Training a good model is the same as finding the global optimum of the cost function. This kind of problem is called convex optimization.
By the way, convex optimization is not only applied in the field of machine learning; it is widely used in several other fields, such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, finance, statistics (optimal design), and structural optimization. (See more on Wiki)
So, we have walked through the definition of the Cost Function and its calculation process, and lastly plotted the figure, which shows that the cost function has the properties of a convex function. In other words —
Solving the convex optimization problem is also an important issue in the field of machine learning.
New Directive — Minimize the Cost Function
To conclude from the meaning of the cost function: our next step is to find the parameters θ with the minimum cost.
Some of you may have experience with Calculus. You may think that the next step is to take the derivatives of the cost function and set them to zero. Because we have two theta parameters, we need to take the partial derivative with respect to each of them:
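The equation in the original figure is not reproduced here, but the resulting conditions are standard: differentiating the squared-error cost J(θ_0, θ_1) with respect to each parameter and setting the result to zero gives

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\big(x^{(i)}\big) - y^{(i)}\right) = 0$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\big(x^{(i)}\big) - y^{(i)}\right)x^{(i)} = 0$$

Solving these two equations simultaneously for θ_0 and θ_1 yields the closed-form solution of linear regression.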
This is to let you know that you can solve the problem analytically. If you solve the partial derivatives for the parameters, we refer to that result as the Normal Equation of linear regression based on the cost function. However, the details of the Normal Equation will be discussed in later sections, when we talk about multivariate linear regression and become familiar with vectorization.
By the way, we won’t discuss Calculus in this series of articles; you can try to derive the equation yourself or just memorize the conclusion of this section. There is no need to panic about how this equation is derived. Instead, understand the intuition behind what these equations give you, and think of them as tools to use for now.
Last but not least, in this article you learned the concept of the Cost Function in linear regression —
Cost Function measures the accuracy of the hypothesis function
Following the steps listed below, you can compute the cost of the hypothesis function —
- Calculate the error between the predicted and the actual value
- Square the error of each sample
- Sum up all of the squared errors
- Take the average of the sum of the squared errors (divide by 2 * m)
Plotting out the graph of the cost function, we concluded that —
The cost function is a kind of convex function. The best fit of the hypothesis appears at the global minimum of the convex function. Finding the global optimum of a convex function is called convex optimization.
Solving this kind of problem analytically using Calculus (taking the partial derivatives of the cost function and setting them to zero) yields something called the Normal Equation, which will be presented in a future article.
Instead of introducing the Normal Equation, the next article will use the cost function to derive our first machine learning algorithm, called Gradient Descent. It is an iterative algorithm that helps us move the parameters θ of the hypothesis function closer to the best fit of the dataset. Okay, see you in the next article ~