Linear Regression (LR) is one of the main algorithms in Supervised Machine Learning. It solves many regression problems and is easy to implement. This paper is about Univariate Linear Regression (ULR), which is the simplest version of LR.
The paper covers the following topics:
- The basics of datasets in Machine Learning;
- What is Univariate Linear Regression?
- How to represent the algorithm (hypothesis), graphs of functions;
- Cost function (Loss function);
- Gradient Descent.
The basics of datasets in Machine Learning
In ML problems, some data is provided beforehand to build the model upon. A dataset consists of rows and columns. Each row represents an example, while each column corresponds to a feature.
Then the data is divided into two parts: a training set and a test set. A common split puts approximately 75% of the data in the training set and the remaining 25% in the test set. The training set is used to build the model. Once the model reaches a success rate of about 90–95% on the training set, it is evaluated on the test set. The result on the test set is considered more valid, because the data in the test set is completely new to the model.
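The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `train_test_split`, the fixed `seed`, and the toy dataset are assumptions made for the example, not something prescribed by the text.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 20 rows of (feature, target)
dataset = [(x, 2 * x + 1) for x in range(20)]
train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))  # 15 5
```

Shuffling before splitting matters when the rows are stored in some order (for example sorted by salary); otherwise the test set would not be representative of the whole dataset.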
What is Univariate Linear Regression?
In Machine Learning problems, the complexity of the algorithm depends on the provided data. When LR is used to build the ML model, it is called Univariate LR if the training set has exactly one feature, and Multivariate LR if it has more than one. To learn Linear Regression, it is a good idea to start with Univariate Linear Regression, as it is simpler and better suited to building a first intuition about the algorithm.
To build intuition about the algorithm, I will explain it with an example: a dataset on Employee Satisfaction and Salary level.
As can be seen from the picture, there is a linear dependence between the two variables. Here Employee Salary is the “X value”, and Employee Satisfaction Rating is the “Y value”. In this particular case there is only one feature (the X value), so Univariate Linear Regression can be used to solve this problem.
In the following picture you will see three different lines.
This is an already implemented ULR example, but we have three candidate solutions and need to choose only one of them. Visually, we can see that Line 2 is the best one, because it fits the data better than both Line 1 and Line 3. This is a rather easy decision to make, and most problems will be harder than that. The following paragraphs explain how to make such decisions precisely, with the help of mathematical equations.
Now let’s see how to represent the solution of Linear Regression Models (lines) mathematically:
hθ(x) = θ0 + θ1x
- hθ(x) — the answer of the hypothesis
- θ0 and θ1 — parameters we have to calculate to fit the line to the data
- x — the point from the dataset
This is exactly the same as the equation of a line, y = mx + b, with θ1 playing the role of the slope m and θ0 the role of the intercept b. As the solution of Univariate Linear Regression is a line, the equation of a line is used to represent the hypothesis (solution).
Let’s look at an example. Suppose the training set contains the point (x = 1.9; y = 1.9) and the hypothesis is h(x) = -1.3 + 2x. Applying this hypothesis to the point gives h(1.9) = -1.3 + 2 · 1.9 = 2.5.
Once the answer is obtained, it should be compared with the y value (1.9 in the example) to check how well the equation works. In this example there is a difference of 0.6 between the real value y and the value of the hypothesis. For this particular case 0.6 is a big difference, which means we need to improve the hypothesis so that it fits the dataset better.
But here comes the question: how can the value of h(x) be manipulated to make it as close as possible to y? To answer it, let’s analyze the equation. It contains three quantities: θ0, θ1, and x. The value of x comes from the dataset, so it cannot be changed (in the example the point is (1.9; 1.9), and if you get h(x) = 2.5, you cannot change the point to (1.9; 2.5)). So we are left with only two parameters, θ0 and θ1, to optimize. Two tools play important roles in this optimization: the Cost function, which measures how well the hypothesis fits the data, and Gradient descent, which improves the solution.
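The arithmetic of this example can be reproduced with a short Python sketch. The function name `hypothesis` and the variable names are illustrative choices, not part of the original example.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

x, y = 1.9, 1.9                        # the training point from the example
prediction = hypothesis(-1.3, 2.0, x)  # h(x) = -1.3 + 2x
error = prediction - y                 # how far the hypothesis is from the real y
print(round(prediction, 2), round(error, 2))  # 2.5 0.6
```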
Cost function (Loss function)
In the examples above, we made some comparisons to determine whether a line fits the data or not. In the first one, it was just a choice between three lines; in the second, a simple subtraction. But how do we evaluate models for complicated datasets? This is where the Cost function comes to our aid. Put simply, the Cost function evaluates how well the model (a line in the case of LR) fits the training set. There are various versions of the Cost function, but we will use the one below for ULR:
- m — number of examples in training set;
- h — answer of hypothesis;
- y — y values of points in the dataset.
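Written out with the symbols listed above, one common form of this Cost function is the halved mean of squared errors shown below. The name J and the 1/(2m) scaling are assumptions here (other versions use 1/m or no scaling at all); the scaling does not change which line gives the smallest cost.

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```

Here (x^{(i)}, y^{(i)}) denotes the i-th example of the training set.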
The optimization level of the model is related to the value of the Cost function: the smaller the value, the better the model. Why? The answer is simple: the Cost is based on the sum of the squared differences between the value of the hypothesis and y. If all the points were on the line, there would be no difference and the answer would be zero. Conversely, if the points were far away from the line, the answer would be a very large number. To sum up, the aim is to make it as small as possible.
So, from this point, we will try to minimize the value of the Cost function.
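As a rough sketch of how such a Cost function can be computed, the Python snippet below uses the halved mean of squared errors. The 1/(2m) scaling, the function name `cost`, and the three sample points are assumptions made for illustration.

```python
def cost(theta0, theta1, points):
    """Halved mean of squared differences between h(x) and y (assumed 1/(2m) scaling)."""
    m = len(points)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in points)
    return sum(squared_errors) / (2 * m)

# Hypothetical training set whose points all lie on the line y = x
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

perfect = cost(0.0, 1.0, points)  # the line y = x passes through every point
worse = cost(0.0, 0.5, points)    # the line y = 0.5x misses every point
print(perfect)                    # 0.0
print(worse > perfect)            # True
```

The line that passes through every point has zero cost, and any line that misses the points has a strictly larger cost, which is exactly the behaviour described above.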
Gradient Descent
To get a proper intuition about the Gradient Descent algorithm, let’s first look at some graphs.
This graph shows how the Cost function depends on theta. As mentioned above, the optimal solution is where the value of the Cost function is at its minimum. In Univariate Linear Regression, the graph of the Cost function against a single parameter is a parabola (against θ0 and θ1 together it is a bowl-shaped surface), and the solution is its minimum.
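The parabolic shape can be checked numerically with a small sketch. It assumes the halved mean-squared-error cost, keeps θ0 fixed at 0 so that only θ1 varies, and uses a made-up dataset lying on the line y = 2x; all names are illustrative.

```python
def cost(theta1, points, theta0=0.0):
    """Cost as a function of theta1 only, with theta0 held fixed (assumed 1/(2m) scaling)."""
    m = len(points)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in points) / (2 * m)

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical data on the line y = 2x

thetas = [i * 0.5 for i in range(9)]           # 0.0, 0.5, ..., 4.0
costs = [cost(t, points) for t in thetas]
best = thetas[costs.index(min(costs))]         # theta1 with the smallest cost

for t, c in zip(thetas, costs):
    print(f"theta1={t:.1f}  cost={c:.3f}")
print("minimum at theta1 =", best)
```

The printed costs fall until θ1 reaches 2 and then rise again symmetrically, which is the parabola with its minimum at the best-fitting slope.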
Gradient Descent is the algorithm that finds this minimum:
- α — learning rate;
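In its usual textbook form, which matches the symbols discussed in the steps that follow (':=', 'j', alpha, the partial derivative and the minus sign), the update rule can be written as below. The name J for the Cost function is an assumption of this sketch.

```latex
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
```

The rule is applied repeatedly to each parameter θj (θ0 and θ1 in the univariate case) until the values stop changing noticeably.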
The equation may seem a little confusing, so let’s go over it step by step.
1. What is this symbol, ‘:=’?
- First of all, it is not the same as ‘=’. ‘:=’ means “update the value on the left side with the value on the right side”. It is not possible to use ‘=’ here mathematically, because a number cannot be equal to itself minus something else (unless that something is zero).
2. What is ‘j’?
- ‘j’ is the index of the parameter being updated. The number of parameters equals the number of features + 1. In Univariate Linear Regression there is only one feature, so there are two parameters, θ0 and θ1, and j takes the values 0 and 1.
3. What is ‘alpha’?
- ‘alpha’ is the learning rate. It is a positive number, usually between 0.001 and 0.1. If it is too high, the algorithm may ‘jump’ over the minimum and diverge from the solution. If it is too low, convergence will be slow. In most cases several values of ‘alpha’ are tried and the best one is picked.
4. The partial derivative term.
- Cost function mentioned above:
- Cost function with definition of h(x) substituted:
- Derivative of Cost function:
5. Why is the derivative used, and why is the sign before alpha negative?
- The value of the derivative is the slope of the Cost function at the current theta. The example graphs below show why the derivative is so useful for finding the minimum.
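The three expressions listed under step 4 (the Cost function, the Cost function with h(x) substituted, and its derivatives) can be written out as follows. This assumes the Cost function with the 1/(2m) scaling and the name J; with a different scaling only the constant factor in front of the sums changes.

```latex
\begin{aligned}
J(\theta_0, \theta_1) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
J(\theta_0, \theta_1) &= \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \\
\frac{\partial J}{\partial \theta_0} &= \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) \\
\frac{\partial J}{\partial \theta_1} &= \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}
\end{aligned}
```

The factor 2 produced by differentiating the square cancels the 1/2 in front of the sum, which is the usual reason for including that 1/2 in the first place.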
In the first graph above, the slope (the derivative) is positive. As can be seen, the point where the line touches the parabola should move to the left in order to reach the optimum. For that, the X value (theta) should decrease. Now recall the Gradient descent equation: alpha is positive, the derivative is positive (in this example), and the sign in front is negative. Overall the update term is negative, so theta will be decreased.
In the second example, the slope (the derivative) is negative. As can be seen, the point where the line touches the parabola should move to the right in order to reach the optimum. For that, the X value (theta) should increase. Recall the Gradient descent equation again: alpha is positive, the derivative is negative (in this example), and the sign in front is negative. Overall the update term is positive, so theta will be increased.
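Putting the pieces together, a minimal Gradient Descent loop for ULR might look like the sketch below. It assumes the halved mean-squared-error cost (so each gradient is the mean of the errors, multiplied by x for θ1), starts both parameters at zero, and runs a fixed number of iterations; the function name, the default values, and the sample data are illustrative assumptions.

```python
def gradient_descent(points, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by repeatedly stepping against the gradient."""
    theta0, theta1 = 0.0, 0.0          # assumed starting point
    m = len(points)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in points]
        grad0 = sum(errors) / m        # derivative of the cost with respect to theta0
        grad1 = sum(e * x for e, (x, _y) in zip(errors, points)) / m  # ... theta1
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Hypothetical data lying exactly on the line y = 1 + 2x
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta0, theta1 = gradient_descent(points)
print(round(theta0, 3), round(theta1, 3))  # approximately 1.0 2.0
```

On this data a much larger learning rate, such as alpha = 1.0, makes every step overshoot the minimum and the parameters grow without bound, which illustrates the 'jumping over the minimum' behaviour described in step 3.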
The coming section will be about Multivariate Linear Regression.