Linear Regression in ‘Simple English’
In this article, we’ll go over simple linear regression in layman's terms, in fewer than 1,000 words. It’s easy to drown in the ocean of variables, equations, metrics, and evaluation formulas when you first try to understand what Linear Regression is. Hopefully, by the end of this article, you’ll understand how Linear Regression works and be able to use it on your own datasets.
‘Linear Regression’ literally means describing the relationship between the values of one variable and the corresponding values of other variables with a straight line.
We use simple linear regression when working with quantitative variables, to define a relationship between two of them. The variables can seem completely unrelated, like the height of a person and their salary (if our data is about salaried individuals), or clearly related, like the number of pencils produced by a machine and the number of hours the machine was running. The idea is to statistically establish the equation of a straight line that quantifies this relationship.
This established relationship will then help us ascertain a close approximation of the dependent variable for a given value of the independent variable. In other words, the main reason why linear regression is used is to predict a continuous value.
For example, suppose we plot our data with x (the independent variable) against y (the dependent variable).
We could, just by looking at the data, say that there's an upward trend where our y values tend to increase as the x values increase. This trend could be described by a straight line drawn through the points.
The line can then be extended further to help us predict the values of y for new values of x, such as x = 15 or x = 20.
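The idea of fitting a line and extending it to new x values can be sketched in a few lines of Python. The data points below are made up for illustration and are not taken from this article; the helper names `fit_line` and `predict` are hypothetical, and the fit uses the standard least-squares formulas for a straight line.

```python
# Hypothetical data with an upward trend (illustrative values only)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]


def fit_line(xs, ys):
    """Least-squares fit of a straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (how x and y vary together) / (how x varies on its own)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(xs, ys)


def predict(new_x):
    """Read the y value off the fitted line for a new x."""
    return intercept + slope * new_x


# Extend the line beyond the observed x values
for new_x in (15, 20):
    print(f"x = {new_x}: predicted y = {predict(new_x):.2f}")
```

Because the made-up data roughly follows y = 2x, the fitted slope comes out close to 2, and the predictions for x = 15 and x = 20 simply continue that trend.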
This is pretty much how simple linear regression works, except that we now need to quantify it as a linear equation in two variables.
Let us see what the simple linear regression equation looks like and then break it down:
y = α + βx + ε
y is the dependent variable
x is the independent variable
α is the y-intercept
β is the slope of the straight line
ε is the error
Let's go through each part:
The dependent variable is what we are trying to model as a function of the independent variable. In our earlier example with the machine, the number of units produced by the machine (y) can be a function of the number of hours (x) the machine was running.
The independent variable is what impacts the value of the dependent variable. In simple linear regression, this impact is modeled as a straight line.
The y-intercept tells us where on the XY plane our straight line meets the Y-axis, that is, the value of y when x = 0. It is also known as the constant in the equation, because it does not change with the value of x.
The slope tells us how steeply our straight line inclines or declines. It can be an upward-sloping line (slope > 0), a flat line (slope = 0), or a downward-sloping line (slope < 0). The value of β also gives the magnitude of the slope: a larger positive value means a steeper upward incline, and a larger negative value means a steeper downward decline.
ε is the error for a given observation: the difference between the observed value of y and the value predicted by the line.
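To make the parts of the equation concrete, here is a minimal Python sketch based on the machine example. The values chosen for α and β, and the single observation, are hypothetical numbers picked only for illustration.

```python
# Hypothetical coefficients for the pencil machine (assumed, not estimated from real data)
alpha = 50.0   # y-intercept: where the line meets the Y-axis (x = 0)
beta = 120.0   # slope: change in pencils produced per additional hour


def predict(hours):
    """Point on the straight line for a given number of hours: alpha + beta * x."""
    return alpha + beta * hours


# One hypothetical observation: the machine ran 8 hours and produced 1,030 pencils
hours_observed = 8
pencils_observed = 1030

predicted = predict(hours_observed)        # value given by the line
epsilon = pencils_observed - predicted     # error: observed minus predicted

print(f"predicted y = {predicted}, observed y = {pencils_observed}, error = {epsilon}")
```

Adding the error back to the prediction recovers the observed value, which is exactly what y = α + βx + ε says.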
Now we know what regression is and why we use it. We also know what the equation looks like and what it means.
It is equally important to know ‘how’ any of this helps us. What does the regression equation tell us?
For this let’s take an example equation and break it down.
Suppose our equation is:
y = 1 + 3x
(we leave out the error term here for simplicity)
Here, the y-intercept is 1 and the slope of the straight line is 3 (positive slope showing an upward incline)
Unlike the y-intercept, the slope here tells us exactly what we need to know about the relationship of the two variables in consideration.
Slope works on a “rise over run” basis.
In our example, since the slope is 3, it can be written as 3/1 and interpreted as follows: a 1 unit increase in the independent variable (x) corresponds to an increase of 3 units in the dependent variable (y). If the slope were -3, a 1 unit increase in x would correspond to a decrease of 3 units in y. These interpretations hold only for our dataset and should not be applied to a similar situation outside the data, as that can lead to misleading predictions.
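The “rise over run” reading of the slope can be checked with a short Python sketch. The function name `line` is hypothetical; its defaults match the example equation y = 1 + 3x.

```python
def line(x, intercept=1, slope=3):
    """y = intercept + slope * x; defaults to the example y = 1 + 3x."""
    return intercept + slope * x


# Rise over run: change in y divided by change in x
run = 5 - 4
rise = line(5) - line(4)
print("slope  3 -> rise over run:", rise / run)

# With a slope of -3, a 1 unit increase in x lowers y by 3 units
change = line(5, slope=-3) - line(4, slope=-3)
print("slope -3 -> change in y for +1 in x:", change)
```

The y-intercept drops out of both differences, which is why the slope alone describes how y responds to a change in x.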
The error term comes into the picture when our prediction of y differs from the observed value of y; this difference is the ε in the regression equation.
Metrics built from these errors, such as the sum of errors and the sum of squared errors, are used to determine how well our regression line fits the given data. The lower the error, the better the line fits. We can also add more independent variables to our regression to reduce the error, which turns it into a multiple linear regression.
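As a rough illustration of these metrics, the sketch below compares two candidate lines on a small made-up dataset. The data values and function names are hypothetical; the sum of squared errors and the mean squared error are standard metrics, used here as examples rather than as the only options.

```python
# Hypothetical observations (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [4.2, 6.8, 10.1, 13.3, 15.6]


def residuals(intercept, slope):
    """Observed y minus predicted y for every data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_of_squared_errors(intercept, slope):
    """Squaring stops positive and negative errors from cancelling out."""
    return sum(e ** 2 for e in residuals(intercept, slope))


def mean_squared_error(intercept, slope):
    """Average squared error per observation."""
    return sum_of_squared_errors(intercept, slope) / len(xs)


# Compare two candidate lines: y = 1 + 3x and y = 2 + 2x
for a, b in [(1, 3), (2, 2)]:
    print(
        f"y = {a} + {b}x: "
        f"sum of errors = {sum(residuals(a, b)):.2f}, "
        f"SSE = {sum_of_squared_errors(a, b):.2f}, "
        f"MSE = {mean_squared_error(a, b):.3f}"
    )
```

For y = 1 + 3x the plain sum of errors is close to zero even though the individual errors are not, because positive and negative errors cancel. This is one reason squared errors are usually preferred when judging the fit.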
For more information on linear regression:
For more on interpreting regression lines: https://www.dummies.com/article/academics-the-arts/math/statistics/how-to-interpret-a-regression-line-169717
For more on error metrics: