Simple Linear Regression — Explained

Trifunovic Uros
Published in Analytics Vidhya
7 min read · Mar 7, 2021

Photo by Markus Winkler on Unsplash

Linear regression allows us to explore how changing one variable affects another. The applications are endless and spread across industries. For example, linear regression can help us gain a better understanding of the following questions:

  • A country’s gross domestic product increases: What is the effect on inbound tourism?
  • Total greenhouse gas emissions increase globally: What is the effect on the global mean temperature?
  • A company decreases the production of its least profitable product: What is the effect on total revenue?
  • A person increases their fat intake: What is the effect on their body fat percentage?

As the name implies, linear regression assumes a linear relationship between two variables. Therefore, this linear relationship can be explained with a straight line.

The chart below shows an increase in inbound tourism over the years, accompanied by an increase in the country's GDP. The actual data points are connected with a blue line, while the grey line explains the linear relationship between the variables.

The slope of this line tells us by how much we can expect inbound tourism in the United States to change for a one-unit change in GDP.

As we are attempting to explain the change in inbound tourism given GDP, inbound tourism is our dependent variable. That is, we claim that inbound tourism depends on another variable — the change in GDP in this case. As GDP does not depend on any other variable in this simple example, it is our independent variable.

Generally speaking, the slope is expressed as β1, change as Δ, while the dependent and independent variables are represented as Y and X, respectively. Hence, the change in Y for a one-unit change in X, captured by the slope, is formulated as:
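Written out, with β1 denoting the slope (the same coefficient named in the model summary later in the article), the relationship is:

```latex
\beta_1 = \frac{\Delta Y}{\Delta X}
```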

We might wonder the following: would there be any inbound tourism if the United States had no GDP? This is an abstract scenario considering our example, but sometimes such information is valuable. This hypothetical value is captured by the intercept. The intercept is expressed as β0 and tells us the expected value of the dependent variable Y when the independent variable X has a value of 0.

Putting together what we have so far, our linear model looks like this:
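With the intercept and the slope in place, the model described so far can be written as:

```latex
Y = \beta_0 + \beta_1 X
```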

The slope can take on both positive and negative values. The former would indicate a positive relationship between the two variables, whereas the latter would suggest the opposite.

Let’s say that the slope in our example takes on a value of 0.015. This would confirm the positive relationship between inbound tourism and GDP we observed on the chart earlier. As GDP is expressed in trillions while inbound tourism is in billions of USD, the slope suggests that if the United States GDP increases by $1T, the country’s inbound tourism would increase by $15B (0.015 × 10¹² = 15 × 10⁹).
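The unit arithmetic in the parentheses can be verified directly (the slope value is the illustrative one from the example above):

```python
slope = 0.015       # illustrative estimated slope from the example
delta_gdp = 1e12    # a $1 trillion increase in GDP, in USD

# Expected change in inbound tourism, in USD (roughly $15 billion)
delta_tourism = slope * delta_gdp
```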

However, is it possible that a country’s GDP is not the only factor that has an effect on its inbound tourism? Would the number of national parks a country has, its access to beaches, sports events, and cultural events play a factor? They sure would.

To account for all of the factors other than GDP that have an effect on inbound tourism, we add another variable to our linear model. This variable, which groups all of those other factors, is known as the error term and is generally expressed as uᵢ.

Finally, we have the linear regression model with a single regressor, written as follows:
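In symbols, the model with a single regressor is:

```latex
Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \dots, n
```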

Where:

  • Y is the dependent variable or the regressand
  • X is the independent variable or the regressor
  • β0 is the intercept
  • β1 is the slope
  • uᵢ is the error term

Now that we have the linear regression model, the question is how do we estimate the straight line to depict the relationship between the independent and the dependent variables?

Looking at our chart without the line, it is clear that we could draw an infinite number of straight lines through the data. But which one is the best for explaining the effect a change in GDP has on a country’s inbound tourism?

To answer this, we turn to the Ordinary Least Squares (OLS) estimator. The estimator picks the intercept and the coefficient in a way that minimizes the difference between the data points and the regression line. In other words, the line is as close to the data as possible.

As shown on the chart below, some of the data points are above the estimated line while others are below it. Simply taking the difference between each observed and estimated data point would make them partially offset each other, since some differences would be positive while others would be negative. The OLS resolves this issue by squaring the differences, making all the values positive. The squared differences are then added together to give an overall measure of the line’s closeness to the data.
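As a sketch of what the estimator does, the closed-form OLS solution for a single regressor can be computed directly. The data below is made up for illustration; it is not the article's tourism series:

```python
def ols_fit(x, y):
    """Closed-form OLS estimates for a single regressor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Made-up example data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_fit(x, y)
```

These two formulas are exactly the minimizers of the sum of squared differences described above.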

In short, when doing linear regression we are estimating the value of the dependent variable Y, given the independent variable X. We do so by estimating the intercept and the slope that result in the line of best fit, i.e. the line that minimizes the squared differences between the observed and predicted values. OLS notation uses the hat symbol (^) to distinguish the estimated values from the observed ones. Therefore, the linear model becomes:
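With hats marking the estimated quantities, the fitted model reads:

```latex
Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i
```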

Where the last variable represents the residual, i.e. the difference between the observed and predicted values.

After we have formulated the model and established the line of best fit, how do we measure whether our model is any good? The OLS uses the regression R² and the standard error to answer this question.

The R² takes on a value between 0 and 1 and tells us how much of the change in the dependent variable can be explained by the change in the independent one. Generally, higher values of R² are preferable.

The standard error informs us about the spread of the observations around the regression line. The further away the observations are from the line of best fit, the less of the variation in Y can be explained by the variation in X.
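Both measures of fit can be sketched from the residuals. The n − 2 degrees-of-freedom correction in the standard error of the regression reflects the two estimated coefficients; the data passed in would be the observed and fitted values:

```python
def goodness_of_fit(y, y_hat):
    """R-squared and the standard error of the regression (SER)."""
    n = len(y)
    y_mean = sum(y) / n
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squared residuals
    tss = sum((yi - y_mean) ** 2 for yi in y)              # total sum of squares
    r_squared = 1 - ssr / tss
    # SER: residual spread around the line, with n - 2 degrees of freedom
    ser = (ssr / (n - 2)) ** 0.5
    return r_squared, ser
```

A perfect fit gives R² = 1 and a standard error of 0; the further the observations sit from the line, the lower the R² and the larger the standard error.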

Example R² and Standard error

The measures of fit from our example tell us that ~84% of the variation in U.S. inbound tourism can be explained by the change in the country’s GDP. The standard error is expressed in terms of the dependent variable. Therefore, the standard error of 1.33341e+10, or about $13.3B, means that there is a relatively small spread around the line of best fit, as measured in billions of USD.

Lastly, there are three OLS assumptions that have to hold for it to provide the appropriate estimators. These assumptions are:

  1. The conditional distribution of uᵢ given Xᵢ has a mean of zero — Remember how we grouped all the factors other than GDP that might have an effect on the country’s inbound tourism? The first assumption simply means that these other factors are not systematically related to the independent variable, GDP in this case. In other words, no matter what value GDP takes on, the conditional mean of these other factors is zero.
  2. (Xᵢ, Yᵢ), i=1,…,n, are independently and identically distributed — the second assumption deals with how the sample is drawn. For the OLS to provide appropriate estimators, the sample of data has to be drawn at random from a large population. Let’s say we wanted to estimate the effect age has on height. It would be valid to randomly take a sample of people of different ages and do the analysis. On the other hand, purposely selecting basketball players as our sample would lead to inaccurate results.
  3. Large outliers are unlikely — outliers are defined as observations whose values are far away from the usual range. Therefore, their existence in the data can lead to misleading outcomes. For example, we have a small dataset consisting of three employees and their salaries. The employees make $50k, $60k, and $70k, respectively. To calculate the average salary, we would simply add the three salaries and divide by the number of employees, resulting in $60k. Adding the CEO and his $300k salary to our data would drastically impact the average salary and inflate it to $120k.
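The salary arithmetic in the outlier example can be checked in a few lines:

```python
salaries = [50_000, 60_000, 70_000]
mean_salary = sum(salaries) / len(salaries)  # $60k

# Adding one large outlier (the CEO) doubles the average
salaries_with_ceo = salaries + [300_000]
mean_with_ceo = sum(salaries_with_ceo) / len(salaries_with_ceo)  # $120k
```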

The article gives an overview of simple linear regression and explains the basic concepts through examples. The upcoming articles will tackle scenarios where we include numerous independent variables in an attempt to explain the change in the dependent one, that is, linear regression with multiple regressors. Meanwhile, check out the previous article if you could use a refresher on probability or are curious to learn more.

Feel free to reach out with any questions and comments on LinkedIn.

Thank you for reading.

Works Cited:

Stock, J. H., & Watson, M. W. (2017). Introduction to Econometrics (3rd custom ed. for Baruch College, pp. 109–129). Pearson Education.
