An Introduction to Linear Regression
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables)¹. In particular, the case where there is only one explanatory variable is known as Simple Linear Regression.
We’ll define what a linear regression is, explain the assumptions behind it, and cover the basic statistical concepts needed to talk about the performance and evaluation of linear regression models. Finally, we’ll implement two examples using Python. It will be very handy if you have some knowledge of basic descriptive statistics, inferential statistics, and hypothesis testing, but if you don’t have these skills, don’t worry: you’ll still be able to understand everything.
Simple and Multiple Linear Regression
A Simple Linear Regression is a linear model with a single explanatory variable. It is used when we want to model the relationship of one variable with another. An example of a Simple Linear Regression could be trying to predict the IRA (Academic Performance Index, like the GPA) score of Brazilian college students based on their ENEM (standardized college admission test in Brazil, like the SAT) score. Or predicting the prices of houses using only their size as an input to the model. Commonly, the equation that describes this kind of relationship is written as follows:
y = b₀ + b₁⋅ x₁, where b₀ is called intercept, and b₁ is the coefficient of x₁, which is the input (independent) variable.
For our IRA vs ENEM example, we could write IRA = b₀ + b₁⋅ENEM. So, once we calculate both parameters (b₀ and b₁), the fitted line looks like this:
Once we perform a linear regression, we obtain a linear equation, which is that orange line you see over the data points. With it, we can use any input (value in the x-axis) to get an estimated output (value in the y-axis).
In reality, we need more than one variable to get good predictive performance for this kind of problem because many more factors can influence a student’s performance, like the distance from their house to the university, family income, years in high school, gender, marital status, and so on. So, how could we possibly model such a problem?
A Multiple Linear Regression can be seen as a generalization of a Simple Linear Regression. It is described by the following equation:
y = b₀ + b₁⋅x₁ + b₂⋅x₂ + ⋅⋅⋅ + bₙ⋅xₙ, where b₀ is the intercept and b₁, b₂, …, bₙ are the coefficients of the independent variables.
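To make the notation concrete, here is a minimal sketch (with made-up numbers, purely for illustration) of how the coefficients b₀, b₁, …, bₙ can be estimated by ordinary least squares using NumPy:

import numpy as np

# hypothetical data: 5 samples, 2 independent variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# prepend a column of 1's so that the first coefficient plays the role of the intercept b0
X_design = np.column_stack([np.ones(len(X)), X])

# least-squares estimate of [b0, b1, b2]
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coeffs)

Later in the post we’ll do the same with statsmodels, which also gives us the summary statistics we need to evaluate the model.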
Assumptions for Linear Regression
So far, we just talked about what a linear regression is. Still, when can we perform a linear regression? The next paragraphs aim to answer just that.
It is important to know all the regression assumptions and consider them before you perform a regression analysis. If at least one of the assumptions is violated, then your regression model is probably not the best choice to fit the data. Those assumptions are:
1. Linearity
The linear regression is the simplest non-trivial relationship. It is called linear because the regression equation is linear! How can you verify if the relation between the two variables is linear?
The easiest way is to choose an independent variable and plot it against the dependent variable on a scatter plot. Linearity is confirmed if you obtain a result just like the one plotted in the previous section.
On the other hand, examine the following plot. There is no straight line that can fit appropriately to the data points. In this case, a curved line expresses more accurately the data variability. Therefore, it is not recommended to perform a linear regression on this data.
There are easy ways to “fix” non-linearity, such as
1. Run a non-linear regression
2. Apply an exponential transformation
3. Apply a logarithmic transformation
As we are just highlighting the linear regression assumptions, we will not explore these tools here. What you need to know is that whenever you intend to perform a linear regression, check whether the data actually follows a linear pattern. If the data cannot be fitted by a straight line, try a transformation first and only then perform the linear regression!
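As a small illustration (with simulated data, not a prescription), here is how a logarithmic transformation can turn a curved relationship into a roughly linear one before running the regression:

import numpy as np
import matplotlib.pyplot as plt

# simulated data following a roughly exponential pattern
x = np.linspace(1, 10, 50)
y = np.exp(0.5 * x) * np.random.uniform(0.8, 1.2, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, y)           # curved pattern: a straight line fits poorly
ax2.scatter(x, np.log(y))   # after the log transformation, the pattern is close to linear
plt.show()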
2. No endogeneity
The second assumption is the so-called no endogeneity of regressors². It refers to the prohibition of a link between the independent variables and the errors. It is mathematically expressed by the following equation: Cov(xᵢ, ϵ) = 0 for every independent variable xᵢ.
This equation is telling us that the error (the difference between the observed and the predicted values), denoted as ϵ, must not be correlated with any independent variable x. The most common way to violate this assumption is through omitted variable bias, a problem that arises when we forget to include a crucial variable in the model. Consider the following case:
the dependent variable, y, is explained by the 𝑥 variable and also by an x* variable that was omitted from the model. Chances are that 𝑥 is correlated with x*. Since we did not include x* as a regressor, everything it explains goes into the error, and so the error becomes correlated with 𝑥.
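A quick simulation sketch (all numbers are arbitrary) makes the bias visible: when the correlated variable x* is left out, the estimated coefficient of x absorbs part of its effect.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# x and x_star are correlated, and both truly affect y
x_star = rng.normal(size=n)
x = 0.8 * x_star + rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x + 3.0 * x_star + rng.normal(size=n)

# full model: both regressors included, the coefficient of x is estimated near its true value of 2.0
full = sm.OLS(y, sm.add_constant(np.column_stack([x, x_star]))).fit()

# omitting x_star pushes its effect into the error, which is now correlated with x,
# so the coefficient of x comes out biased (well above 2.0)
omitted = sm.OLS(y, sm.add_constant(x)).fit()

print(full.params)
print(omitted.params)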
Let’s try another example. Suppose we wish to predict the price, 𝑦, of an apartment building, in central New York City by using only its size, 𝑥, as a regressor. Then, we obtained the following equation from the model:
This result is extremely counterintuitive, as we expect larger apartments to be more expensive. Yet the negative coefficient implies that as x (the apartment size) increases, the predicted price decreases. This suggests that the covariance between the independent variable and the error is different from zero. Critical thinking time! You can ask yourself questions like:
1. Where do I draw the sample from?
2. Can I get a better sample?
3. Why is bigger real estate cheaper?
Let’s consider the following: the sample comprises apartment buildings in central New York and is large enough, so the problem is not with the sample.
Wait! Remember that we are in New York! The place with some of the most valuable real estate in the world! We omitted the exact location as a variable. In our particular example, the million-dollar suites in the city of New York turned things around. After we add this variable, we may get a result like:
Although the numbers are purely hypothetical, the result looks intuitive! As one would expect, the larger the property, the more expensive it is.
The lesson here is: domain knowledge is helpful! If you do not understand your problem well, you can fall into the endogeneity trap.
3. Normality and Homoscedasticity
The third assumption can be expressed by the equation 𝜖 ∼ 𝑁(0, 𝜎²). Derived from the equation, the third assumption comprises three parts:
Normality: We assume the error term is normally distributed. Normality is not required for running a regression, but it is required for making inferences. Remember that the t-statistic and the F-statistic, which will be explored later, apply only when the error is normally distributed.
Note that, for large samples, the central limit theorem provides a way around this problem, since the sampling distributions of the estimated coefficients become approximately normal even when the errors are not.
Zero Mean: If the error mean is not expected to be zero, then a line will not yield a good fitting. However, having an intercept solves that problem.
So, in real life, it is unusual to violate this part of the assumption.
Homoscedasticity: This part means that the error must have equal variance across observations. Put another way, Var(ϵᵢ) = σ² for every observation i; the spread of the errors does not depend on the values of the independent variables.
What if there is a pattern in the error variance? We will discuss this topic in more detail as it is a bit more confusing than the others.
3.1 Homoscedasticity vs Heteroscedasticity: How to detect and visualize
We will go through an example in order to clarify what Homoscedasticity and Heteroscedasticity are. We’ll use an example of a cross-sectional study to model the number of automobile accidents as a function of the population of towns and cities. The data is purely fictional, but it correctly illustrates the problem.
This example is based on the article Heteroscedasticity in Regression Analysis found in Statistics by Jim.
First, it is important to plot the data and see what it actually looks like:
If we perform a linear regression on the data, our output will be like the following:
In this plot, the data spread out as we move from left to right along the x-axis. Following the regression (orange) line, it becomes clear that the error is increasing as the population increases. This is a clear signal of Heteroscedasticity.
Another way to check for Heteroscedasticity is by assessing the residuals versus fitted values plot. Heteroscedasticity produces a distinctive fan or cone shape in residual plots. The following plot provides us with the information we need. Again, we can see that the error variance is not constant; it grows with the fitted values.
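If you are working with statsmodels (as we will in the implementation section), a residuals versus fitted values plot takes only a few lines. This is just a sketch, assuming `results` is an already fitted OLS results object:

import matplotlib.pyplot as plt

plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, color='grey', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()   # a fan or cone shape here is a sign of heteroscedasticity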
3.2 Fixing heteroscedasticity
There are several ways to perform data transformation in order to achieve Homoscedasticity and get better regressions. We will not study all of them here as they are not in the scope of this post. If you wish to dig a bit deeper into Heteroscedasticity, I recommend this link.
We can circumvent heteroscedasticity by checking for omitted variables. This is always a good idea! After that, you can look for outliers and try to remove them. Finally, we can apply a logarithmic transformation to the data.
For each observation in the dependent variable, calculate its natural log and then create a regression between the log of 𝑦 and the independent 𝑥’s. Alternatively, you can take the 𝑥 variable that is causing the trouble and do the same. Let’s see an example.
Let’s start from the last regression. The population (x-axis) data seemed to be largely responsible for the heteroscedasticity, so let’s apply a logarithmic transformation to the population data and check our new results:
Fantastic! Clearly, we improved our model performance. The new model is called a semi-log model and is given by the equation ŷ = b₀ + b₁⋅ln(x₁).
We can also apply a transformation to the y-axis and see if there is an improvement in our model. Let’s do it!
Amazing! As you may already have realized, the transformations did not extinguish heteroscedasticity, but they drove us near homoscedasticity. This model is called the log-log model and is given by ln(ŷ) = b₀ + b₁⋅ln(x₁).
The interpretation is: As 𝑥 increases by 1%, 𝑦 increases by b₁%.
That’s all for now. As I said, there are many ways to fix heteroscedasticity; here we saw one of the simplest, though not the most effective, solutions. I encourage you to study homoscedasticity further and search for those better solutions yourself.
4. No autocorrelation
The penultimate assumption is the no autocorrelation assumption. This is one of the most problematic properties because it cannot be relaxed. Mathematically, the no autocorrelation property means that Cov(ϵᵢ, ϵⱼ) = 0 for all i ≠ j.
In other words, errors are assumed to have no correlation with one another. Cross-sectional data (data taken at one moment in time) is highly unlikely to present autocorrelation. On the other hand, it is very common in time series.
Autocorrelation roughly means that the data presents a pattern over a timespan, and linear regression does not account for that. For example, let’s plot data from the dataset of daily minimum temperatures in Melbourne, Australia, from 1981 to 1990. You can download this dataset here.
Note how the data present a pattern. Linear regression assumes that the errors are randomly spread around the regression line. That certainly will not be the case.
So, how to detect autocorrelation? A common way is to plot all the residuals on a graph and look for patterns. If you can’t find any, then you are safe.
Also, you can perform a Durbin-Watson test on your regression. Generally, its values fall between 0 and 4; a value of 2 indicates no autocorrelation, while values below 1 and above 3 are a cause for alarm.
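In Python, statsmodels ships a Durbin-Watson implementation that works directly on the residuals of a fitted model. A minimal sketch, assuming `results` is an already fitted OLS results object:

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(results.resid)
print(dw)   # around 2: no autocorrelation; close to 0 or 4: strong positive/negative autocorrelation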
There is no remedy for autocorrelation; the only thing you can do is avoid the linear regression model. Alternative models are:
1. Autoregressive model
2. Moving average model
3. Autoregressive moving average model
4. Autoregressive integrated moving average model
For further reading, I recommend this article authored by Jason Brownlee.
5. No Multicollinearity
The last assumption is no multicollinearity. In Multicollinearity in Regression Analysis: Problems, Detection, and Solutions, from Statistics by Jim, it is stated that:
The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.
This statement is telling us that whenever we change one variable, the others must not be affected. Multicollinearity is just the opposite: if two variables are correlated, changing one implies changing the other as well.
We observe multicollinearity when two or more variables have a strong correlation. Mathematically, the correlation between some pair of independent variables xᵢ and xⱼ is close to 1 in absolute value: |ρ(xᵢ, xⱼ)| ≈ 1.
Let’s exemplify this point with an equation. Take 𝑦 = 3 + 2𝑥:
𝑦 and 𝑥 are two variables with an exact linear relationship. 𝑦 can be represented using 𝑥 and vice-versa. In a model containing 𝑦 and 𝑥, we would have perfect multicollinearity (𝜌 = 1). This imposes a big problem on our model, as the coefficients will be wrongly estimated. In addition, let’s agree that if 𝑦 can be represented using 𝑥, there is no point in using both!
This time, consider two variables 𝑧 and 𝑡 with a correlation of 𝜌 = 0.93. If we had a regression model using 𝑧 and 𝑡, we would also have multicollinearity, although not perfect. In this case, the assumption is still violated, which implies that we may face a problem in our model.
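In practice, a correlation matrix and the variance inflation factor (VIF) are common ways to detect multicollinearity. Below is a sketch with a hypothetical DataFrame of candidate regressors; the variable names and values are made up:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# hypothetical, strongly correlated regressors
X = pd.DataFrame({'z': [1.0, 2.1, 2.9, 4.2, 5.1],
                  't': [1.1, 2.0, 3.1, 4.0, 5.2]})

print(X.corr())   # pairwise correlations close to +/-1 are a warning sign

X_const = sm.add_constant(X)
vif = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]
print(vif)        # a common rule of thumb flags VIF values above 5-10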
Which problems can multicollinearity cause?
Multicollinearity can cause two main problems:
1. The estimated coefficients can vary wildly depending on which other variables are in the model.
2. Reduced precision of the estimated coefficients. This weakens the statistical power and representativeness of the model. In addition, the p-values cannot be trusted to identify which variables are statistically significant.
Finally, I would like to highlight that some multicollinearity is not a problem. If we try to predict whether it will rain or not tomorrow based on the air humidity and temperature, of course, there will be some degree of correlation between these variables.
How can you measure the performance of your regression?
Now that we’ve defined what Simple and Multiple Linear Regressions are and when they can be performed, everything is set up! It’s time to discuss what measures can help us determine how powerful our regression model is.
R-squared (R²)
The R² measure is a relative measure that takes values in the range [0, 1]. An R² of 0 means that your regression line explains none of the variability of the data, i.e., it does not fit your data well. Conversely, an R² of 1 implies that your regression line explains all the variability of the data. In order to understand what the R² truly is, we must dive into some mathematics.
Let’s talk about Decomposition of Variability. Based on the ANOVA framework, we have three terms we must define:
Sum of squares total (SST): the sum of the squared differences between the observed dependent variable and its mean. It is a measure of the total variability of the dataset. Mathematically, SST = Σᵢ (yᵢ − ȳ)².
Sum of squares regression (SSR): the sum of the squared differences between the predicted values and the mean of the dependent variable. You can think of it as a measure of how well your line fits the data. In mathematical terms, we have: SSR = Σᵢ (ŷᵢ − ȳ)².
Sum of squares error (SSE): the sum of the squared differences between the observed values and the predicted values. The SSE is given by the following equation: SSE = Σᵢ (yᵢ − ŷᵢ)².
In regression, our goal is to minimize the SSE.
So far, so good. But you may be wondering: what is the connection between these measures? Well, explicitly, SST = SSR + SSE. From this relationship, we can conclude that:
Total variability = Explained variability + Unexplained variability
Now that we have defined the important measures, let’s come back to the R². It is defined as R² = SSR/SST = 1 − SSE/SST.
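To tie the pieces together, here is a short sketch (with made-up numbers) that computes SST, SSR, and SSE by hand and checks them against the R² reported by statsmodels:

import numpy as np
import statsmodels.api as sm

# hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 4.5, 5.0, 6.5])

results = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = results.fittedvalues

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the regression line
sse = np.sum((y - y_hat) ** 2)         # unexplained variability (residuals)

print(sst, ssr + sse)                  # SST = SSR + SSE (up to floating-point error)
print(ssr / sst, results.rsquared)     # both give the R-squared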
Given the fact that the R² ranges from 0 to 1, an R² of 1 will seldom show up in our model statistics. Commonly, we get an R² between 0.2 and 0.9. So, what is a good R²?
It depends! In fields such as physics and chemistry, we are usually looking for an R² greater than 0.7. However, in more subjective topics, such as social sciences, economics, and finance, an R² of 0.2 could be enough for our model.
You must consider the number of variables feeding the model and the complexity of the problem you are dealing with. For example, an R² of 0.5 means that your model is capturing 50% of the variability of the data. This can be a superb result if you are trying to predict a student’s IRA using only their ENEM score.
Perhaps, if you add the gender variable to the IRA vs ENEM model, you’ll probably get a higher R² because there is evidence that we have a gender gap in the Brazilian educational system.
Adjusted R-squared (Rₐⱼ²)
By now, you might have concluded that if you add as many variables as possible to your model, you’ll always get better and better results. Well, that’s not completely true. The Rₐⱼ², can help us determine if a variable is significant or not to our model.
The Rₐⱼ² is a coefficient that penalizes the excessive use of variables. This measure is useful to make comparisons between regression models. In addition, it’s always less than the R².
If we build a model with p independent variables and n samples, then we can calculate the Rₐⱼ² using the following equation: Rₐⱼ² = 1 − (1 − R²)(n − 1)/(n − p − 1).
Note that, unlike the R², the Rₐⱼ² can assume negative values. From the equation, we can explicitly see that if the number of variables, p, increases while the R² stays the same or barely changes, then the Rₐⱼ² will decrease.
Imagine a situation when you build a model with p = 10 variables and use it in a dataset with n = 400 samples. You just calculated the R² and obtained R² = 0.410. Consequently, you got Rₐⱼ² = 0.3948.
Now you are glad because it seems a good score, so you decide to add one more variable to your model. In this case, you got an R² of 0.411. Therefore, Rₐⱼ² = 0.3943. Let’s interpret the results.
In the second situation, the R² did not change significantly, which means that the variable you added to the model did not increase its explanatory power. Remember that the Rₐⱼ² is there to tell you whether you should add that extra variable to your model? So it did!
Though the R² increased, the Rₐⱼ² decreased. This is a sign that you were penalized for adding an insignificant variable to your model.
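For a quick sanity check of the numbers above, the adjusted R² formula is easy to code by hand:

def adjusted_r2(r2, n, p):
    # R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.410, n=400, p=10))   # ~0.3948
print(adjusted_r2(0.411, n=400, p=11))   # ~0.3943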
F-statistic
Another tool that allows us to compare models is the F-statistic. This statistic is used for testing the overall significance of the model. What does that mean? Well, suppose you wish to compare two models, M₁ and M₂, with p₁ and p₂ explanatory variables, respectively. In addition, let’s assume p₂ > p₁.
Given that the M₂ model has more variables than M₁, we expect M₂ to provide a fit (regression curve) at least as good as M₁’s. More often than not, we want to determine whether the second model is significantly better than the first.
In this context, we could wish to know if the model we built is better than a model with no explanatory variables. I mean, what if we wish to compare a model, M, with another model that has only the intercept as the explanatory term? Here comes the F-test!
Given its nature, we have a null and an alternative hypothesis. In an F-test, the null hypothesis is that none of our independent variables have merit in the model. Mathematically, H₀: b₁ = b₂ = ⋅⋅⋅ = bₙ = 0.
On the other hand, what is left is the alternative hypothesis: H₁: at least one bᵢ ≠ 0.
The interpretation is that if the null hypothesis is true then the union of all the independent variables is not relevant to explain the variability of the data. Otherwise, if at least one independent variable is able to capture the trend of the data, then we can reject the null hypothesis. Surely, the concepts we are talking about will be more palpable once we get through the implementation of a linear regression model.
The F-statistic is the number that lets us assess the overall significance of a linear regression model. Let’s get back to our M₁ and M₂ models. Assuming we have n data points to estimate their parameters, then we can calculate the F-statistic as follows: F = [(SSE₁ − SSE₂)/(p₂ − p₁)] / [SSE₂/(n − p₂)],
where SSE₁ and SSE₂ are the Sum of Squares Error of M₁ and M₂, respectively.
Under the null hypothesis that M₂ does not provide a significantly better fit than M₁, F will have an F-distribution. From the formula, such distribution has (p₂ − p₁, n − p₂) degrees of freedom.
Once we have the F-distribution, we can choose a false-rejection probability (a common choice is 0.05). If the F calculated from the previous formula is greater than the critical value of the F-distribution, then we reject the null hypothesis.
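A small sketch of this test using scipy, with hypothetical values for the two sums of squared errors, the numbers of regressors, and the sample size:

from scipy import stats

# hypothetical values: M1 has p1 = 2 regressors, M2 has p2 = 4, and n = 100 samples
sse_1, sse_2 = 150.0, 130.0
p1, p2, n = 2, 4, 100

f_stat = ((sse_1 - sse_2) / (p2 - p1)) / (sse_2 / (n - p2))
p_value = stats.f.sf(f_stat, p2 - p1, n - p2)   # right tail of the F-distribution

print(f_stat, p_value)   # reject the null hypothesis if p_value is below the chosen level, e.g. 0.05

If both models were fitted with statsmodels, a similar comparison is also available through the results objects’ compare_f_test method.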
A python implementation
It has been a long journey to this point, but now it’s time to see everything we talked about so far in a more practical way. We’ll implement two models, one Simple Linear Regression and one Multiple Linear Regression. With the provided output, I’ll address as many topics as possible from those we’ve been discussing.
The examples we’ll be working with are based on some lessons from the Udemy course The Data Science Course 2019: Complete Data Science Bootcamp.
The dataset used in this section is in a .csv file with GPA and SAT scores from 84 fictitious US students. We’ll analyze this dataset and perform a linear regression on the data in order to get some information about the performance of college students and how it is correlated with their SAT score. This kind of problem is very similar to the IRA vs ENEM example that was introduced earlier.
1. Simple Linear Regression
First things first. Let’s import the relevant libraries:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
We could use the Scikit-learn library to build the model, but I think the statsmodels library suits us best. It’s simple and provides useful summary tables that we can explore later. Okay, time to load the dataset:
data = pd.read_csv('Simple linear regression.csv')
data.head()
We are trying to figure out how well the SAT score predicts college students’ GPA. By definition, our dependent variable, 𝑦, is the GPA score and the independent variable, 𝑥₁, is the SAT score. So, I’ll split the dataset into these two variables.
y = data['GPA'] # dependent variable
x1 = data['SAT'] # independent variable
Let’s verify the linearity of the data and if other assumptions are violated or not by plotting the data:
plt.scatter(x1,y)
plt.xlabel('SAT', fontsize=10)
plt.ylabel('GPA', fontsize=10)
plt.show()
From the plot, we can clearly see that there is a correlation (not necessarily causation) between the SAT score and the GPA score.
None of the assumptions seem to be violated, so let’s keep going. For computational reasons, we must add a constant column of 1’s, which is the regressor the intercept multiplies. Hence, our regression equation becomes y = b₀ ⋅ 1 + b₁ ⋅ x₁:
x = sm.add_constant(x1)
Then, we are free to perform the regression itself. In this case, I’m using the Ordinary Least Squares method:
results = sm.OLS(y,x).fit()
Lastly, let’s summarize the results:
results.summary()
Typically, every time we use statsmodels we’ll have three main tables, namely:
Model Summary Table
Coefficients Table
Additional Tests Table
The composition of all these tables is called the regression table, and it provides useful insights about our model. In the Coefficients Table, look at the coef column. It has two values: the coefficients 𝑏₀ (const) and 𝑏₁ (SAT) of the regression model, which means that the equation that best fits our data is given by:
𝑦̂ = 0.2750+0.0017𝑥₁
Let’s draw this line over the data points and see if it’s a good prediction model:
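One way to do it is the sketch below, which reuses the y, x1, and results objects defined above; the coefficients are read from results.params instead of being typed by hand:

b0, b1 = results.params['const'], results.params['SAT']

plt.scatter(x1, y)
plt.plot(x1, b0 + b1 * x1, color='orange', label='fitted line')
plt.xlabel('SAT', fontsize=10)
plt.ylabel('GPA', fontsize=10)
plt.legend()
plt.show()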
Well, it seems pretty good. With the summary table, we can verify some statistical evidence that this is, in fact, a good model.
First, look at the R²: it is equal to 0.406. This means that our model explains almost 40% of the data variability. The F-statistic also presents a good value, but as we haven’t built other models yet, we can’t use it for comparison purposes.
Again focusing on the Coefficients Table, have a look at the p-values of the t-test for each explanatory variable. This test runs under the null hypothesis H₀: bᵢ = 0.
The p-value of the SAT variable is virtually zero, meaning that this variable is statistically significant to our model.
On the other hand, the p-value for the intercept (const) is 0.503. That does not mean the intercept is useless to the model; it can be justified by the fact that we are only trying to capture the relationship between the SAT and the GPA score.
In other words, our model cares about the variability of the GPA based on the variability of the SAT more than anything.
Simple Linear Regression is a bit boring, don’t you think? Let’s make things more fun from now on.
2. Multiple Linear Regression
How about adding one more independent variable to our model? And what if this variable is a random variable with no explanatory power? Let’s walk down this dangerous path :)
data = pd.read_csv('Multiple linear regression.csv')
data.head()
The procedure to perform a multiple linear regression is pretty much the same as for the simple case. So I just put the code below:
y = data['GPA']
x1 = data[['SAT', 'Rand 1,2,3']]
x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
results.summary()
Let’s compare the results obtained for this model and our first model. Below, we have a table showing the R² and the Rₐⱼ² values of each model:
Notice how the R² barely changed between the two models; this implies that the addition of the random independent variable did not improve the explanatory power of the model. On the other hand, the Rₐⱼ² decreased in the second model, which is further evidence that the random variable is not significant to the model.
Indeed, in a more careful analysis, if you look at the Coefficients Table it becomes clear that the random variable is useless to our model. It has a p-value of 0.762, meaning that we cannot reject the null hypothesis H₀: bᵢ = 0 at any reasonable significance level. This p-value is incredibly high!
Let’s switch to the F-statistic:
The p-value of the F-statistic for both models is virtually zero, meaning that both models significantly explain the variability of the data. On the other hand, the F-statistic for the multiple linear regression is about half that of the simple one, and the lower the F-statistic, the closer we are to a non-significant model.
This happened because we were penalized for adding an insignificant variable to the model.
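All of the numbers used in this comparison can be pulled straight from the fitted results objects. A sketch, assuming the two models were stored under the hypothetical names results_simple and results_multiple:

for name, res in [('simple', results_simple), ('multiple', results_multiple)]:
    print(name, res.rsquared, res.rsquared_adj, res.fvalue, res.f_pvalue)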
The point is: cherry-pick your variables when performing a linear regression. Adding new variables does not necessarily imply an increase in the explanatory power of your model.
That’s the end for now. Thank you very much for your attention. I hope that now you understand at least a bit more about linear regression than you did before reading this post. Also, if you want to access or download the Jupyter notebook with the notes of this post, access my GitHub at https://github.com/alexandrehsd/Regression-Analysis.
Please leave your comments and suggestions. I would love to read them :)