A Guide for Beginners
Why do we use multiple linear regression?
Multiple linear regression is one of the most widely used forms of regression analysis. It predicts the outcome of a variable based on two or more independent (or explanatory) variables. Why use it? Multiple regression lets us estimate the influence of one variable on another while holding the other variables constant. For instance, we can see if the number of bathrooms in a house affects price, holding all else equal. If we only had one variable affecting our dependent variable (housing prices), we would be creating a simple linear regression model. But what if we believed that more than one thing affects housing prices (e.g. view, neighborhood, distance to the nearest city)?
This is where multiple linear regression becomes a great tool. A multiple regression model can take in many independent variables and explain how each of them affects the dependent variable. In our example above, if we wanted a model that explains our dependent variable more fully, we could add variables like school district, crime rate, or any other variable that may help predict our response variable (price).
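To make the difference between simple and multiple regression concrete, here is a minimal sketch using NumPy's least-squares solver. The housing numbers are synthetic and made up purely for illustration, and the helper name fit_ols is our own, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up housing data (not taken from any real data set).
bathrooms = rng.integers(1, 5, size=n).astype(float)
sqft = rng.normal(1800, 400, size=n)
price = 50_000 + 15_000 * bathrooms + 120 * sqft + rng.normal(0, 10_000, size=n)


def fit_ols(predictors, y):
    """Return least-squares coefficients, with the intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


simple = fit_ols([bathrooms], price)          # simple regression: one predictor
multiple = fit_ols([bathrooms, sqft], price)  # multiple regression: two predictors

print("simple (intercept, bathrooms):        ", simple)
print("multiple (intercept, bathrooms, sqft):", multiple)
```

The multiple model returns one coefficient per predictor plus the intercept, so each predictor's effect is estimated while the other is held constant.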
Before we dive into our analysis of multiple linear regression, we must understand what we assume about our data when creating any regression model, and what our variables mean. Then we’ll explore how to understand and interpret our model.
Assumptions of multiple linear regression:
1. Homogeneity of variance (homoscedasticity): This assumes that the size of the error in our prediction doesn’t change significantly across the values of the independent variables.
2. Independence of observations: This assumption states that the observations within our data are independent of one another and have been collected through statistically valid methods.
3. No multicollinearity: In multiple linear regression, it is possible that some of your explanatory variables are correlated with one another. This is known as multicollinearity. For example, height and weight are heavily correlated. If both of these variables are used to predict sex, we should keep only one of them in our model, because including both adds redundant information and can distort the results of our regression model.
4. Normality: We assume the model's errors (residuals) are normally distributed.
5. Linearity: Multiple linear regression requires the relationship between the independent and dependent variables to be linear.
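As a rough sketch of how some of these assumptions could be checked, the snippet below fits a model to synthetic data and computes a few quick diagnostics. The data, the variable names, and the particular checks are our own simplifications; they are informal sanity checks, not formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic data built so that the assumptions hold.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# Homoscedasticity: the residual spread should be similar for low and
# high fitted values, so this ratio should be close to 1.
order = np.argsort(fitted)
low = residuals[order[: n // 2]]
high = residuals[order[n // 2:]]
spread_ratio = high.std() / low.std()

# Multicollinearity: correlation between the two predictors.
predictor_corr = np.corrcoef(x1, x2)[0, 1]

# Normality of residuals: skewness near 0 suggests a symmetric distribution.
skew = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3

print("residual spread ratio (high/low):", spread_ratio)
print("correlation between predictors:  ", predictor_corr)
print("residual skewness:               ", skew)
```

In practice, plotting the residuals against the fitted values is usually the first thing to look at; the numbers above are just a compact stand-in for that plot.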
Understanding the variables in your equation:
Below is a list of what each term in the regression equation (Y = B0 + B1X1 + … + e, as seen in the picture above) represents:
- Y = the dependent or response variable. This is the variable you are looking to predict.
- B0 = the y-intercept, which is the predicted value of y when all independent variables are set to 0.
- B1X1 = B1 is the coefficient of the first independent variable (X1) in your model. It can be interpreted as the change in the predicted y value for a one-unit increase in X1, holding all else equal.
- “…” = the additional variables you have in your model.
- e = the model error. It captures the variation in y that our independent variables do not explain.
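The terms above can be combined into a prediction as in the short sketch below. The function name predict and all of the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def predict(intercept, coefficients, values):
    """Evaluate B0 + B1*X1 + ... + Bn*Xn.

    The error term e is left out because it is unknown for a new observation.
    """
    if len(coefficients) != len(values):
        raise ValueError("need exactly one value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical numbers, for illustration only.
b0 = 10.0
betas = [2.0, -0.5]   # B1, B2
x = [3.0, 4.0]        # X1, X2

y_hat = predict(b0, betas, x)
print(y_hat)  # 10 + 2*3 - 0.5*4 = 14.0
```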
Understanding and interpreting our model:
Below, we have some data on miles per gallon for various cars. Each variable represents a different aspect of the car (e.g. horsepower, weight, acceleration). For example, the Ford Torino has an mpg of 17.0 with a weight of 3,449 pounds and was built in the United States. It is very important for us to understand our data and what it is saying before we create our model.
Now that we have familiarized ourselves with our data, let’s identify what we are looking to predict (our Y or response variable). For example, a client of yours might want to be able to predict the mpg of a car based on certain parameters. In this case, our mpg variable will be our dependent variable, and we will use other variables in our data set to help predict mpg.
Once we have identified our response variable, we must decide which independent variables will be the best predictors of it. I begin by creating a model with each quantitative variable in our data set. Ultimately, we want to be able to understand which variables influence our dependent variable at a statistically significant level. There will be three key factors to help us determine if we keep or remove the variable in question (adjusted r-squared, t-statistic, and p-value).
To begin, let’s take a look at our coefficients to understand what they are saying. For example, every additional pound of weight a car has, holding all else equal, brings down mpg by 0.0052. A better way to interpret this is to put it in more relevant and understandable terms; as weight goes up 100 pounds, mpg goes down by 0.52. Similarly, one additional cylinder brings down mpg by 0.3979. Understanding your coefficients is very important in conveying the results of your model.
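The rescaling described above is simple arithmetic, sketched here with the two coefficients quoted in the text. The helper name predicted_change is our own.

```python
# Coefficients quoted in the text above.
weight_coef = -0.0052      # change in mpg per additional pound
cylinders_coef = -0.3979   # change in mpg per additional cylinder


def predicted_change(coef, delta):
    """Predicted change in mpg when one predictor moves by delta, all else equal."""
    return coef * delta


change_100_lb = predicted_change(weight_coef, 100)
change_2_cyl = predicted_change(cylinders_coef, 2)

print(round(change_100_lb, 2))  # -0.52
print(round(change_2_cyl, 4))   # -0.7958
```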
One way of deciding whether a variable should be kept in your model is by looking at the t-statistic (t) and its associated p-value (P>|t|). The p-value represents how likely it is that a t-statistic at least this extreme would have occurred if the null hypothesis (no relationship between the independent variable and the dependent variable) were true. If the p-value for a coefficient is below the alpha level (usually set at 0.05), we can conclude that the given independent variable likely influences the dependent variable. For our purposes, both weight and horsepower seem to influence mpg.
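Here is a hedged sketch of where the t-statistic and p-value come from, using synthetic data in which one predictor truly affects y and the other does not. To stay within NumPy and the standard library, the p-values use a normal approximation to the t distribution, which is only reasonable for large samples; a statistics package would use the exact t distribution instead.

```python
import math

import numpy as np

rng = np.random.default_rng(2)
n = 500

# Synthetic data: x1 truly affects y, x2 does not.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = y - X @ coef
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic: how many standard errors each coefficient is away from 0.
t_stats = coef / std_err

# Two-sided p-values, ASSUMING a normal approximation to the t distribution.
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, t, p in zip(["intercept", "x1", "x2"], t_stats, p_values):
    print(f"{name:>9}: t = {t:8.2f}, p = {p:.4f}")
```

A large t-statistic in absolute value goes hand in hand with a small p-value, which is why x1 should clear the 0.05 threshold easily here.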
What if our variables are correlated? Can we keep those in the model? The simple answer is no. As we stated above, regression analysis assumes (and aims for) independent variables that are not strongly correlated with one another. The idea is that you can alter the value of one independent variable while keeping the other variables constant. If independent variables are in fact correlated, shifts in one variable are associated with changes in the other. The stronger the correlation between two independent variables, the more difficult it becomes for the model to estimate the relationship between each explanatory variable and the response variable, because the independent variables change in unison.
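To see why correlated predictors make estimation harder, the sketch below compares the standard error of the same coefficient in two synthetic settings: one where the second predictor is independent of the first, and one where it is almost a copy of it. The data and the helper name coef_std_errors are our own assumptions for illustration.

```python
import numpy as np


def coef_std_errors(x1, x2, y):
    """Standard errors of the OLS coefficients (intercept, x1, x2)."""
    X = np.column_stack([np.ones(len(y)), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))


rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

# Case 1: the second predictor is unrelated to the first.
x2_independent = rng.normal(size=n)
# Case 2: the second predictor is built to correlate about 0.98 with the first.
x2_correlated = 0.98 * x1 + np.sqrt(1 - 0.98**2) * rng.normal(size=n)

y_ind = 1 + 2 * x1 + 3 * x2_independent + noise
y_cor = 1 + 2 * x1 + 3 * x2_correlated + noise

se_ind = coef_std_errors(x1, x2_independent, y_ind)
se_cor = coef_std_errors(x1, x2_correlated, y_cor)

print("std. error of the x1 coefficient, independent predictors:", se_ind[1])
print("std. error of the x1 coefficient, correlated predictors: ", se_cor[1])
```

The true coefficient on x1 is 2 in both cases, but the model is far less certain about it when x2 moves in unison with x1.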
In the table below, we can see the correlation between each variable:
For the purposes of our model, we will use 0.8 as the threshold that indicates multicollinearity. If the correlation between two variables is above that threshold, we will remove one of the two variables from our model. As you can see, weight and horsepower are heavily correlated, so we will have to take one of those out of our model. I chose to remove horsepower as I believe weight is a stronger indicator of mpg than horsepower. Our acceleration and model_year variables are not correlated with any other variables in our data frame, so we add them to the model as well. However, the p-value for the acceleration coefficient is over 0.05, so we decide to take that variable out of the model. In our final model, we are left with only weight and model_year as our explanatory variables for mpg.
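The threshold rule above can be sketched as follows. The columns are synthetic stand-ins that borrow the variable names from our example (horsepower is deliberately built to track weight); they are not the actual car data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Synthetic stand-ins for the car variables, not the real data set.
weight = rng.normal(3000, 800, size=n)
horsepower = 0.04 * weight + rng.normal(0, 10, size=n)  # built to track weight
acceleration = rng.normal(15, 3, size=n)                # unrelated to the others

names = ["weight", "horsepower", "acceleration"]
data = np.column_stack([weight, horsepower, acceleration])

# Pairwise correlation matrix; each column is one variable.
corr = np.corrcoef(data, rowvar=False)

THRESHOLD = 0.8
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > THRESHOLD
]

for a, b, r in flagged:
    print(f"{a} and {b} are correlated at {r:.2f}; consider dropping one")
```

Which of the two flagged variables to drop is a judgment call, as it was with weight and horsepower above.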
Notice that our adjusted r-squared is noticeably higher in this model than in the previous one with all of our variables. Our adjusted r-squared states that 80.8% of the variation in mpg can be explained by our model, which is greater than our previous adjusted r-squared value of 70.4%.
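For reference, adjusted r-squared is the usual r-squared with a penalty for the number of predictors. The sketch below uses the standard formula; the inputs are hypothetical numbers, not the values from our car models.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denominator = n_obs - n_predictors - 1
    if denominator <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / denominator


# Hypothetical inputs: the same r-squared with 2 versus 5 predictors.
two_predictors = adjusted_r_squared(0.80, n_obs=101, n_predictors=2)
five_predictors = adjusted_r_squared(0.80, n_obs=101, n_predictors=5)

print(two_predictors)
print(five_predictors)
```

With the same r-squared, the model with more predictors gets the lower adjusted value, which is why adjusted r-squared is useful for comparing models of different sizes.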
Voilà! You have completed your first multiple linear regression model. Although this is very basic, it is a great framework to get you started as you dive deeper into regression analysis!