How to run and interpret MULTIPLE regression models in R: quick guide with real-world economic data

Dima Diachkov
Published in Data And Beyond · Oct 24, 2023

Warm welcome, dear reader. Last week we tackled simple regression (previous article), trying to capture the effect of one variable on another. Now, as promised, we will look at multiple regression, which is designed for more intricate situations involving more than two variables.

Again, you really don’t need to be a statistics guru to run or interpret a multiple regression model. Multiple regression is an extension of simple linear regression and allows us to understand how multiple independent variables affect a dependent variable. It’s widely used in various fields like finance, healthcare, and machine learning.

I. Introduction: Embracing the Multiple Regression Approach

Just as a multitude of factors shape our decisions in life, multiple regression allows us to consider various predictors (independent variables) to better understand and predict our dependent variable. Such a technique is indispensable across multiple domains including economics, medicine, social science, and more, offering nuanced insights into the multi-dimensional nature of phenomena.

To illustrate, imagine predicting a country’s GDP. We would likely consider numerous variables like inflation, employment rate, and perhaps even societal factors, all simultaneously shaping GDP dynamics. This complexity is where multiple regression steps in, lending its power to capture and analyze these simultaneous influences.

Here’s a glance at multiple regression’s omnipresence across fields:

  1. Predictive Modeling: To predict future outcomes based on multiple variables.
  2. Risk Assessment: In finance, to assess the risk associated with different investment options.
  3. Marketing Analytics: To understand how different factors like price, advertising, and seasonality affect sales.
  4. Healthcare: To predict patient outcomes based on multiple health indicators.
  5. Machine Learning: As a foundational algorithm in supervised learning.

But this list is not exhaustive! Let’s talk about how it works and how different it is from simple linear regression.

II. Understanding Multiple Regression: What is it Exactly?

Multiple regression extends simple linear regression by considering more than one independent variable to predict the dependent variable, accommodating the real world’s multifaceted nature. Simple as that.

For more complex scenarios involving multiple independent variables, the formula expands into:

Multiple Linear Regression Formula:

Y = β_0 + β_1·X_1 + β_2·X_2 + … + β_n·X_n + ϵ

where:

  1. Y is the dependent variable,
  2. β_0 is the intercept,
  3. β_1, …, β_n are the slope coefficients of the independent variables X_1, …, X_n,
  4. and still and always there is this ϵ, which is the error term.

The model operates under several assumptions. First, it assumes a linear relationship between the dependent and independent variables. Second, the observations should be independent of each other. Third, the model assumes homoscedasticity, meaning the variance of the errors is constant across all levels of the independent variables. Fourth, for any fixed value of the independent variables, Y is assumed to be normally distributed. Finally, the model assumes that there is no perfect multicollinearity among the independent variables.
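These assumptions are easy to eyeball in base R: calling plot() on a fitted lm object produces four standard diagnostic plots. A minimal sketch on synthetic data (the variables x and y here are invented for illustration, not our dataset):

```r
# Synthetic data; in practice, call plot() on your own fitted model
set.seed(3)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

par(mfrow = c(2, 2))   # 2x2 grid for the four diagnostic plots
plot(fit)              # residuals vs fitted (linearity), Q-Q (normality),
                       # scale-location (homoscedasticity), residuals vs leverage
```

If the residuals-vs-fitted plot shows a funnel shape, homoscedasticity is in doubt; strong curvature hints that the linearity assumption is violated.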

The coefficients in the model are estimated using the method of least squares. This method minimizes the sum of the squared differences between the observed and predicted values of the dependent variable. The intercept, β_0​, represents the value of Y when all independent variables are zero. The slope coefficients β_1​, β_2​, …, β_n​, indicate the change in Y for a one-unit change in the corresponding independent variable, while holding all other variables constant.
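To make the mechanics concrete, here is a small sketch on synthetic data (x1 and x2 are made-up predictors) showing that the coefficients lm() estimates by least squares match the closed-form normal-equations solution β = (XᵀX)⁻¹Xᵀy:

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1.5 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# lm() minimizes the sum of squared residuals...
fit <- lm(y ~ x1 + x2)

# ...which has the closed-form solution beta = (X'X)^-1 X'y
X <- cbind(1, x1, x2)                    # design matrix with an intercept column
beta_manual <- solve(t(X) %*% X, t(X) %*% y)

round(coef(fit), 4)
round(as.vector(beta_manual), 4)         # the same three numbers
```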

To assess the fit of the model, we often look at measures like R squared and the adjusted R squared. R squared represents the proportion of the variance for the dependent variable that’s explained by the independent variables in the model. The adjusted R squared provides a more accurate measure of goodness-of-fit by adjusting R squared based on the number of predictors in the model. Hypothesis tests such as the F-test and t-tests are used to test the significance of the overall model and individual coefficients, respectively.
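As a sanity check, both measures can be computed by hand from the residuals and compared against what summary() reports; a short sketch on synthetic data (variable names are illustrative):

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((y - mean(y))^2)                     # total sum of squares
r2     <- 1 - rss / tss                         # proportion of variance explained
p      <- 2                                     # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalized for model size

c(manual = r2,     from_summary = summary(fit)$r.squared)
c(manual = adj_r2, from_summary = summary(fit)$adj.r.squared)
```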

Tip: Model selection techniques like stepwise regression and regularization methods like Lasso and Ridge regression can be used to refine the model. These methods help in the automatic selection of predictive variables and in preventing overfitting, respectively.
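As one hedged illustration of stepwise selection, base R's step() drops predictors by AIC (Lasso and Ridge would need an extra package such as glmnet). On synthetic data with one pure-noise predictor:

```r
set.seed(7)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)     # x3 has no real effect

full <- lm(y ~ x1 + x2 + x3, data = d)
best <- step(full, direction = "backward", trace = 0)
formula(best)                             # the noise predictor is usually dropped
```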

However, multiple regression comes with its own set of risks and caveats. Overfitting is a common issue when too many variables are included, as the model may fit the noise rather than the underlying trend. Also, it’s crucial to remember that multiple regression can indicate correlation but not causation. Another risk is endogeneity, which occurs when an explanatory variable is correlated with the error term, leading to biased estimates.

However, multiple regression is a very versatile and powerful tool that is widely used in various fields ranging from economics and biology to engineering. It serves as a foundational technique for both predictive modeling and hypothesis testing. Let’s see how to do one ourselves.

III. Hands-on part. Getting data

To test our newly acquired knowledge, we will run a multiple regression on data about GDP, inflation, and unemployment in European countries. Last time we concluded that inflation does not affect GDP. This time I am going to add each country's corruption ranking (a lower value is better).

In one of my previous articles, I provided a ready-made script to parse data online from the Trading Economics website (check this link to the article and GitHub source code).

library(dplyr)
library(ggplot2)

parse_tradecon_table <- function(link = "", table_number = 1, indicator_name = "value") {
  # we load the packages inside the function so they are attached automatically every time it is used
  library(rvest)
  library(dplyr)

  # check that the provided link is a non-empty string
  if (link == "") {
    stop("No URL provided")
  }

  # try to parse the URL; if any step fails, stop with an informative error message
  parsed_data <- tryCatch(read_html(link),
                          error = function(e) stop("Something went wrong... Please check the link you provided."))
  parsed_table <- tryCatch(html_table(parsed_data),
                           error = function(e) stop("Something went wrong... Seems like there are no tables available."))
  df <- tryCatch(as.data.frame(parsed_table[[table_number]]),
                 error = function(e) stop(paste0("Something went wrong... Seems like the link does not have table number ",
                                                 table_number, " or any tables at all")))

  # keep only the country and latest-value columns, renaming them on the fly
  output_df <- df %>%
    select(Country, Last) %>%
    rename(!!indicator_name := Last, country = Country)

  return(output_df)
}

Now we have to feed it links to the webpages with data. Here is the snippet we need to obtain the most recent (Aug 2023 to Sep 2023) macro data ad hoc.

# Feeding datasets from the web
infl_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/harmonised-inflation-rate-yoy?continent=europe", indicator_name = "inflation")
unemp_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/employment-change?continent=europe", indicator_name = "unemployment")
gdp_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/gdp-annual-growth-rate?continent=europe", indicator_name = "gdp_growth")
corruption_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/corruption-rank?continent=europe", indicator_name = "corruption")

# Merging datasets into one
merged_df <- infl_df %>%
  full_join(unemp_df, by = "country") %>%
  full_join(gdp_df, by = "country") %>%
  full_join(corruption_df, by = "country")

european_union <- c("Austria","Belgium","Bulgaria","Croatia","Cyprus",
"Czech Republic","Denmark","Estonia","Finland","France",
"Germany","Greece","Hungary","Ireland","Italy","Latvia",
"Lithuania","Luxembourg","Malta","Netherlands","Poland",
"Portugal","Romania","Slovakia","Slovenia","Spain",
"Sweden")

merged_df$eu_country <- factor(ifelse(merged_df$country %in% european_union, "EU-countries", "Other countries"))

# Final dataset, dedicated to EU data only
data <- merged_df %>% filter(eu_country == "EU-countries")

These scripts will give us this data frame (below), stored in data in R.

Contents of the data object

So here is the data. Now we can run our first model.

IV. Some checks before we run regression

In the quest to understand the dynamics of GDP growth, we’ve considered multiple variables: inflation, unemployment, and corruption. A correlation matrix could be the first step in this exploratory analysis.

# Let's define the variables of interest
interesting_vars <- c("inflation", "unemployment", "gdp_growth", "corruption")

# Let's build a correlation matrix to get a view on co-dependency
cor_matrix <- cor(data[, interesting_vars], use = "complete.obs")
print(cor_matrix)
Output of the correlation matrix

Interestingly, inflation and corruption show a strong positive correlation of about 0.70. GDP growth, on the other hand, has a moderate positive correlation with unemployment (~0.32) and corruption (~0.40), but a slight negative correlation with inflation.

To visualize these relationships, scatter plots can be generated for each pair of variables. The plots can further confirm the observed correlations, providing a graphical representation of how each variable might relate to the others.

# Visual analysis
pairs(data[, interesting_vars], pch = 21, bg = c("red"))
Output of visual approach to correlation

Tip: Before proceeding to multiple regression, it’s crucial to check for multicollinearity among the independent variables.

The Variance Inflation Factor (VIF) can be calculated for each variable in the model to verify whether multicollinearity is a concern. In our case, all VIF values are well below the commonly used threshold of 5, indicating that multicollinearity is not a problem in this model. The following code does that.

library(car)

vif_model <- lm(gdp_growth ~ inflation + unemployment + corruption, data = data)
vif(vif_model)
Output for the code above

So the data is more or less suitable for the model. Finally, we can run it.

V. Running & interpreting the model

Let’s run the lm() function with all explanatory variables listed after the tilde (~) sign, just as we did for the simple regression model. Simple as that.


# Run regression
model <- lm(gdp_growth ~ inflation + unemployment + corruption, data=data)

# Displaying the model summary
summary(model)
Intercept score

The intercept is 0.08186, but it’s not statistically significant (p-value = 0.878). This means that when all the independent variables are zero, the expected GDP growth rate is approximately 0.082, although this is not a meaningful interpretation given the high p-value.

Inflation, Unemployment, and Corruption
  1. Inflation: The coefficient is -0.38757 and is significant at the 0.01 level (p-value = 0.004270). This suggests that a one-unit increase in inflation is associated with a 0.388 decrease in GDP growth, holding other variables constant.
  2. Unemployment: The coefficient is 0.77239 and is marginally significant (p-value = 0.064249). This indicates that a one-unit increase in unemployment is associated with a 0.772 increase in GDP growth, although this is not statistically significant at the conventional 0.05 level.
  3. Corruption: The coefficient is 0.06622 and is highly significant (p-value = 0.000295). Given that the corruption data is based on rankings where a lower value is better, this positive coefficient implies that a worse corruption ranking (i.e., a higher ranking number) is associated with an increase in GDP growth, holding other variables constant. A little puzzling, but it can be explained economically.

Now let’s interpret other output items, which describe the model in general.

The residual standard error is 1.247, which gives an idea of the average distance that the observed values fall from the regression line. The R-squared value is 0.4986, indicating that approximately 49.86% of the variability in GDP growth is explained by the model. Almost half! Not perfect, but still something. The adjusted R-squared is 0.4332, which takes into account the number of predictors in the model and provides a more accurate measure of goodness-of-fit. Less than half, but still not too bad for the toy example. The F-statistic is 7.624 with a p-value of 0.001033, indicating that at least one predictor variable is statistically significant in explaining the variability in GDP growth.

Overall, the model suggests that inflation negatively impacts GDP growth, while unemployment has a marginally positive impact. Notably, a worsening in the corruption ranking is associated with an increase in GDP growth. The model is statistically significant but explains less than 50% of the variance in GDP growth, suggesting that other factors are also influential. The model can be fine-tuned, but from an economic perspective it is already instructive.
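One inexpensive way to sharpen the interpretation is confint(), which reports 95% confidence intervals for each coefficient; intervals that exclude zero correspond to the significant p-values in summary(). A quick sketch on synthetic data (the variable names a and b are illustrative, not our scraped dataset):

```r
set.seed(10)
n <- 60
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$y <- 0.5 + 1.2 * d$a - 0.8 * d$b + rnorm(n)
m <- lm(y ~ a + b, data = d)

confint(m)   # 95% CI for the intercept and each slope
```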

Creating a visualization for a multiple regression model can be a bit complex since it involves multiple dimensions. However, a common practice is to visualize the relationships between the dependent variable and each independent variable separately, while holding other variables constant. We will create a generic solution that will allow you to plot any set of charts.

# Load necessary library
library(ggplot2)

# Create a function to plot each variable
plot_var <- function(var_name) {
  # Scatter plot with a simple regression line for a specific variable
  p <- ggplot(data, aes(x = .data[[var_name]], y = gdp_growth)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, col = "red") +
    labs(title = paste("Relationship between", var_name, "and GDP Growth"),
         x = var_name,
         y = "GDP Growth")

  return(p)
}

# List of independent variables
independent_vars <- names(data)[names(data) %in% c("inflation", "unemployment", "corruption")]

# Apply the function to each independent variable
lapply(independent_vars, plot_var)

Here are our three charts:

These charts visualize the same patterns we identified at the regression-model level. Inflation is almost irrelevant to GDP growth, while unemployment and corruption are relevant. Of course, this is a toy example and this output cannot be used for practical inference, as it is based on a small sample and covers only one time period. Deeper analysis is required for real-world applications.

VI. Summary of regression

The analysis suggests that inflation negatively impacts GDP growth, while unemployment and a worse corruption ranking have a positive impact. The positive relationship between a worse corruption ranking and GDP growth is not intuitive.

The positive correlation between corruption rankings and GDP growth in our model could be due to various factors. My idea: it’s possible that corruption facilitates short-term economic activity by bypassing bureaucratic delays, or that high-growth economies attract more corruption rather than corruption causing growth. The model might also be missing key variables that explain this relationship, or the data could be flawed or incomplete. Additionally, cultural factors and statistical anomalies could be influencing the results. Further investigation is needed to validate these counterintuitive findings. But that could become a topic for the next articles.

It’s also worth noting that while the models are statistically significant, they explain less than half of the variance in GDP growth, indicating that other factors not included in the model could also be influencing GDP growth.

Given these findings, policymakers and economists should consider the complex interplay between these variables when designing economic policies. Further research could also explore the inclusion of additional variables and the use of more complex models to better understand the dynamics of GDP growth.

Conclusion

Multiple regression analysis is a powerful tool for understanding complex relationships between variables, and now we know how to run it, interpret it, and understand the theory behind it. Please keep in mind that it’s crucial to consider both statistical and practical significance when interpreting models.

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥


Balancing passion with reason. In pursuit of better decision making in economic analysis and finance with data science via R+Python