How to run and interpret simple regression models in R

Dima Diachkov
Published in Data And Beyond
Oct 7, 2023 · 8 min read

I want to say it right away: you don’t need to be a statistics, math, or econometrics professional to run or interpret a regression, or even to use one at work. Read my thoughts on the most fundamental regression technique in R, and you will see for yourself that basic linear regression in R can help even non-professionals find interesting patterns in data.

Beautiful Tile Pattern in Lisboa | Credits: Alice Butenko on Unsplash

I. Intro on the role of regression analysis in science and business

Embarking on a journey through data science often begins with a fundamental cornerstone: regression analysis. As one of the simplest and most foundational techniques in the statistician’s toolbox, regression analysis enables us to discern patterns, understand relationships among variables, and forecast outcomes in an incredibly vast array of fields, including finance, medicine, engineering, and social science, to name a few.

Let me illuminate a few key areas where regression analysis permeates through the data industry:

1. Predictive and explanatory modeling in science (obviously): establishing the basis for models that predict future data points from observed data, or for explanatory analysis.

2. Machine Learning Algorithms: Enhancing predictive capabilities in algorithms in supervised learning, such as Support Vector Machines (SVM) or Random Forest.

3. Time Series Analysis: Enabling forecasting in sequential data points using methods like ARIMA (AutoRegressive Integrated Moving Average).

4. Financial Forecasting: Informing models predicting future financial trends based on historical data.

5. Business Analytics: Aiding marketers in predicting sales and assessing campaign impacts.

6. Healthcare Research: Facilitating the analysis and prediction of disease progression and outcomes in clinical studies.

Regression analysis is a well-established technique, and it is now embedded in tools such as R, Python, SPSS, and even Excel.

Therefore, our exploration into R and regression analysis is not just a statistical expedition but a quest into the analytical heart of data science, empowering us to navigate through the data-driven decision-making realms with finesse and insight.

II. Why Does Regression Analysis Matter? How Does It Work?

Regression analysis, at its core, seeks to quantify the relationship between variables. In the simplest model involving two variables, it tries to predict the dependent variable (the response) from the independent variable (the predictor). This can be visualized as fitting the best possible line through a scatterplot of data points, commonly referred to as the regression line.

This helps us with:

  • Predictive capabilities: estimating the dependent variable from the independent variable(s), for example when predicting stock prices, rainfall, or test scores.
  • Understanding relationships: deciphering how variables are interconnected and how a change in one may shift the others.
  • Evaluating trends: detecting and quantifying trends within datasets to support informed decisions and predictions.

Let’s explore the foundational formula behind simple linear regression, one of the most basic and commonly used forms of regression. I will use a free online LaTeX editor to present the formulas.

Simple Linear Regression Formula:

Y = β₀ + β₁·X + ε

Here’s a breakdown of the components:

  • Y is the dependent variable (the outcome we want to predict);
  • X is the independent variable (the predictor);
  • β₀ is the intercept, the expected value of Y when X = 0;
  • β₁ is the slope, the expected change in Y for a one-unit change in X;
  • ε is the error term, the part of Y the line does not capture.

In a practical context, let’s assume we are trying to predict a student’s score (Y) based on the number of hours they study (X). Conceptually, the regression then looks like this:

Score = β₀ + β₁·Hours + ε

So, the objective of regression analysis is to find the best-fitting line through the data points that minimizes the sum of squared residuals (differences between observed and predicted values). This is achieved using various optimization techniques, such as the method of least squares in the context of linear regression.
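To make the least-squares idea concrete, here is a minimal R sketch (the study-hours and score numbers are made up for illustration) comparing the closed-form OLS estimates with what lm() returns:

```r
# Hypothetical data: study hours (X) and exam scores (Y)
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
score <- c(52, 55, 61, 60, 68, 70, 75, 78)

# Closed-form least-squares estimates:
# slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X)
b1 <- cov(hours, score) / var(hours)
b0 <- mean(score) - b1 * mean(hours)

# lm() minimizes the same sum of squared residuals,
# so its coefficients match the manual calculation
fit <- lm(score ~ hours)
c(manual_intercept = b0, manual_slope = b1)
coef(fit)
```

Either way, the fitted line minimizes the sum of squared vertical distances between the observed points and the line.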

Let’s get our hands dirty and do some practice.

III. Hands-on part: running a dummy regression model

To test our newly acquired knowledge, we can run the regression based on the information about GDP, inflation, and unemployment in European countries. In one of my previous articles, I provided a ready-made script to parse data online from the Trading Economics website (check this link to the article and GitHub source code).

Here is the excerpt we need in order to fetch the most recent (Aug 2023 to Sep 2023) macro data ad hoc.

parse_tradecon_table <- function(link = "", table_number = 1, indicator_name = "value") {
  # load the packages inside the function so they are attached automatically on every call
  library(rvest)
  library(dplyr)

  # check that the provided link is a non-empty string
  if (link == "") {
    stop("No URL provided")
  }

  # try each parsing step; if one fails, stop with an informative error message
  parsed_data <- tryCatch(read_html(link),
    error = function(e) stop("Something went wrong... Please check the link you provided."))
  parsed_table <- tryCatch(html_table(parsed_data),
    error = function(e) stop("Something went wrong... It seems there are no tables available."))
  df <- tryCatch(as.data.frame(parsed_table[[table_number]]),
    error = function(e) stop(paste0("Something went wrong... The page does not have table number ",
                                    table_number, " or any tables at all.")))

  output_df <- df %>%
    select(Country, Last) %>%
    rename(!!indicator_name := Last, country = Country)

  return(output_df)
}

infl_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/inflation-rate?continent=europe", indicator_name = "inflation")
unemp_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/unemployment-rate?continent=europe", indicator_name = "unemployment")
gdp_df <- parse_tradecon_table("https://tradingeconomics.com/country-list/gdp-annual-growth-rate?continent=europe", indicator_name = "gdp_growth")

merged_df <- infl_df %>%
full_join(unemp_df, by = c("country")) %>%
full_join(gdp_df, by = c("country"))

european_union <- c("Austria","Belgium","Bulgaria","Croatia","Cyprus",
"Czech Republic","Denmark","Estonia","Finland","France",
"Germany","Greece","Hungary","Ireland","Italy","Latvia",
"Lithuania","Luxembourg","Malta","Netherlands","Poland",
"Portugal","Romania","Slovakia","Slovenia","Spain",
"Sweden","United Kingdom")

merged_df$eu_country <- factor(ifelse(merged_df$country %in% european_union, "EU-countries", "Other countries"))

# Let's limit our scope to EU countries only
data <- merged_df %>% filter(eu_country == "EU-countries")

Now we can feed the data into a linear regression model with the simple lm() function. Let’s test whether inflation has an impact on GDP growth.

# Run regression
model <- lm(gdp_growth ~ inflation, data=data)

# Displaying the model summary
summary(model)

What do we see here? So many pieces of information, but which of them are really useful?

IV. Interpretation

First, we see the model call itself (the Call: line at the top of the summary output).

The model predicts gdp_growth using inflation as the predictor variable, on the dataset data.

Then we see the residuals of the model.

Residuals, the differences between observed and predicted gdp_growth, are summarized here. Usually we want the median to be close to 0 and no extreme outliers (for example, inspect a residual plot or a histogram of the residuals).
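These checks are easy to script. Since the scraped dataset cannot be reproduced offline, the sketch below refits the same kind of model on simulated data (27 made-up observations, with variable names matching the article) and inspects the residuals:

```r
# Simulated stand-in for the scraped EU data (values are made up)
set.seed(42)
demo_data <- data.frame(inflation = runif(27, 1, 12))
demo_data$gdp_growth <- 1.4 - 0.15 * demo_data$inflation + rnorm(27, sd = 1.5)

model <- lm(gdp_growth ~ inflation, data = demo_data)
res <- resid(model)

summary(res)                                  # the median should sit near 0
hist(res, breaks = 10, main = "Residuals")    # look for rough symmetry
plot(fitted(model), res)                      # no obvious pattern expected
abline(h = 0, lty = 2)
```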

These are the coefficients:

  • (Intercept) of 1.4060 suggests that if inflation is 0, the predicted gdp_growth is 1.4060. Straightforward. Basic algebra.
  • inflation of -0.1537 means that a one-unit increase in inflation is associated with a decrease of 0.1537 in gdp_growth.

Statistically speaking, significance testing checks the null hypothesis that each coefficient is equal to zero (no effect). To start with regression, keep in mind that a low p-value (< 0.05) indicates that you can reject the null hypothesis (which in natural language roughly translates to: the estimated effect is unlikely to be pure chance).

  • (Intercept): p = 0.0441, which is < 0.05, suggesting that the intercept is statistically significant.
  • inflation: p = 0.1413, suggesting that inflation is not a significant predictor at the 5% significance level.
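Rather than reading the p-values off the printed summary, you can pull them out programmatically from the coefficient table. A sketch on simulated data (the real scraped dataset is not reproducible here):

```r
# Fit a model on simulated data shaped like the article's
set.seed(1)
d <- data.frame(inflation = runif(27, 1, 12))
d$gdp_growth <- 1.4 - 0.15 * d$inflation + rnorm(27, sd = 1.5)
m <- lm(gdp_growth ~ inflation, data = d)

# coef(summary(m)) is a matrix with columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(m))
p_values <- coefs[, "Pr(>|t|)"]
p_values < 0.05   # TRUE where the null hypothesis is rejected at the 5% level
```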


Next comes the residual standard error (RSE).

Basically, the RSE measures the average amount by which the response (gdp_growth) deviates from the fitted regression line. Here, the actual gdp_growth typically differs from the predicted value by approximately 1.588 units. Just picture that in your head; take your time.
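For intuition, the RSE can be computed by hand as the square root of the residual sum of squares divided by the residual degrees of freedom (n minus 2 for a simple regression). A sketch on simulated data:

```r
# Simple simulated regression (values are made up)
set.seed(7)
d <- data.frame(x = 1:30)
d$y <- 2 + 0.5 * d$x + rnorm(30)
m <- lm(y ~ x, data = d)

# RSE = sqrt(sum of squared residuals / residual degrees of freedom)
rse_manual <- sqrt(sum(resid(m)^2) / df.residual(m))
rse_manual
sigma(m)   # lm's built-in residual standard error; matches the manual value
```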

Then we see R^2 and Adjusted R^2.

  • R² (0.08133): approximately 8.1% of the variability in gdp_growth is explained by inflation.
  • Adjusted R² (0.046): R² adjusted for the number of predictors in the model. If it is considerably lower than R², that is a sign that some predictors might not be adding value to the model; checking this is part of the usual drill.
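Both numbers are simple to recompute yourself, which helps demystify them. A sketch on simulated data:

```r
# Simple simulated regression (values are made up)
set.seed(3)
d <- data.frame(x = runif(25))
d$y <- 1 + 2 * d$x + rnorm(25, sd = 0.5)
m <- lm(y ~ x, data = d)

# R^2 = 1 - SS_residual / SS_total
ss_res <- sum(resid(m)^2)
ss_tot <- sum((d$y - mean(d$y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R^2 penalizes extra predictors: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(d); p <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

c(r2_manual, summary(m)$r.squared)          # should agree
c(adj_r2_manual, summary(m)$adj.r.squared)  # should agree
```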

Last but not least: the F-statistic and its p-value.

The F-statistic is a ratio of variances and is used to test the overall significance of the model. The null hypothesis is that all the regression coefficients are equal to zero. The p-value associated with the F-statistic (p = 0.1413) is not below 0.05, suggesting that the model is not statistically significant at the 5% significance level.
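The F-statistic and its p-value can also be extracted in code. Note that with a single predictor the overall F-test is equivalent to the slope's t-test (F equals t squared), which is why both p-values are 0.1413 in the summary above. A sketch on simulated data:

```r
# Fit a model on simulated data shaped like the article's
set.seed(11)
d <- data.frame(inflation = runif(27, 1, 12))
d$gdp_growth <- 1.4 - 0.15 * d$inflation + rnorm(27, sd = 1.5)
m <- lm(gdp_growth ~ inflation, data = d)

# summary(m)$fstatistic holds the F value and its degrees of freedom
fstat <- summary(m)$fstatistic
p_overall <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                       lower.tail = FALSE))
p_overall   # with one predictor this equals the slope's p-value
```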

What did we understand from all of that? I would pay attention only to the following:

  • While the intercept is significant, inflation does not significantly predict gdp_growth, given the current model and data. Okay.
  • The R² value is quite low, indicating that the model does not explain much of the variability in gdp_growth.
  • The model is not significantly better at explaining gdp_growth than a null model.

Let’s look at the model graphically and eyeball the regression.

# Load the necessary library
library(ggplot2)

# Use ggplot2 to create a scatter plot and add the regression line for the one-predictor model
p1 <- ggplot(data, aes(x = inflation, y = gdp_growth)) +
  geom_point(size = 3, col = "red") +                                         # data points
  geom_smooth(method = "lm", se = FALSE, col = "black", linetype = "solid") + # fitted line
  labs(title = "Regression of GDP Growth on Inflation",
       x = "Inflation",
       y = "GDP Growth") +
  theme_minimal()

p1  # print the plot
Output for the code above

Yeah, there is some tendency, but it is very weak, and the countries are widely dispersed on both variables. Outliers make the situation worse.

So, basically, the model is bullshit. There is no strong pattern in the data. But wait... think about it another way. With this model, we have, in a sense, shown that inflation does not define the trajectory of GDP (of course, with plenty of simplifications, a limited dataset, outliers, etc.). This is a dummy example, but please keep in mind that it is okay to have insignificant models, because such models suggest the absence of a dependency (again, under many assumptions). That outcome is useful too.

Instead of a conclusion

The next steps might involve exploring other variables, transforming variables, or using different modeling techniques to better predict gdp_growth.

I would love to see that this topic interests my readers; if it does, in the future we will get familiar with the model assumptions (e.g., linearity, homoscedasticity, normality of residuals) and with models more complicated than this simple one.

One last boring piece of advice: consider practical significance, not just statistical significance, when interpreting models.

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥
