Regression: Scatterplots, Correlation and Linear Regression (Part 1)

Oluwasayo Farotimi
Mar 7, 2023


Introduction
Welcome to my first series on the topic of Regression! In this series, the goal is to share important foundational theory and to demonstrate how to implement regression analysis in Microsoft Excel, R, and Python, with an emphasis on interpreting the parameters. Whether you are a beginner in need of a simple, step-by-step explanation of regression or an intermediate professional seeking a refresher on the terminology, this is for you.

Figure 1

Review
Before we dive into regression analysis, let’s review some fundamental concepts in statistics.
There are two main areas in statistics: Descriptive Statistics and Inferential Statistics. In this series, we’ll be exploring the sub-area of Inferential Statistics that involves determining the relationship between numerical or quantitative variables, and that’s where regression analysis comes in.

Figure 2 from Wentworth Co-op & Careers

Regression (Linear)
Regression analysis is a powerful statistical tool for estimating the relationship between two or more variables. There are various types of regression analysis, some of which are:

  • Linear Regression
  • Polynomial Regression
  • Logistic Regression
  • Ridge Regression
  • Lasso Regression

However, our focus is Linear Regression.

Linear Regression uses a linear approach to studying/modeling the relationship between two or more quantitative variables (a quantitative variable is a characteristic that can be measured in amounts and can assume different values, such as height, age, or income).
These quantitative variables can be classified as:
- Dependent Variable: Also known as the response variable or outcome variable. It is exactly as the name suggests: it is the factor being investigated, the outcome whose value depends on other variables/factors.
- Independent Variable: Also known as the predictor variable or explanatory variable. This variable is used to predict the dependent variable. When we have one independent variable, it is a case of Simple Linear Regression; when we have more than one independent variable, it is a case of Multiple Linear Regression.

We use Linear Regression to:
- Establish if there is a relationship between two variables
- Forecast/predict new observations. We can use our knowledge of the relationship between the variables to predict what will happen in the future.

Identifying Dependent and Independent Variables
In order to perform linear regression analysis, it is important to correctly identify the dependent and independent variables. Here are some examples.

1. A business person may want to know whether the volume of sales for a given month is related to the amount of advertising the firm does that month.
   Independent Variable (X): Amount of Advertising
   Dependent Variable (Y): Volume of Sales

2. Educators are interested in determining whether the number of hours a student studies is related to the student’s score on a particular exam.
   Independent Variable (X): Number of hours spent studying
   Dependent Variable (Y): Student’s Score

3. A zoologist may want to know whether the birth weight of a certain animal is related to its life span.
   Independent Variable (X): Birth Weight
   Dependent Variable (Y): Lifespan

It is also important to point out that in cases of Simple Linear Regression, our data is usually in the form below:

Figure 3. Data for example (2)
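As an illustration, a minimal Python sketch of such a data table is shown below; the hours-studied and exam-score values are made up for demonstration and are not the data shown in Figure 3:

# Hypothetical (x, y) pairs for example (2): hours studied vs. exam score.
# These values are illustrative only, not the data shown in Figure 3.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]        # independent variable (X)
exam_score = [52, 55, 61, 64, 70, 74, 79, 85]   # dependent variable (Y)

# Each observation pairs one value of X with the corresponding value of Y.
for x, y in zip(hours_studied, exam_score):
    print(f"Hours studied: {x}, Exam score: {y}")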

Linear Relationships and Scatterplots
A linear relationship is one where increasing or decreasing the independent variable causes a corresponding increase or decrease in the dependent variable.
We can examine the relationship between two variables using a scatterplot.

Figure 4 from https://datavizcatalogue.com/

A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y. Simply put, a scatter plot is a graph of two variables along their axes.
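For instance, one quick way to produce such a plot in Python is with matplotlib; this is a sketch using the same hypothetical hours/score values as above:

import matplotlib.pyplot as plt

# Hypothetical data: hours studied (X) and exam score (Y).
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [52, 55, 61, 64, 70, 74, 79, 85]

# Plot each (x, y) pair as a point, with X on the horizontal axis and Y on the vertical axis.
plt.scatter(hours_studied, exam_score)
plt.xlabel("Hours studied (independent variable X)")
plt.ylabel("Exam score (dependent variable Y)")
plt.title("Scatter plot of exam score against hours studied")
plt.show()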

When plotting the data, we may observe different patterns that indicate the type of relationship that exists between variables.

Figure 5 from Textbook: Elementary Statistics, A step by step approach (Allan G. Bluman)

A linear relationship may be positive, negative, or neither (no linear relationship). The curvilinear relationship in the image above is included to make the reader aware that other types of relationships exist, though they are not the focus of this series at this time.

A positive linear relationship implies that as the independent variable increases, the dependent variable increases. A negative relationship implies that as the independent variable increases, the dependent variable decreases. In cases of no linear relationship, there is no observable pattern between the variables.

After determining the type of relationship, we can then examine how strong or weak the relationship is. We can quantify this using the linear correlation coefficient.

The Linear Correlation Coefficient

The linear correlation coefficient is referred to as the Pearson product moment correlation coefficient (PPMC), named after statistician Karl Pearson, who pioneered the research in this area.
The correlation coefficient can assume any value from -1 to +1.

It is also possible to infer the strength of a linear relationship from the scatterplot, although such a visual guess may not be accurate in every case. We are able to make these deductions because the pattern in a scatterplot is directly linked to the correlation coefficient.

Figure 6 from https://www.statlect.com/fundamentals-of-probability/linear-correlation

To calculate the linear correlation coefficient from our data, we use the formula below:

Figure 7
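In case the formula image does not render, the usual computational form of Pearson's r (which Figure 7 presumably shows; the notation in the figure may differ) is, in LaTeX:

r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}

where n is the number of (x, y) pairs in the data.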

The value calculated using this formula can be interpreted as follows (a short Python sketch after this list shows one way to apply these cutoffs):

  • A value of 0 indicates that there is no linear relationship.
  • Any value greater than 0 but less than 0.2 indicates a very weak positive linear relationship between the variables; any value between 0 and -0.2 indicates a very weak negative linear relationship.
  • Any value from 0.2 up to (but not including) 0.4 indicates a weak positive linear relationship between the variables; any value between -0.2 and -0.4 indicates a weak negative linear relationship.
  • Any value from 0.4 up to (but not including) 0.6 indicates a moderate positive linear relationship between the variables; any value between -0.4 and -0.6 indicates a moderate negative linear relationship.
  • Any value from 0.6 up to (but not including) 0.8 indicates a strong positive linear relationship between the variables; any value between -0.6 and -0.8 indicates a strong negative linear relationship.
  • Any value from 0.8 up to (but not including) 1 indicates a very strong positive linear relationship between the variables; any value between -0.8 and -1 indicates a very strong negative linear relationship.
  • A value of +1 indicates a perfect positive linear relationship between the variables, while a value of -1 indicates a perfect negative linear relationship.
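As a companion to the list above, here is a minimal Python sketch that computes r with the computational formula and maps it to one of the strength labels; the function names (pearson_r, describe_strength) and the data are purely illustrative, not part of the original article:

import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def describe_strength(r):
    """Map r to the verbal labels used in the list above."""
    strength = abs(r)
    if strength == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    if strength < 0.2:
        label = "very weak"
    elif strength < 0.4:
        label = "weak"
    elif strength < 0.6:
        label = "moderate"
    elif strength < 0.8:
        label = "strong"
    elif strength < 1:
        label = "very strong"
    else:
        label = "perfect"
    return f"{label} {direction} linear relationship"

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 85]
r = pearson_r(hours, scores)
print(round(r, 3), "-", describe_strength(r))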

So far, we have been able to determine the strength of a linear relationship between variables. Now, we will proceed to model/study this relationship using Linear Regression.

Introduction to Simple Linear Regression Model
As noted earlier, the Simple Linear Regression model has only one independent variable.

To draw a line of best fit through our scatterplot, we use a “fitted model”, as it provides an estimated line with the least deviation from our actual data points.

Figure 8 from ThoughtCo.

This fitted model is built on the basic principle of a linear equation; when we say “fitted model”, we are referring to an equation. If you are not familiar with the linear equation, do not be alarmed.

Right now, all you need to know is that the linear equation has the form:

Figure 9 from online4mathall

In statistics, our fitted line uses a different notation.

Figure 10

In Figure 8, you can observe differences between the fitted line and the actual plotted data; these deviations are described by the error term. The error term is the difference between the fitted value and the true value, and including it in our model is what completes our Simple Linear Regression Model.

Figure 11
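In case the image does not display, the complete model (which Figure 11 presumably shows, though the exact symbols in the figure may differ) can be written in the usual notation as:

y = \beta_0 + \beta_1 x + \varepsilon \quad \text{(the model, with error term } \varepsilon \text{)}

\hat{y} = b_0 + b_1 x \quad \text{(the fitted line, as in Figure 10)}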

This concludes the first part of the series.

In the next part, we will do a deep dive into the parameters of our fitted line and Simple Linear Regression (SLR). I will also share more on the assumptions of SLR, and we will use software such as Microsoft Excel, R, and Python to compute our fitted line. Lastly, we will focus on how to interpret models.

Thank you for reading,
Stay Honed.
😊
