Regression and Correlation Analysis: Understanding Relationships in Data

Hamton Wicaksono
4 min read · Sep 1, 2023


A. Regression Inference Analysis

Regression is a statistical analysis approach for determining the relationship between one or more independent variables (x) and a dependent variable (y). The independent variables are used to predict or explain changes in the dependent variable.

Example of a two-variable correlation graph. Source: diplayr.com

Example :
A researcher wishes to learn more about the association between height (independent variable) and weight (dependent variable) in a sample of people. The researcher obtains height and weight data from this group and does regression analysis to see if there is a link between the two variables. If there is a significant positive connection between height and weight, the researcher can use regression to estimate a person’s weight based on their height.
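
As a rough illustration of this example, the sketch below fits a straight line to a small set of made-up height and weight values with NumPy; the numbers are purely hypothetical and only show how a fitted equation can then be used to estimate weight from height.

import numpy as np

# Hypothetical height (cm) and weight (kg) values for illustration only
height_cm = np.array([160, 165, 170, 175, 180, 185])
weight_kg = np.array([55, 60, 66, 70, 77, 82])

# Fit the straight line: weight = slope * height + intercept
slope, intercept = np.polyfit(height_cm, weight_kg, deg=1)

# Use the fitted equation to estimate the weight of a new person
new_height = 172
predicted_weight = slope * new_height + intercept
print(f"weight = {slope:.2f} * height + {intercept:.2f}")
print(f"estimated weight at {new_height} cm: {predicted_weight:.1f} kg")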

Regression is frequently used in many sectors, including business, finance, and economics. It can, for example, be used in e-commerce to forecast sales based on factors such as pricing and marketing activities.

The difference between correlation and regression is as follows:

Purpose:

a. The goal of correlation analysis is to identify whether or not there is a relationship between two variables and how strong that relationship is.

b. The purpose of regression is to predict the value of the dependent variable based on the value of the predictor variable.

Relationship Direction:

a. Correlation measures the strength and direction of the relationship between two variables without considering the role of each variable.

b. Regression assigns each variable a role: it models how the dependent variable changes in response to the independent variable(s).

Variables:
a. Correlation involves two variables that are related to each other.

b. Regression involves at least one independent variable and one dependent variable.

Analysis Type:
a. Correlation can be done with simple correlation analysis or multiple correlation analysis.

b. Regression can be carried out with simple linear regression or multiple linear regression.

Outputs:
a. Correlation produces a correlation coefficient, which indicates the strength and direction of the relationship between two variables.

b. Regression produces a regression equation, which can be used to predict the value of the dependent variable based on the value of the independent variable.
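
To make this contrast in outputs concrete, here is a minimal Python sketch on made-up numbers: correlation returns a single coefficient, while regression returns an equation with a slope and an intercept that can be used for prediction.

import numpy as np

# Made-up paired observations
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Correlation: a single coefficient describing strength and direction
r = np.corrcoef(x, y)[0, 1]

# Regression: an equation y = slope * x + intercept used for prediction
slope, intercept = np.polyfit(x, y, deg=1)

print(f"correlation coefficient r = {r:.3f}")
print(f"regression equation: y = {slope:.2f}x + {intercept:.2f}")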

Pearson Correlation and the Correlation Coefficient

The Pearson correlation coefficient is the most common type of correlation coefficient, especially when the data has a normal distribution. It assesses the linear relationship between two variables measured on an interval or ratio scale. Its value ranges from -1 (a perfect negative, or inverse, relationship) to +1 (a perfect positive, or direct, relationship).

The Pearson correlation formula, which involves the sums of the products and squares of the variables being tested, is used to obtain the correlation coefficient (r).

Problem: A company wants to know whether there is a relationship between employees’ work experience and their work performance.

To calculate the correlation between work experience and employee work performance using Pearson correlation, we can use the formula:

Pearson Correlation Formula (Source: Teknikelektronika.com):

r = (nΣxy - ΣxΣy) / sqrt([nΣx² - (Σx)²][nΣy² - (Σy)²])

with an explanation of the formula in this case as follows:

- r = Pearson’s correlation coefficient

- n = number of data pairs

- Σxy = the sum of the products of each pair of values on the two variables

- Σx = the sum of all values in the work experience variable

- Σy = the sum of all values in the work performance variable

- Σx² = sum of squares of each value in the work experience variable

- Σy² = sum of squares of each value in the work performance variable

Employee Data (Case Study):
Work experience (x): 2, 4, 6, 8, 10
Work performance (y): 5, 7, 8, 9, 10

Completion Method:

Σxy = (2×5) + (4×7) + (6×8) + (8×9) + (10×10) = 258
Σx = 2 + 4 + 6 + 8 + 10 = 30
Σy = 5 + 7 + 8 + 9 + 10 = 39
Σx² = 2² + 4² + 6² + 8² + 10² = 220
Σy² = 5² + 7² + 8² + 9² + 10² = 319

n = 5
r = (nΣxy - ΣxΣy) / sqrt([nΣx² - (Σx)²][nΣy² - (Σy)²])
  = ((5×258) - (30×39)) / sqrt([(5×220) - 30²][(5×319) - 39²])
  = 120 / sqrt(200 × 74)
  ≈ 0.986

According to this computation, the Pearson correlation coefficient is approximately 0.986. This indicates a very strong positive relationship between work experience and employee performance.
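
For readers who prefer to verify the arithmetic, the short Python sketch below reproduces the case-study calculation with NumPy; the built-in np.corrcoef gives the same result as the manual formula.

import numpy as np

x = np.array([2, 4, 6, 8, 10])   # work experience
y = np.array([5, 7, 8, 9, 10])   # work performance

n = len(x)
# Manual Pearson formula, exactly as in the calculation above
r_manual = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x ** 2) - np.sum(x) ** 2) * (n * np.sum(y ** 2) - np.sum(y) ** 2)
)
# Built-in check
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))  # both print 0.986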

Types of Regression

a. Linear Regression: Linear regression seeks the best-fitting straight line to reflect the linear relationship between one dependent variable and one or more independent variables. It is used to model the linear relationship between these variables.

b. Multiple Linear Regression: This extends the idea to several independent variables that can affect the dependent variable. It looks for the best-fitting linear equation that explains the connection between the dependent variable and the set of independent variables.

c. Logistic Regression: When the dependent variable is categorical or binary, logistic regression is used. It estimates the probability of a given outcome based on the values of the independent variables.
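
As a small illustration of the categorical case, the sketch below fits a logistic regression with scikit-learn on invented data (hours studied vs. pass/fail); the variable names and numbers are assumptions made for the example only.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (independent) vs. pass/fail (binary dependent)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Estimated probability of passing for a new observation
print(model.predict_proba([[4.5]])[0, 1])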

B. Polynomial Regression & Random Forest Regression

Polynomial Regression: Polynomial regression is a regression method that uses polynomial models to model the relationship between independent variables (x) and a dependent variable (y). It can handle more complicated relationships than linear regression.
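
A minimal sketch of polynomial regression with NumPy, assuming a small made-up data set that follows a roughly quadratic pattern:

import numpy as np

# Made-up data following a roughly quadratic pattern
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2, 36.1])

# Fit a degree-2 polynomial: y = a*x^2 + b*x + c
coefficients = np.polyfit(x, y, deg=2)
model = np.poly1d(coefficients)

print(model)        # the fitted polynomial equation
print(model(3.5))   # prediction for a new x value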

Random Forest Regression: Random forest regression is a machine learning approach used to predict numerical values based on several predictor variables. To make more accurate predictions, it employs an ensemble of decision trees.

The method samples the data at random and constructs numerous decision trees, each using a different subset of predictor variables. For a new observation, each tree produces a prediction, and the average of these predictions is used as the final forecast.

Random forest regression excels at dealing with complex and non-linear data, and it provides insights into the significance of each predictor variable.
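
The sketch below shows one possible way to fit a random forest regressor with scikit-learn; the synthetic data set and parameter choices are illustrative assumptions, but it demonstrates both the averaged prediction and the per-feature importance scores mentioned above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: three predictor variables, one non-linear numerical target
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=200)

# Ensemble of decision trees, each trained on a random sample of the data
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

print(model.predict([[5.0, 2.0, 7.0]]))   # averaged prediction for a new observation
print(model.feature_importances_)         # relative importance of each predictor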

Regression analysis is a useful technique for understanding variable relationships and creating predictions. Linear regression investigates linear relationships, whereas polynomial regression investigates more complex patterns.

Logistic regression is used to predict categorical outcomes. A machine learning technique called random forest regression uses many decision trees to predict numerical values, providing accuracy and insights. Each sort of regression has particular applications ranging from research to industry.

