Essential Math for Machine Learning: The Pearson Correlation Coefficient

Revealing the linear relationship between two variables

5 min read · Oct 6, 2024

This article is part of the series Essential Math for Machine Learning.

Introduction

Have you ever wondered if there’s a connection between the number of hours you study and the grades you get? Or perhaps between the amount of ice cream sold and the number of sunglasses purchased? These are questions about correlation, and thankfully, we have a powerful tool to help us find the answers: the Pearson Correlation Coefficient.

Let’s dive in with a real-world example. Imagine you’re a coffee shop owner, and you suspect there’s a relationship between the daily temperature and the number of iced coffees sold. You collect data for ten days:

Day  | Temperature (°C) | Iced Coffees Sold
-----|------------------|-------------------
  1  |        20        |         15
  2  |        22        |         22
  3  |        25        |         30
  4  |        21        |         18
  5  |        28        |         35
  6  |        18        |         12
  7  |        26        |         32
  8  |        24        |         28
  9  |        29        |         38
 10  |        23        |         25

By simply glancing at the data, you might notice that as the temperature rises, the number of iced coffees sold tends to increase as well. But how can we quantify this relationship? This is where the Pearson Correlation Coefficient comes in.

Introducing the Pearson Correlation Coefficient

The Pearson Correlation Coefficient (often denoted by ‘r’) is a statistical measure that indicates the strength and direction of a linear relationship between two variables. It provides a value between -1 and 1, where:

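1 represents a perfect positive correlation (the points lie exactly on an upward-sloping straight line: as one variable increases, the other increases at a constant rate).
-1 represents a perfect negative correlation (the points lie exactly on a downward-sloping straight line: as one variable increases, the other decreases at a constant rate).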
0 indicates no linear correlation between the variables.

Coefficient Range | Strength    | Direction
------------------|-------------|----------
  0.9 to  1.0     | Very Strong | Positive
  0.7 to  0.9     | Strong      | Positive
  0.5 to  0.7     | Moderate    | Positive
  0.3 to  0.5     | Weak        | Positive
  0.0 to  0.3     | Very Weak   | Positive
  0.0 to -0.3     | Very Weak   | Negative
 -0.3 to -0.5     | Weak        | Negative
 -0.5 to -0.7     | Moderate    | Negative
 -0.7 to -0.9     | Strong      | Negative
 -0.9 to -1.0     | Very Strong | Negative
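
To make the table above concrete, here is a minimal sketch of a lookup helper. The function name `describe_correlation` is made up for this illustration, and because neighbouring ranges in the table share their endpoints, the sketch assumes a boundary value such as 0.7 belongs to the stronger band. These labels are a rule of thumb rather than a formal standard.

```python
def describe_correlation(r):
    """Map a Pearson coefficient to the strength/direction labels in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    magnitude = abs(r)
    # Assumption: a shared boundary (0.3, 0.5, 0.7, 0.9) goes to the stronger band.
    if magnitude >= 0.9:
        strength = "Very Strong"
    elif magnitude >= 0.7:
        strength = "Strong"
    elif magnitude >= 0.5:
        strength = "Moderate"
    elif magnitude >= 0.3:
        strength = "Weak"
    else:
        strength = "Very Weak"

    if r > 0:
        direction = "Positive"
    elif r < 0:
        direction = "Negative"
    else:
        direction = "None"  # exactly 0: no linear correlation at all
    return strength, direction


for value in (0.95, -0.62, 0.1):
    print(value, describe_correlation(value))
```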

Diving into the Math

The Pearson Correlation Coefficient is calculated using the following formula:

r = (Σ[(xi - x̄)(yi - ȳ)]) / (√[Σ(xi - x̄)²] * √[Σ(yi - ȳ)²])

Where:

xi and yi are individual data points in the two variables (temperature and iced coffees sold in our example).
x̄ is the mean (average) of the x values.
ȳ is the mean (average) of the y values.
Σ denotes the sum of the values.
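
If you prefer to read formulas as code, the sketch below translates the expression above term by term, using the coffee shop data from the introduction. The function name `pearson_from_deviations` is an illustrative choice, not something from a library, and the sketch assumes neither variable is constant.

```python
import math


def pearson_from_deviations(x, y):
    """Compute r as the formula reads: sums of deviations from the means."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    x_bar = sum(x) / len(x)  # x̄
    y_bar = sum(y) / len(y)  # ȳ

    # Numerator: Σ[(xi - x̄)(yi - ȳ)]
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Denominator: √[Σ(xi - x̄)²] * √[Σ(yi - ȳ)²]
    sum_sq_x = sum((xi - x_bar) ** 2 for xi in x)
    sum_sq_y = sum((yi - y_bar) ** 2 for yi in y)
    denominator = math.sqrt(sum_sq_x) * math.sqrt(sum_sq_y)

    # Assumption: neither variable is constant, otherwise this divides by zero.
    return numerator / denominator


temperature = [20, 22, 25, 21, 28, 18, 26, 24, 29, 23]
iced_coffees_sold = [15, 22, 30, 18, 35, 12, 32, 28, 38, 25]

r = pearson_from_deviations(temperature, iced_coffees_sold)
print(round(r, 4))  # 0.9916
```

A value this close to 1 matches the impression from the table of daily sales: hotter days go with more iced coffees.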

To understand this formula, let’s break it down into two key components:

Covariance: The numerator, Σ[(xi - x̄)(yi - ȳ)], is the covariance between the two variables multiplied by n, the number of data points (the covariance itself is the average of these products). Covariance measures how much two variables change together. A positive covariance indicates that they tend to move in the same direction (when one increases, the other tends to increase), while a negative covariance indicates they tend to move in opposite directions.
Standard Deviation: The denominator, (√[Σ(xi - x̄)²] * √[Σ(yi - ȳ)²]), is the product of the standard deviations of the two variables, also multiplied by n. Standard deviation measures the amount of variation or dispersion in a set of values. Because the factor n appears in both the numerator and the denominator, it cancels out.

Essentially, the Pearson Correlation Coefficient is the covariance divided by the product of the standard deviations. This normalization by the standard deviations scales the covariance to a value between -1 and 1, regardless of the units in which the variables are measured.
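
One way to see what the normalization buys us is to change the units of one variable. The sketch below is an illustration under a few assumptions: it uses the population versions of covariance and standard deviation (dividing by n, via `statistics.pstdev` from the standard library), the helper `population_covariance` is a made-up name, and the Fahrenheit column is simply the coffee shop temperatures converted.

```python
from statistics import pstdev


def population_covariance(x, y):
    """Average product of the deviations from the means (divides by n)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


temperature_c = [20, 22, 25, 21, 28, 18, 26, 24, 29, 23]
iced_coffees_sold = [15, 22, 30, 18, 35, 12, 32, 28, 38, 25]
temperature_f = [c * 9 / 5 + 32 for c in temperature_c]  # same days, different unit

cov_c = population_covariance(temperature_c, iced_coffees_sold)
cov_f = population_covariance(temperature_f, iced_coffees_sold)

# r = covariance / (std of x * std of y)
r_c = cov_c / (pstdev(temperature_c) * pstdev(iced_coffees_sold))
r_f = cov_f / (pstdev(temperature_f) * pstdev(iced_coffees_sold))

print(f"covariance: {cov_c:.2f} (Celsius) vs {cov_f:.2f} (Fahrenheit)")
print(f"r:          {r_c:.4f} (Celsius) vs {r_f:.4f} (Fahrenheit)")
```

The covariance changes with the unit (it grows by the conversion factor 1.8), while r stays the same, which is why r is the easier number to compare across datasets.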

Why closer to 1 or -1 means stronger correlation

Think of it this way:

When two variables have a strong positive relationship, their deviations from the mean tend to be in the same direction. If xi is above its mean, yi is likely above its mean as well. This leads to positive products (xi - x̄)(yi - ȳ) and a larger numerator in the formula.
Similarly, for a strong negative relationship, the deviations tend to be in opposite directions. If xi is above its mean, yi is likely below its mean. This leads to negative products and a larger negative numerator.
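In both cases, the stronger the relationship, the more consistently these products share the same sign, so fewer of them cancel out and the numerator grows in absolute value. The denominator is always positive (as long as neither variable is constant) and is the largest absolute value the numerator can reach, so the overall coefficient r gets closer to 1 or -1, hitting those values only when the points lie exactly on a straight line.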
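
To see this consistency in actual numbers, the short sketch below prints the deviation products (xi - x̄)(yi - ȳ) for the coffee shop data from the introduction. It is only an illustration; the variable names are arbitrary.

```python
temperature = [20, 22, 25, 21, 28, 18, 26, 24, 29, 23]
iced_coffees_sold = [15, 22, 30, 18, 35, 12, 32, 28, 38, 25]

x_bar = sum(temperature) / len(temperature)
y_bar = sum(iced_coffees_sold) / len(iced_coffees_sold)

# One product per day: positive when both values sit on the same side of their means.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(temperature, iced_coffees_sold)]

for day, (x, y, p) in enumerate(zip(temperature, iced_coffees_sold, products), start=1):
    print(f"day {day:2d}: x - mean = {x - x_bar:+5.1f}, "
          f"y - mean = {y - y_bar:+6.1f}, product = {p:6.1f}")

same_direction = sum(1 for p in products if p > 0)
print(f"{same_direction} of {len(products)} products are positive; "
      f"their sum is {sum(products):.1f}")
```

Every one of the ten products is positive, so nothing cancels and the numerator (271.0) ends up very close to its maximum possible value.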

Let’s look at a few examples to illustrate this:

Example 1: Strong Positive Correlation

X = [1, 2, 3]
Y = [2, 4, 6]

Here Y is exactly 2X, so every point lies on an upward-sloping straight line and r = 1.

Example 2: Strong Negative Correlation

X = [1, 2, 3]
Y = [6, 4, 2]

If you calculate r for this example, you'll get r = -1.

Example 3: Weak Correlation

X = [1, 2, 3]
Y = [5, 2, 4]

Here, the relationship between X and Y is less clear. Calculating r gives a value of approximately -0.33, indicating a weak negative correlation.

Calculating the Pearson Correlation Coefficient in Python

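Here’s how you can calculate the Pearson Correlation Coefficient using Python without relying on any external libraries. The function uses an algebraically equivalent form of the formula, (nΣxy - ΣxΣy) / √[(nΣx² - (Σx)²)(nΣy² - (Σy)²)], which works from running sums and does not need the means computed first. The code is available in this colab notebook.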

def pearson_correlation(x, y):
  """
  Calculates the Pearson correlation coefficient between two lists of numbers.

  Args:
    x: The first list of numbers.
    y: The second list of numbers.

  Returns:
    The Pearson correlation coefficient between x and y.
  """

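  if len(x) != len(y):
    raise ValueError("x and y must have the same length")

  # Running sums used by the computational form of the formula
  n = len(x)
  sum_x = sum(x)
  sum_y = sum(y)
  sum_x_sq = sum(x_i**2 for x_i in x)
  sum_y_sq = sum(y_i**2 for y_i in y)
  sum_xy = sum(x_i * y_i for x_i, y_i in zip(x, y))

  # Same ratio as the deviation formula, with numerator and denominator both scaled by n
  numerator = n * sum_xy - sum_x * sum_y
  denominator = ((n * sum_x_sq - sum_x**2) * (n * sum_y_sq - sum_y**2))**0.5

  if denominator == 0:
    # r is undefined when either variable is constant; 0 is returned here by convention
    return 0
  return numerator / denominator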

# Example usage with our coffee shop data:
temperature = [20, 22, 25, 21, 28, 18, 26, 24, 29, 23]
iced_coffees_sold = [15, 22, 30, 18, 35, 12, 32, 28, 38, 25]

correlation = pearson_correlation(temperature, iced_coffees_sold)
print(f"Pearson Correlation Coefficient for coffee shop data: {correlation}") 

# Example usage with the examples above:
X1 = [1, 2, 3]
Y1 = [2, 4, 6]
correlation1 = pearson_correlation(X1, Y1)
print(f"Pearson Correlation Coefficient for Example 1: {correlation1}") 

X2 = [1, 2, 3]
Y2 = [6, 4, 2]
correlation2 = pearson_correlation(X2, Y2)
print(f"Pearson Correlation Coefficient for Example 2: {correlation2}") 

X3 = [1, 2, 3]
Y3 = [5, 2, 4]
correlation3 = pearson_correlation(X3, Y3)
print(f"Pearson Correlation Coefficient for Example 3: {correlation3}")

Output:

Pearson Correlation Coefficient for coffee shop data: 0.9916327322387289
Pearson Correlation Coefficient for Example 1: 1.0
Pearson Correlation Coefficient for Example 2: -1.0
Pearson Correlation Coefficient for Example 3: -0.3273268353539886

Takeaways

The Pearson Correlation Coefficient helps us quantify the linear relationship between two variables.
It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.
The formula measures how consistently the two variables deviate from their means in the same direction (positive correlation) or opposite directions (negative correlation).
Remember that correlation doesn’t equal causation! A strong correlation doesn’t necessarily mean one variable causes the other to change.
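
One last note on the word "linear" in the takeaways: a coefficient of 0 does not mean the variables are unrelated. The sketch below uses made-up data where y is completely determined by x (y = x²), yet the deviation products cancel exactly, so the numerator of the formula, and therefore r, is 0.

```python
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but along a U-shape, not a line

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Products on the left half of the U cancel those on the right half.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
numerator = sum(products)

print(numerator)  # 0.0, so r = 0 despite the perfect (nonlinear) dependence
```

Plotting the data before trusting a single number is a cheap way to catch cases like this.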