Essential Math for Machine Learning: Spearman’s Rank Correlation
Revealing Monotonic Correlation between Two Variables
This article is part of the series Essential Math for Machine Learning.
Introduction
When diving into the world of machine learning, understanding relationships between variables is crucial. While Pearson’s correlation is often the go-to method for linear relationships, Spearman’s rank correlation shines when dealing with monotonic (but not necessarily linear) relationships. Let’s explore what it is, when to use it, and how to apply it in Python.
What is Spearman’s Rank Correlation?
Spearman’s rank correlation, denoted by ρ (rho), measures the strength and direction of a monotonic association between two variables. Unlike Pearson, which works with the raw data values, Spearman focuses on the ranks of those values. This makes it robust to outliers and non-linear relationships.
When to Use Spearman’s Rank Correlation
- Non-linear Relationships: If you suspect your data has a monotonic relationship that isn’t a straight line, Spearman is a great choice.
- Ordinal Data: When working with ordinal data (e.g., survey responses like “strongly disagree,” “disagree,” etc.), Spearman is more appropriate than Pearson.
- Outliers: Spearman is less sensitive to extreme values than Pearson.
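To make the contrast with Pearson concrete, here's a minimal sketch (assuming scipy is available, and using a made-up cubic dataset) showing that a perfectly monotonic but curved relationship gets a Pearson score below 1, while Spearman scores it as a perfect 1:

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)  # 1, 2, ..., 10
y = x**3              # perfectly monotonic, but strongly non-linear

print(f"Pearson:  {pearsonr(x, y)[0]:.3f}")   # below 1: curvature is penalized
print(f"Spearman: {spearmanr(x, y)[0]:.3f}")  # 1.000: the ranks agree perfectly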
The Math Behind the Scenes
While the concept of Spearman’s correlation is intuitive, understanding the underlying math helps solidify its foundation.
- Ranking the Data: The first step is to convert your raw data into ranks. For each variable, assign a rank of 1 to the smallest value, 2 to the next smallest, and so on. If there are ties, assign each tied value the average of the ranks it would otherwise occupy (see the short ranking example after this list).
- Calculating the Difference in Ranks (d): For each data point, subtract the rank of one variable from the rank of the other variable. This gives you the difference in ranks (d).
- Squaring the Differences (d²): Square each of these differences (d²) to eliminate negative values and emphasize larger discrepancies.
- Summing the Squared Differences: Add up all the squared differences (∑d²).
- The Formula: Finally, plug the sum of squared differences and the number of data points (n) into the Spearman correlation formula.
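Before moving on to the formula, here's a quick illustration of the ranking step (a sketch using scipy's rankdata and a small made-up list): the two tied 1s each receive the average of ranks 1 and 2.

from scipy.stats import rankdata

values = [3, 1, 4, 1, 5]
print(rankdata(values, method="average"))  # [3.  1.5 4.  1.5 5. ]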
The formula
ρ = 1 - (6 * ∑d²) / (n * (n² - 1))
The formula quantifies the degree to which the ranks of the two variables differ. (Strictly speaking, this shortcut formula is exact only when there are no ties; with ties, Spearman's ρ is defined as Pearson's correlation applied to the ranks, and the shortcut becomes a close approximation.) Here's how to interpret the result:
- ρ = 1: A perfect positive monotonic relationship. As one variable increases in rank, the other increases as well.
- ρ = -1: A perfect negative monotonic relationship. As one variable increases in rank, the other decreases.
- ρ = 0: No monotonic relationship between the variables. The ranks are not associated in any particular way.
- 0 < ρ < 1: A positive monotonic relationship. Higher ranks in one variable tend to correspond to higher ranks in the other, but not perfectly. The closer to 1, the stronger the association.
- -1 < ρ < 0: A negative monotonic relationship. Higher ranks in one variable tend to correspond to lower ranks in the other, but not perfectly. The closer to -1, the stronger the association.
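For a quick worked example, suppose four data points have x-ranks (1, 2, 3, 4) and y-ranks (2, 1, 4, 3). Then d = (-1, 1, -1, 1), so ∑d² = 4, and with n = 4:

ρ = 1 - (6 * 4) / (4 * (4² - 1)) = 1 - 24/60 = 0.6

a moderately strong positive monotonic association.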
Example: Coffee Consumption and Productivity
Let’s imagine a dataset where we track daily coffee consumption (in cups) and productivity levels (on a scale of 1 to 10). We suspect there’s a relationship, but it might not be perfectly linear. The code is available in this Colab notebook.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def rank_data(data):
    """Rank the data, handling ties by assigning the average rank."""
    data = np.asarray(data)
    n = len(data)
    sorted_indices = data.argsort()
    ranks = np.empty(n, dtype=float)  # float, so average ranks like 1.5 are not truncated
    i = 0
    while i < n:
        # Walk to the end of the run of values tied with data[sorted_indices[i]]
        j = i
        while j < n and data[sorted_indices[j]] == data[sorted_indices[i]]:
            j += 1
        # Sorted positions i..j-1 would receive ranks i+1..j; assign their average
        ranks[sorted_indices[i:j]] = (i + 1 + j) / 2
        i = j  # skip past the entire tie group
    return ranks
def spearman_correlation(x, y):
    """Calculate Spearman's rank correlation coefficient explicitly.

    Note: the shortcut formula below is exact only when there are no ties;
    with ties it closely approximates Pearson's correlation on the ranks.
    """
    n = len(x)
    rank_x = rank_data(x)
    rank_y = rank_data(y)
    d = rank_x - rank_y
    sum_of_squared_differences = np.sum(d**2)
    correlation = 1 - (6 * sum_of_squared_differences) / (n * (n**2 - 1))
    return correlation
# Sample Data
data = {"Coffee Cups": [1, 2, 3, 4, 5, 2, 1, 3], "Productivity": [6, 8, 7, 9, 10, 7, 5, 8]}
df = pd.DataFrame(data)
# Calculate Spearman's Rank Correlation (explicit implementation)
correlation = spearman_correlation(df["Coffee Cups"], df["Productivity"])
print(f"Spearman's rank correlation (explicit): {correlation:.3f}")
# Visualization (Scatter Plot with Trend Line)
plt.figure(figsize=(8, 6))
plt.scatter(
    df["Coffee Cups"],
    df["Productivity"],
    color="skyblue",
    label="Data Points",
)
# Add trend line (for visualization purposes); sort x so the line draws left to right
z = np.polyfit(df["Coffee Cups"], df["Productivity"], 1)
p = np.poly1d(z)
xs = np.sort(df["Coffee Cups"].to_numpy())
plt.plot(xs, p(xs), color="red", label="Trend Line")
plt.xlabel("Coffee Cups", fontsize=12)
plt.ylabel("Productivity", fontsize=12)
plt.title("Coffee Consumption vs. Productivity (Spearman Correlation)", fontsize=14)
plt.legend()
plt.grid(axis="y", alpha=0.5)
plt.show()
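As a sanity check, you can compare the explicit implementation against scipy's spearmanr (this snippet assumes the df defined above). scipy handles ties exactly, so with ties present its result may differ slightly from the shortcut formula:

from scipy.stats import spearmanr

rho, p_value = spearmanr(df["Coffee Cups"], df["Productivity"])
print(f"Spearman's rank correlation (scipy): {rho:.3f} (p-value: {p_value:.3f})")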
Running the code on our sample data gives a Spearman correlation of about 0.9 (the shortcut formula prints 0.899). This indicates a strong positive monotonic relationship between coffee consumption and productivity. While the relationship isn’t perfectly linear, it’s clear that as coffee intake increases, so does productivity (at least within the range of our data).
Conclusion
Spearman’s rank correlation is a versatile tool in a machine learning practitioner’s toolbox. It allows you to uncover relationships that Pearson’s correlation might miss, particularly when dealing with non-linearity, ordinal data, or the presence of outliers. By understanding its underlying math and leveraging Python’s capabilities, you can confidently explore and quantify associations in your data, ultimately leading to more informed modeling decisions.
Remember, Spearman’s rank correlation isn’t just about numbers; it’s about revealing the hidden patterns that drive real-world phenomena. Whether you’re analyzing survey results, financial trends, or scientific observations, this statistical technique empowers you to make sense of complex relationships and gain deeper insights into your data.