Descriptive Statistics with Python — Learning Day 5

Correlation and causation

Gianpiero Andrenacci
14 min read · Jul 30, 2024
Descriptive Statistics with Python — All rights reserved

Correlation vs. Causation: Understanding the Difference

In data analysis, one of the most important concepts to understand is the difference between correlation and causation. Although the two can coexist, it is essential to recognize that correlation does not imply causation.

Correlation

Correlation is a statistical measure that describes the extent to which two variables change together. If two variables have a strong correlation, it means that when one variable changes, the other tends to change in a specific direction (positive or negative). However, correlation alone does not provide evidence that one variable causes the other to change.

Types of Correlation

Correlation can be categorized into three types:

  1. Positive Correlation: When one variable increases, the other variable also increases. For example, height and weight often show a positive correlation — taller individuals tend to weigh more.
  2. Negative Correlation: When one variable increases, the other variable decreases. For instance, the speed of a car and the travel time for a fixed distance usually show a negative correlation — the faster the speed, the shorter the travel time.
  3. No Correlation: Changes in one variable are not associated with any consistent change in the other. For example, shoe size and exam scores typically show no correlation.

Scatterplot: Visualizing Correlation

A scatterplot is a graphical representation used to depict the relationship between two variables. Each point on the scatterplot represents a pair of values. By observing the pattern of the points, we can infer the type and strength of the correlation.

Here is an example using Python and the matplotlib library to create a scatterplot. This code creates a single figure with three subplots, each representing a different type of correlation: positive, negative, and no correlation. The plt.subplot function is used to arrange the plots in a 1x3 grid for easy comparison.

import matplotlib.pyplot as plt

# Sample data for positive correlation
x_pos = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_pos = [2, 4, 5, 6, 8, 10, 12, 14, 16, 18]

# Sample data for negative correlation
x_neg = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_neg = [18, 16, 14, 12, 10, 8, 6, 5, 4, 2]

# Sample data for no correlation
x_none = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_none = [5, 9, 3, 12, 7, 6, 11, 4, 15, 8]

# Creating the scatterplots
plt.figure(figsize=(15, 5))

# Positive correlation
plt.subplot(1, 3, 1)
plt.scatter(x_pos, y_pos)
plt.title('Positive Correlation')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')

# Negative correlation
plt.subplot(1, 3, 2)
plt.scatter(x_neg, y_neg)
plt.title('Negative Correlation')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')

# No correlation
plt.subplot(1, 3, 3)
plt.scatter(x_none, y_none)
plt.title('No Correlation')
plt.xlabel('X Variable')
plt.ylabel('Y Variable')

plt.tight_layout()
plt.show()
Scatterplots showing positive, negative, and no correlation

Correlation Coefficient

The correlation coefficient is a numerical measure that quantifies the strength and direction of the relationship between two variables. The most commonly used correlation coefficient is the Pearson correlation coefficient (r), which ranges from -1 to 1.

  • r = 1: Perfect positive correlation
  • r = -1: Perfect negative correlation
  • r = 0: No linear correlation

To calculate the Pearson correlation coefficient in Python, we can use the numpy library:

import numpy as np

# Sample data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 4, 5, 7, 8, 9, 9, 10]

# Calculating the Pearson correlation coefficient
correlation_matrix = np.corrcoef(x, y)
correlation_coefficient = correlation_matrix[0, 1]
print(f"Pearson correlation coefficient: {correlation_coefficient}")
Pearson correlation coefficient: 0.830431826364408

Interpreting Correlation

When interpreting the correlation coefficient, consider the following:

  • Strength: The closer the coefficient is to ±1, the stronger the correlation.
  • Direction: A positive coefficient indicates a positive relationship, while a negative coefficient indicates a negative relationship.

When the correlation coefficient (r) is equal to 1 (or -1), it indicates a perfect positive (or negative) correlation. This is a limit case and, in real-world data, is extremely rare. If you encounter a correlation of exactly 1 or -1, it is prudent to check the data and the methodology applied for the calculation.

Such a perfect correlation might suggest:

  • Data Errors: There could be data entry mistakes or duplicated data.
  • Methodological Issues: The approach used to calculate the correlation may be flawed or improperly implemented.
  • Artificial Data: The data set might be artificially constructed or manipulated to produce a perfect correlation.

In these situations, it’s important to scrutinize the data and the computational methods to ensure the integrity and validity of the analysis.
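
For instance, a suspiciously perfect correlation often turns out to be one column that was derived from another. Here is a minimal sketch (with invented numbers) in which a "score" column is just a rescaled copy of another column, so r comes out as 1 to machine precision:

import numpy as np

# Hypothetical example: "score" is a rescaled copy of "hours_studied",
# not an independent measurement
hours_studied = np.array([2, 4, 5, 7, 9, 11])
score = 10 * hours_studied + 5

r = np.corrcoef(hours_studied, score)[0, 1]
print(f"Pearson correlation coefficient: {r}")  # 1.0 to machine precision -- a red flag, not a discovery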

What is more, always remember that correlation does not imply causation. Two variables might be correlated due to a third variable influencing both (confounding variable).

Impact of Range Restrictions

Typically, the value of the correlation coefficient decreases when the range of possible X or Y scores is restricted. This effect is similar to zooming in on a subset of the original data, which can obscure the overall pattern. For example, in a dataset showing the relationship between height and weight among college students, the correlation (r) might be .70.

However, if we limit the dataset to students taller than 6 feet 2 inches, the correlation may drop to .10 due to the reduced variability in weight among these taller students.

Range restrictions can be unavoidable in some cases. For instance, colleges that only accept students with SAT scores above a certain threshold inadvertently restrict the range of SAT scores. This restriction can lower the correlation between SAT scores and college GPAs since there are no students with lower SAT scores. It’s crucial to always check for any restrictions on the ranges of X or Y scores, whether intentional or accidental, that could potentially reduce the value of r.

However, it’s important to note that this isn’t always the case. The effect of range restriction on the correlation coefficient can depend on the specific distribution and relationship of the variables in the dataset. So while it’s a common effect, it’s not a hard and fast rule.
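
The effect is easy to reproduce with simulated data. The sketch below (all values are invented, and the cut-off of 190 is arbitrary) computes r on a full sample and again after restricting the range of X; with these settings the restricted coefficient typically comes out much lower:

import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y depends linearly on x plus noise (illustrative values only)
x = rng.uniform(150, 200, size=500)         # e.g., heights in centimeters
y = 0.9 * x + rng.normal(0, 12, size=500)   # a noisy, positively related measure

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the analysis to the upper slice of the x distribution
mask = x > 190
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"r over the full range:       {r_full:.2f}")
print(f"r over the restricted range: {r_restricted:.2f}")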

Dealing with Outliers

In real-world research, investigators typically work with a large number of data points, making the impact of outliers less dramatic. However, outliers can still significantly affect the value of the correlation coefficient (r), complicating the interpretation of the results. It’s essential to recognize and address outliers, as they can distort the true relationship between variables and lead to misleading conclusions. Proper data analysis should include strategies for identifying and mitigating the influence of outliers to ensure accurate and reliable results.
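
As a quick illustration, the sketch below reuses the positive-correlation sample from the scatterplot example and adds one invented extreme point, showing how a single outlier can drag r down:

import numpy as np

# Small dataset with a clear positive relationship
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 5, 6, 8, 10, 12, 14, 16, 18])

r_clean = np.corrcoef(x, y)[0, 1]

# Add a single extreme, atypical observation and recompute
x_out = np.append(x, 11)
y_out = np.append(y, -40)

r_with_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without the outlier: {r_clean:.2f}")
print(f"r with one outlier:    {r_with_outlier:.2f}")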

Independence of r from Units of Measurement

The correlation coefficient, denoted as r, remains consistent regardless of the units of measurement used. For instance, the correlation between height and weight for a group of adults will have the same r value whether height is measured in inches or centimeters, and weight in pounds or grams. This is because r reflects the pattern among pairs of scores, devoid of any influence from the original units of measurement. Essentially, a positive r indicates that high scores on one variable tend to pair with high scores on another, while a negative r signifies that high scores on one variable tend to pair with low scores on the other, irrespective of the units used.
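
This is straightforward to verify numerically. The sketch below uses made-up height and weight values and recomputes r after converting inches to centimeters and pounds to kilograms; the coefficient is unchanged apart from floating-point noise:

import numpy as np

# Heights in inches and weights in pounds (illustrative values)
height_in = np.array([60, 62, 65, 67, 70, 72, 74])
weight_lb = np.array([115, 120, 140, 150, 160, 175, 190])

# The same measurements in metric units
height_cm = height_in * 2.54
weight_kg = weight_lb * 0.4536

r_original = np.corrcoef(height_in, weight_lb)[0, 1]
r_converted = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"r (inches, pounds):         {r_original}")
print(f"r (centimeters, kilograms): {r_converted}")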

Example of Correlation Usage in Data Science

In data science, correlation is often used to understand relationships between variables, which can inform feature selection, identify potential predictive variables, and help in exploratory data analysis. Let’s consider a practical example where we examine the correlation between advertising spending and sales revenue.

Scenario: Advertising Spend vs. Sales Revenue

Imagine we are working for a company that wants to analyze the impact of its advertising spend on sales revenue. We have collected monthly data for the past year on advertising spend and corresponding sales revenue. We want to determine if there is a relationship between these two variables.

The np.corrcoef function in NumPy calculates the Pearson correlation coefficient matrix. This matrix shows the linear relationship between pairs of variables. Unlike Pandas' .corr method, np.corrcoef works directly with arrays or matrices, making it versatile for various data formats.

import pandas as pd

# Data: Monthly advertising spend (in thousands) and sales revenue (in thousands)
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Advertising_Spend': [10, 12, 13, 15, 20, 21, 25, 22, 23, 24, 26, 30],
    'Sales_Revenue': [40, 55, 56, 60, 70, 72, 78, 65, 77, 80, 79, 83]
}

df = pd.DataFrame(data)



import matplotlib.pyplot as plt

plt.scatter(df['Advertising_Spend'], df['Sales_Revenue'])
plt.title('Advertising Spend vs Sales Revenue')
plt.xlabel('Advertising Spend (in thousands)')
plt.ylabel('Sales Revenue (in thousands)')
plt.show()

import numpy as np

correlation_matrix = np.corrcoef(df['Advertising_Spend'], df['Sales_Revenue'])
correlation_coefficient = correlation_matrix[0, 1]
print(f"Pearson correlation coefficient: {correlation_coefficient}")
Pearson correlation coefficient: 0.9481481741629049

Interpretation

  • Scatterplot: The scatterplot visually shows a positive relationship between advertising spend and sales revenue. As advertising spend increases, sales revenue also tends to increase.
  • Correlation Coefficient: The Pearson correlation coefficient is calculated to be approximately 0.95, indicating a very strong positive correlation.

Practical Implications

  • Feature Selection: Since advertising spend has a strong positive correlation with sales revenue, it can be considered a significant feature for predictive modeling.
  • Business Insights: The company can infer that increasing advertising spend is likely to lead to higher sales revenue, aiding in budget allocation decisions.

This example illustrates how correlation can be used in data science to understand the relationship between variables. By visualizing the data and calculating the correlation coefficient, we gain valuable insights that can inform business decisions and feature selection in predictive models.

In data science, calculating the correlation coefficient matrix with np.corrcoef can help identify relationships between variables. This information is crucial for:

  • Feature Selection: Identifying and removing redundant features that have high correlations (a minimal sketch follows this list).
  • Exploratory Data Analysis (EDA): Understanding the structure and relationships in the dataset.
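
A common, simple recipe for the feature-selection point above is to compute the absolute correlation matrix and drop one feature from each highly correlated pair. The sketch below is one possible implementation; the column names, sample values, and the 0.9 threshold are all illustrative choices, not a prescription:

import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one feature from each pair whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Illustrative data: 'spend_eur' is essentially a rescaled copy of 'spend_usd'
features = pd.DataFrame({
    'spend_usd': [10, 12, 13, 15, 20, 21],
    'spend_eur': [9.3, 11.1, 12.0, 13.9, 18.5, 19.4],
    'satisfaction': [7, 8, 8, 7, 9, 9],
})

reduced, dropped = drop_highly_correlated(features, threshold=0.9)
print("Dropped:", dropped)
print(reduced)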

Cautionary Note

Once again, while a high correlation indicates a strong relationship, it does not imply causation. Other factors might influence sales revenue, and it is essential to conduct further analysis, such as controlled experiments or regression modeling, to establish a causal relationship.

Using the .corr Method in Pandas

The .corr method in Pandas is used to calculate the correlation matrix for a DataFrame. This matrix shows the correlation coefficients between pairs of variables in the DataFrame. Correlation measures the linear relationship between two variables, and the .corr method allows for different types of correlation measures.

DataFrame.corr(method='pearson')
  • method: The method of correlation to use.
  • ‘pearson’ (default): Standard correlation coefficient.
  • ‘kendall’: Kendall Tau correlation coefficient.
  • ‘spearman’: Spearman rank correlation.

Example Usage

Let’s consider a DataFrame with three variables: Advertising_Spend, Sales_Revenue, and Customer_Satisfaction. We'll calculate the correlation matrix for these variables.

import pandas as pd

# Sample data
data = {
    'Advertising_Spend': [10, 12, 13, 15, 20, 21, 25, 22, 23, 24, 26, 30],
    'Sales_Revenue': [50, 55, 56, 60, 70, 72, 78, 75, 77, 80, 82, 90],
    'Customer_Satisfaction': [7, 8, 8, 7, 9, 9, 9, 8, 8, 10, 9, 10]
}

df_multi = pd.DataFrame(data)

# Calculating the correlation matrix using Pearson method
correlation_matrix = df_multi.corr(method='pearson')
print("Pearson Correlation Matrix:")
print(correlation_matrix)

# Calculating the correlation matrix using Kendall method
correlation_matrix_kendall = df_multi.corr(method='kendall')
print("\nKendall Correlation Matrix:")
print(correlation_matrix_kendall)

# Calculating the correlation matrix using Spearman method
correlation_matrix_spearman = df_multi.corr(method='spearman')
print("\nSpearman Correlation Matrix:")
print(correlation_matrix_spearman)

The output will look something like this for the three methods:

Pearson Correlation Matrix:
                       Advertising_Spend  Sales_Revenue  Customer_Satisfaction
Advertising_Spend               1.000000       0.997008               0.793439
Sales_Revenue                   0.997008       1.000000               0.799588
Customer_Satisfaction           0.793439       0.799588               1.000000

Kendall Correlation Matrix:
                       Advertising_Spend  Sales_Revenue  Customer_Satisfaction
Advertising_Spend               1.000000       0.969697               0.614510
Sales_Revenue                   0.969697       1.000000               0.648649
Customer_Satisfaction           0.614510       0.648649               1.000000

Spearman Correlation Matrix:
                       Advertising_Spend  Sales_Revenue  Customer_Satisfaction
Advertising_Spend               1.000000       0.993007               0.756969
Sales_Revenue                   0.993007       1.000000               0.778805
Customer_Satisfaction           0.756969       0.778805               1.000000
  • Diagonal Elements: The diagonal elements are all 1, as each variable is perfectly correlated with itself.
  • Off-Diagonal Elements: These elements show the correlation coefficients between different pairs of variables. For example, the Pearson correlation between Advertising_Spend and Sales_Revenue is approximately 0.997, indicating a very strong positive correlation.

The .corr method in Pandas is a powerful tool for analyzing the linear relationships between variables in a DataFrame. By specifying different correlation methods ('pearson', 'kendall', 'spearman'), you can gain insights into different types of relationships, making it a versatile function for exploratory data analysis and feature engineering in data science projects.

Varieties of Correlation Coefficients

While there are numerous types of correlation coefficients, this discussion will focus on those that are direct extensions of the Pearson correlation coefficient. Originally designed for quantitative data, the Pearson r has been adapted for various scenarios, sometimes under different names or customized versions of its original formula.

For example, to describe the correlation between ranks independently assigned by two judges to a set of science projects, you can substitute the numerical ranks into the Pearson formula, yielding a value known as Spearman’s rho coefficient, which is used for ranked or ordinal data.
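
As a quick check, the sketch below uses hypothetical ranks assigned by two judges to six projects and shows that applying the Pearson formula (np.corrcoef) directly to the ranks gives the same value as scipy.stats.spearmanr:

import numpy as np
from scipy import stats

# Hypothetical ranks assigned by two judges to six science projects
judge_a = np.array([1, 2, 3, 4, 5, 6])
judge_b = np.array([2, 1, 4, 3, 6, 5])

# Pearson's formula applied directly to the ranks...
r_on_ranks = np.corrcoef(judge_a, judge_b)[0, 1]

# ...matches Spearman's rho computed by scipy
rho, p_value = stats.spearmanr(judge_a, judge_b)

print(f"Pearson r on ranks: {r_on_ranks}")
print(f"Spearman's rho:     {rho}")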

To describe the correlation between quantitative data (e.g., annual income) and qualitative or nominal data with two categories (e.g., male and female), you can assign arbitrary numerical codes (such as 1 and 2) to the qualitative categories and then solve the Pearson formula. This adaptation is referred to as the point-biserial correlation coefficient.
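
A minimal sketch of this idea follows, with invented income figures and arbitrary 1/2 category codes; scipy.stats.pointbiserialr (which expects a 0/1 dichotomy, hence the recoding) returns the same value as the Pearson formula applied to the codes, because relabeling the two categories is just a linear transformation:

import numpy as np
from scipy import stats

# Hypothetical annual incomes (in thousands) and a two-category code (1 = group A, 2 = group B)
income = np.array([38, 42, 55, 61, 47, 52, 66, 70])
group = np.array([1, 1, 2, 2, 1, 1, 2, 2])

# Pearson's formula with the arbitrary 1/2 codes...
r_pearson = np.corrcoef(group, income)[0, 1]

# ...agrees with scipy's point-biserial function (recoded to 0/1)
r_pb, p_value = stats.pointbiserialr(group - 1, income)

print(f"Pearson r with coded categories: {r_pearson}")
print(f"Point-biserial r:                {r_pb}")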

For relationships between two ordered qualitative variables, such as attitude toward legal abortion (favorable, neutral, or opposed) and educational level (high school, some college, college graduate), you can assign ordered numerical codes (such as 1, 2, and 3) to the categories of both variables and solve the Pearson formula. This variation is known as Cramer’s phi coefficient.

The Kendall correlation coefficient, also known as Kendall’s tau, is a statistical measure used to evaluate the strength and direction of association between two ordinal or ranked variables. Unlike other correlation coefficients that rely on numerical values, Kendall’s tau is particularly useful for understanding relationships where the data is in the form of ranks.

The calculation of Kendall’s tau involves examining pairs of observations and determining whether they are concordant or discordant. Concordant pairs are those where the ranks for both variables move in the same direction; that is, if one observation is ranked higher than another in one variable, it is also ranked higher in the other variable. Discordant pairs, on the other hand, are those where the ranks move in opposite directions; if one observation is ranked higher in one variable, it is ranked lower in the other.

Kendall’s tau is computed by taking the difference between the number of concordant pairs and the number of discordant pairs, and then dividing this difference by the total number of pairs. This method provides a measure of the degree to which the rankings of the two variables align or oppose each other.

In essence, Kendall’s tau offers a nuanced view of the relationship between two ranked variables, highlighting the consistency of their orderings and providing a clear indication of whether they tend to increase together, decrease together, or show no consistent pattern at all.
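
Before turning to scipy, here is a minimal sketch of that pair-counting logic. It implements the simple version described above (no correction for ties, sometimes called tau-a) on invented, tie-free ranks; scipy's kendalltau, used in the next example, applies a tie correction by default but agrees with this on tie-free data:

from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau from concordant/discordant pair counts (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
        # pairs tied on either variable count as neither
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Tie-free illustrative ranks
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 5, 4]
print(kendall_tau_a(x, y))  # 0.6: eight concordant pairs, two discordant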

Let’s compute Kendall’s tau for a small pair of samples using scipy.stats.kendalltau.

import scipy.stats as stats

# Example data
x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]

# Calculate Kendall's tau
tau, p_value = stats.kendalltau(x, y)

print(f"Kendall's tau: {tau}")
print(f"P-value: {p_value}")
Kendall's tau: 0.6
P-value: 0.164
  • Kendall’s tau value: Indicates the strength and direction of the association.
  • P-value: Helps in hypothesis testing to determine the significance of the association.

The Kendall correlation coefficient is useful for assessing the ordinal relationship between two variables, providing insight into how one variable may be associated with another in terms of rank or order.

Causation in Data Science

Causation refers to a relationship where one event (the cause) directly leads to another event (the effect). Unlike correlation, which simply indicates that two variables are related, causation implies a direct influence. Establishing causation is crucial in making accurate predictions and decisions, but it requires more rigorous testing and evidence.

Challenges in Establishing Causation

  • Confounding Variables: These are extraneous variables that correlate with both the independent and dependent variables, potentially misleading the observed relationship.
  • Reverse Causality: This occurs when the direction of cause and effect is unclear, e.g., does increased sales lead to more advertising or vice versa?
  • Simultaneity: When two variables mutually influence each other, disentangling the cause and effect becomes complex.

Methods to Establish Causation

To establish causation, it is essential to rule out other potential explanations and confounding factors. This typically involves controlled experiments and thorough analysis. Here are the key steps involved:

Controlled Experiments: The gold standard for establishing causation is the randomized controlled trial (RCT). In an RCT, participants are randomly assigned to either the treatment group or the control group, ensuring that any differences observed are due to the treatment itself and not other factors.

Suppose a company wants to determine whether increasing advertising spend causes an increase in sales revenue. Conducting an RCT would involve randomly assigning different advertising budgets to different regions and measuring the resulting sales. By comparing the sales from regions with higher advertising spend to those with lower or no advertising, and ensuring all other factors are constant, the company can more confidently establish causation.

# Simulating a randomized controlled trial with random assignment to treatment and control
import random

population = list(range(1000)) # Simulating a population of 1000 individuals
treatment_group = random.sample(population, 500) # Randomly assigning 500 to the treatment group
control_group = list(set(population) - set(treatment_group)) # The rest are in the control group

# Simulating outcomes
treatment_outcomes = [random.gauss(70, 10) for _ in treatment_group] # Treatment group outcomes
control_outcomes = [random.gauss(65, 10) for _ in control_group] # Control group outcomes

# Analyzing the results
average_treatment_outcome = sum(treatment_outcomes) / len(treatment_outcomes)
average_control_outcome = sum(control_outcomes) / len(control_outcomes)

print(f"Average outcome for treatment group: {average_treatment_outcome}")
print(f"Average outcome for control group: {average_control_outcome}")
Average outcome for treatment group: 69.69939129006517
Average outcome for control group: 64.72221290599741

Longitudinal Studies: These studies follow subjects over time to observe how changes in one variable affect another. Although not as rigorous as RCTs, they can provide strong evidence of causation.

Statistical Techniques: Methods like regression analysis and structural equation modeling can help control for confounding variables and identify potential causal relationships (do not worry if you don’t understand this yet; we’ll cover these methods in detail in the last article of the series).

import statsmodels.api as sm

# Data: the monthly advertising spend and sales revenue DataFrame (df) from the scenario above
X = df['Advertising_Spend']
y = df['Sales_Revenue']

# Adding a constant for the intercept
X = sm.add_constant(X)

# Performing a linear regression
model = sm.OLS(y, X).fit()
results = model.summary()
print(results)
                            OLS Regression Results
==============================================================================
Dep. Variable:          Sales_Revenue   R-squared:                       0.899
Model:                            OLS   Adj. R-squared:                  0.889
Method:                 Least Squares   F-statistic:                     89.00
Date:                Fri, 12 Jul 2024   Prob (F-statistic):           2.70e-06
Time:                        13:25:29   Log-Likelihood:                -33.536
No. Observations:                  12   AIC:                             71.07
Df Residuals:                      10   BIC:                             72.04
Df Model:                           1
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                28.2534      4.387      6.441      0.000      18.479      38.028
Advertising_Spend     1.9749      0.209      9.434      0.000       1.508       2.441
==============================================================================
Omnibus:                        3.022   Durbin-Watson:                   1.692
Prob(Omnibus):                  0.221   Jarque-Bera (JB):                2.010
Skew:                          -0.963   Prob(JB):                        0.366
Kurtosis:                       2.440   Cond. No.                         73.6
==============================================================================

Instrumental Variables: Sometimes, natural experiments or external instruments can help isolate causal effects. An instrumental variable is related to the treatment but not directly to the outcome except through the treatment.
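
Dedicated libraries exist for this, but the logic can be sketched manually as two ordinary regressions (two-stage least squares). In the simulation below, every variable name and coefficient is invented: a randomly assigned campaign acts as the instrument, an unobserved confounder biases the naive regression, and the two-stage estimate typically lands close to the true effect of 3:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Simulated scenario (all numbers illustrative): an unobserved confounder drives both
# advertising spend and revenue, while a randomly assigned coupon campaign (the instrument)
# shifts spend without affecting revenue directly.
confounder = rng.normal(0, 1, n)
instrument = rng.binomial(1, 0.5, n)           # e.g., whether a region received the campaign
spend = 5 + 2 * instrument + 1.5 * confounder + rng.normal(0, 1, n)
revenue = 10 + 3 * spend + 4 * confounder + rng.normal(0, 1, n)  # true effect of spend = 3

# Naive OLS of revenue on spend is biased upward by the confounder
naive = sm.OLS(revenue, sm.add_constant(spend)).fit()

# Stage 1: predict spend from the instrument
stage1 = sm.OLS(spend, sm.add_constant(instrument)).fit()
spend_hat = stage1.fittedvalues

# Stage 2: regress revenue on the predicted spend
stage2 = sm.OLS(revenue, sm.add_constant(spend_hat)).fit()

print(f"Naive OLS estimate of the spend effect: {naive.params[1]:.2f}")
print(f"Two-stage (IV) estimate:                {stage2.params[1]:.2f}")

With a valid instrument, the two-stage estimate is consistent for the causal effect even though the naive regression is not.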

Causation is critical for making informed decisions based on data. While correlation provides initial insights into relationships between variables, establishing causation requires rigorous methods to rule out confounding factors and other potential explanations. By employing controlled experiments, longitudinal studies, and robust statistical techniques, data scientists can uncover true causal relationships that drive actionable insights.

Spurious Correlations

Spurious correlations occur when two variables appear to be related, but the relationship is actually caused by a third variable or is coincidental. These correlations can be misleading and do not imply causation.

https://www.tylervigen.com/spurious/correlation/5905_frozen-yogurt-consumption_correlates-with_violent-crime-rates
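
The mechanism behind such correlations is easy to simulate. In the sketch below (all values invented), a single confounder, temperature, drives two otherwise unrelated daily quantities; they end up clearly correlated, and the apparent relationship largely disappears once the temperature effect is removed from both:

import numpy as np

rng = np.random.default_rng(1)
n = 365

# Hypothetical daily data: temperature drives both quantities, which never influence each other
temperature = rng.normal(20, 8, n)
yogurt_sales = 50 + 3 * temperature + rng.normal(0, 10, n)
swimming_accidents = 2 + 0.5 * temperature + rng.normal(0, 3, n)

r_spurious = np.corrcoef(yogurt_sales, swimming_accidents)[0, 1]
print(f"Correlation between the two outcomes:          {r_spurious:.2f}")

# Removing the temperature effect from both series and correlating the residuals
res_yogurt = yogurt_sales - np.poly1d(np.polyfit(temperature, yogurt_sales, 1))(temperature)
res_accidents = swimming_accidents - np.poly1d(np.polyfit(temperature, swimming_accidents, 1))(temperature)
r_partial = np.corrcoef(res_yogurt, res_accidents)[0, 1]
print(f"Correlation after controlling for temperature: {r_partial:.2f}")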

It’s important to remember that just because two variables are correlated does not mean that one causes the other. Misinterpreting correlation as causation can lead to incorrect conclusions and poor decision-making.

Understanding the difference between correlation and causation is imperative for accurate data analysis. While correlation can suggest a relationship between two variables, it does not prove that one causes the other. To establish causation, more rigorous testing and evidence are needed. Recognizing spurious correlations helps avoid misleading conclusions and ensures better decision-making based on data.


Gianpiero Andrenacci

AI & Data Science Solution Manager. Avid reader. Passionate about ML, philosophy, and writing. Ex-BJJ master competitor, national & international titleholder.