Pearson correlation: Methodology, Limitations & Alternatives — Part 1: Methodology

Anthony Demeusy
11 min read · Jun 4, 2023


The Pearson correlation coefficient is a staple of data analysis. It is easy to compute and plenty of implementations are available. Its association with linear regression often makes it one of the first tools presented when learning to model relationships between variables. It is also commonly used for feature selection in machine learning. Nonetheless, issues with its use are often observed, misuse and misinterpretation are surprisingly common, and both the conditions under which it can be used and its limitations are not always well understood.

This article is the first in a series of three providing a non-mathematical overview of Pearson correlation analysis. These articles use layman's terms, without diving into the mathematical aspects (nothing more complicated than y=ax+b), technical implementation, or coding. They are meant to serve as a guide to properly design, understand, and interpret a Pearson correlation analysis.

This first article discusses the main properties of the Pearson correlation coefficient, while the second article focuses on its limitations, and the third one describes some of the alternatives and complements to the Pearson correlation coefficient.

I. Pearson correlation analysis overview

1. What is the Pearson correlation coefficient?

The Pearson coefficient, also known as the Pearson product-moment correlation coefficient or Bravais-Pearson correlation coefficient, is a metric that evaluates the strength of a linear relationship between two variables x and y¹. In other terms, it is an indication of how well the relationship between the two variables can be modelled using a y=ax+b form, with a and b being constants. The Pearson correlation coefficient is noted ρ for a complete population and r for a sample of observations (e.g. a cohort of patients). Data is typically available for a sample of the population, but not for the complete population. In this case, only the Pearson correlation coefficient r of the sample can be obtained, and it serves as an estimate of the actual Pearson correlation coefficient ρ of the population.
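For readers who want to see this concretely, here is a minimal sketch (in Python with NumPy, on made-up paired data) computing r as the covariance of x and y divided by the product of their standard deviations:

```python
import numpy as np

# Purely illustrative paired observations (x, y)
x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3])
y = np.array([2.0, 3.9, 6.1, 8.2, 9.8, 12.5])

# r = covariance(x, y) / (std(x) * std(y))
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)                        # ~0.99: a near-perfect positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy's built-in routine
```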

The value of the Pearson correlation coefficient always lies between -1 and 1. Let's look at a few examples on simulated datasets in order to gain a first intuition of its interpretation.

First, the absolute value of the Pearson correlation coefficient indicates the strength of the correlation. Though the thresholds are arbitrary and there are many variations, the following scale provides a good interpretation guide²:
- 0.8 to 1.0 (in absolute value): very strong correlation
- 0.6 to 0.8: strong correlation
- 0.4 to 0.6: moderate correlation
- 0.2 to 0.4: weak correlation
- 0.0 to 0.2: very weak or no correlation

An absolute value of 1 indicates a perfect correlation. As the strength of the linear relationship decreases, so does r's absolute value. Such a decrease can be caused by outliers, measurement error, noise, the effect of hidden variables, or the existence of a non-linear relationship. A value of 0 indicates that there is no linear correlation in the sample.

The sign of the Pearson coefficient is also informative, as it tells whether the two variables tend to move in the same direction or in opposite directions. A Pearson coefficient r between 0 and 1 indicates a direct, or positive, correlation, meaning that y tends to increase when x increases. Similarly, if r is between -1 and 0, it indicates an indirect, or negative, correlation: x and y tend to move in opposite directions, with y decreasing when x increases.

Note that the expression “negative correlation” can be misleading, as it may suggest the absence of correlation, or that a positive correlation is preferable. In reality, the sign is not an indication of the strength of the relationship between the variables; only the absolute value of the Pearson coefficient can inform us on that strength. A Pearson coefficient of -0.9 indicates a stronger correlation than a coefficient of 0.7, and a coefficient of 0.8 is not “better” than one of -0.8. A correlation can be very strong and negative at the same time. For this reason, some authors advocate using only the “direct correlation” and “indirect correlation” terminology².

Before looking into confidence intervals and statistical significance, let's briefly review the key properties of the Pearson correlation coefficient.

2. Properties

a. Link with the coefficient of determination

Let's start with a property that is very useful when interpreting a correlation analysis: the square of the Pearson correlation coefficient, usually noted r² and known as the coefficient of determination, represents the share of the variation of one variable that is explained, through ordinary least squares regression, by the variations of the other variable³.

As an example, let's look at life expectancy against schooling duration per country⁴. A scatterplot suggests a linear relationship.

The Pearson correlation coefficient is 0.767, indicating a strong correlation. Based on this, the value of r² is 0.59 (= 0.767²). This means that, from a statistical standpoint, a linear regression model can explain 59% of the variations of life expectancy based on schooling duration.
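As a quick numerical illustration of this property, here is a sketch (Python with NumPy; the data is simulated to resemble the example, not the actual country dataset) showing that r² matches the share of y's variance reproduced by an ordinary least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(5, 15, 200)               # simulated schooling durations (years)
y = 40 + 2.5 * x + rng.normal(0, 4, 200)  # simulated life expectancies (years)

r = np.corrcoef(x, y)[0, 1]

# Fit y = a*x + b by ordinary least squares, then compare r**2 with
# the proportion of y's variance explained by the fitted line.
a, b = np.polyfit(x, y, 1)
residuals = y - (a * x + b)
explained = 1 - np.var(residuals) / np.var(y)

print(r**2, explained)   # the two values match (up to floating-point error)
```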

b. Symmetry

The Pearson correlation coefficient is symmetric: you can express y as a function of x or the other way around, or flip the axes when plotting the data; this will not affect the Pearson correlation coefficient.

Re-using the same example as previously, the mean years of schooling can be plotted against life expectancy, and the value of the correlation coefficient remains unchanged.
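This is easy to verify in a couple of lines (a sketch assuming Python with SciPy; the values below are made up, not the actual country data):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([5.1, 10.2, 12.9, 8.0, 11.4])    # made-up schooling durations
y = np.array([52.4, 68.1, 75.0, 61.3, 70.7])  # made-up life expectancies

r_xy, _ = pearsonr(x, y)
r_yx, _ = pearsonr(y, x)
print(r_xy, r_yx)   # identical: swapping the variables changes nothing
```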

c. Insensitivity to scale and location

Unlike the covariance, the Pearson coefficient is dimensionless: it is not expressed in any unit, and it is insensitive to changes of scale and location⁵. In other terms, you can multiply x and y by any positive number, or add or subtract a constant, and this will have no effect on the result. Consequently, the units in which x and y are measured do not matter. If you are looking at the relationship between a duration and a distance, you can express x in meters, yards, or kilometers and y in seconds or in hours, or use different origins or starting points; the scatterplot will change, the apparent slope may look different, and the intercept at the origin will shift, but the correlation coefficient will remain the same. And if you multiply x or y by a negative number, only the sign of r will change. Isn't that nice?

Some real-life data can be used to illustrate this. Both scatterplots below represent the time in race and distance data for 30 random finishers of the UTMB 2022, an annual long-distance trail-running event⁶. The first chart displays the distance covered expressed in kilometers and the time since the start of the race expressed in hours. The second chart displays the distance in miles and the time in minutes.

The units on both axes are different: the values on the x-axis were converted from kilometers to miles by dividing them by 1.61, and those on the y-axis were converted from hours to minutes by multiplying them by 60. The overall slope is different, and yet the correlation coefficient remains unchanged.

Now, to illustrate what happens when adding or subtracting a constant to x or y, or when multiplying a variable by a negative number, consider a company that manufactures products upon receiving an order. Each order has a set readiness objective, and any delay of more than 3 working days incurs a financial penalty. For each order, the company can track the following:
- The delay measured against the objective
- The delay measured against the deadline before financial penalty

Instead of looking at the delay, the company can also track, for each order, the safety margin against the objective or against the deadline. The safety margin is simply the opposite of the delay: an order ready one working day early has a safety margin of 1 day and a delay of -1 day.

These 4 measures are linked through the addition or subtraction of a constant, or through multiplication by -1. Examining scatterplots of these metrics against the order size, it becomes clear that adding or subtracting a constant has no effect on the Pearson coefficient, while multiplying by -1 simply flips its sign and leaves its absolute value unchanged.
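All of these invariances are easy to check numerically. The sketch below (Python with NumPy, on simulated race-like data rather than the actual UTMB results) applies a unit conversion, a constant shift, and a sign flip to the same dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
km = np.linspace(10, 170, 30)             # simulated checkpoint distances (km)
hours = 0.15 * km + rng.normal(0, 2, 30)  # simulated times since the start (h)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(r(km, hours))              # baseline coefficient
print(r(km / 1.61, hours * 60))  # miles vs. minutes: unchanged
print(r(km, hours - 3))          # shifting by a constant (delay vs. margin): unchanged
print(r(km, -hours))             # multiplying by -1 only flips the sign
```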

d. Conditions for existence

For the Pearson correlation coefficient to be defined, the following conditions must be met:
- The variables are measured at interval or ratio level, which means that both variables are quantitative and can be represented by a real number.
- The data is organized in paired observations: there is one value of x corresponding to each value of y. They can be displayed as a two-column table.
- The variance and covariance of x and y are defined, and the variances are non-null⁷. In practice, the only risk here is to get a variance equal to 0, which happens when x and/or y keeps the same value, as in the two graphs below.

Figure: Bivariate distributions with undefined Pearson correlation coefficients
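Numerically, a zero variance makes the computation break down, since it implies a division by zero. Here is a small sketch (Python with NumPy) of what happens with a constant variable:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.full(4, 7.0)   # y never moves: its variance is 0

# The formula divides by y's standard deviation, which is 0 here, so
# NumPy emits a warning and returns nan instead of a coefficient.
print(np.corrcoef(x, y)[0, 1])   # nan
```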

It is sometimes reported that a linear relationship between the variables is needed, but the Pearson correlation coefficient can be computed without assuming any such relationship. If there is no linear relationship, its value will simply be close to zero.

Similarly, the data does not necessarily have to be normally distributed to compute the Pearson correlation coefficient⁸. For instance, you have certainly observed that, on the scatterplot showing racing times against distance, the distance measures do not follow a normal distribution. The observations are organized in vertical bars since the racing time is measured at a set of predefined distances into the race. Besides, the spread of the points increases with x, resulting in a fan shape (a sign of heteroscedasticity; more on this in the next section). Nonetheless, the existence of a linear relationship is visible and the Pearson correlation coefficient is defined.

As you can see, the conditions for existence of the Pearson correlation coefficient are rather simple. However, additional requirements come into play when looking at statistical significance and confidence intervals.

3. Statistical significance & confidence interval

In most cases, the Pearson correlation coefficient r is known only for a sample, and it serves as an estimate of the true correlation coefficient ρ for the whole population. As a consequence, there is often an interest in knowing whether the result is statistically significant and in getting a confidence interval for the actual value of the correlation coefficient ρ. This can be achieved using traditional statistical methods, but additional requirements must be fulfilled. Indeed, the data must meet a set of 4 test assumptions that can be summarized using the LINE mnemonic:

- Linearity: the relationship between the two variables is linear
- Independence: the observations are independent of one another
- Normality: the variables are (bivariate) normally distributed
- Equal variance: the variance of y remains constant as x changes (homoscedasticity)

Considering these requirements, the previous dataset describing the relationship between distance and running time for a sample of runners would not be a good candidate, as the distance values (x) are not normally distributed. Besides, the spread of the racing times (y) increases with the distance: the variance of y does not remain constant as x changes. This is heteroscedasticity, the opposite of homoscedasticity. Finally, the observations are not independent, since several data points correspond to the same runner.

If a dataset meets the requirements for a significance test, the objective is usually to check whether the correlation coefficient between the two variables is different from 0. This is achieved using a two-tailed test. The first step consists in assuming that the true Pearson correlation coefficient ρ for the complete population is equal to zero, i.e. that there is no correlation. This is the null hypothesis. If the test rejects this hypothesis, we are led to accept that ρ is not equal to 0, i.e. that there is a significant linear relationship. This is the alternative hypothesis. The two hypotheses are noted as follows:

H₀: ρ = 0
H₁: ρ ≠ 0

Under the test assumptions, and if the null hypothesis is true, the sampling distribution of r is symmetric around zero. The mathematical details are beyond the scope of this post but, from there, a t-test can be used to obtain a p-value. If the p-value is under a predefined acceptable risk threshold, for instance 5%, then the null hypothesis is rejected and the alternative hypothesis is accepted, leading to the conclusion that the correlation coefficient for the population is not null. Otherwise, as for any statistical significance test, keep in mind that failing to reject the null hypothesis does NOT imply that the null hypothesis is true.
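In practice, most statistical libraries bundle this test with the computation of r itself. For instance, scipy.stats.pearsonr returns both the coefficient and the two-tailed p-value for H₀: ρ = 0; here is a sketch on simulated data:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # simulated, genuinely correlated data

r, p_value = pearsonr(x, y)         # two-tailed test of H0: rho = 0
if p_value < 0.05:
    print(f"r = {r:.2f}, p = {p_value:.4f}: H0 rejected, significant correlation")
else:
    print(f"r = {r:.2f}, p = {p_value:.4f}: H0 cannot be rejected")
```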

Though the t-test is the most common statistical significance test for the Pearson correlation coefficient, it is valid only when using ρ = 0 as the null hypothesis. Without this assumption, it can no longer be assumed that r follows a symmetric, normal-like distribution, and the whole reasoning does not hold anymore⁹. Other methods have been described and can be used for the case H₀: ρ = ρ₀ with ρ₀ ≠ 0, i.e. when the objective is to test whether ρ differs from a non-zero value. The most common of these techniques relies on the Fisher transformation, a mathematical operation that takes the Pearson coefficient as an input and outputs a variable that approximately follows a normal distribution. Though other methods have been described for this as well, the same Fisher transformation can also be used to build a confidence interval for the Pearson correlation coefficient.
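To make this concrete, here is a sketch of such a confidence interval (Python with NumPy and SciPy; the transformation is z = arctanh(r), whose standard error is approximately 1/√(n−3)). The sample size used in the call is a hypothetical stand-in for the life-expectancy example:

```python
import numpy as np
from scipy.stats import norm

def pearson_ci(r, n, confidence=0.95):
    """Approximate confidence interval for rho via the Fisher transformation."""
    z = np.arctanh(r)             # Fisher transformation of r
    se = 1.0 / np.sqrt(n - 3)     # approximate standard error in z-space
    half_width = norm.ppf(0.5 + confidence / 2) * se
    # Transform the bounds back to the correlation scale
    return np.tanh(z - half_width), np.tanh(z + half_width)

# e.g. r = 0.767 observed on a hypothetical sample of 180 countries
print(pearson_ci(0.767, 180))   # roughly (0.70, 0.82)
```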

Nevertheless, in many real-life cases, the data does not meet the LINE requirements. But worry not! In practice, traditional estimation methods are often used with “reasonable” performance even when some of the LINE conditions are violated¹⁰. That is particularly true in case of deviation from the normality assumption: using traditional statistical methods is then not exact, but usually acceptable. Nonetheless, resampling methods based on permutation tests and bootstrapping now offer a great alternative, and they can be used even when strong assumptions regarding the distribution can't be made.
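As an illustration, a percentile bootstrap confidence interval only requires resampling the observation pairs. Here is a minimal sketch (Python with NumPy; the number of resamples and the percentile method are common, but not the only, choices):

```python
import numpy as np

def bootstrap_ci(x, y, n_boot=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the Pearson coefficient."""
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        rs[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    alpha = (1 - confidence) / 2
    return np.percentile(rs, [100 * alpha, 100 * (1 - alpha)])
```

A permutation test works in the same spirit, except that one variable's values are shuffled rather than resampled, so as to approximate the distribution of r under the null hypothesis of no association.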

This concludes the first part of this series on Pearson correlation analysis. The next article will focus on the limitations of the Pearson correlation coefficient, and the third will introduce some of the available alternatives and complements.

If you found this article helpful, please show your support by clapping for this article and considering subscribing for more articles on machine learning and data analysis. Your engagement and feedback are highly valued as they play a crucial role in the continued delivery of high-quality content.

You can also support my work by buying me a coffee. Your support helps me continue to create and share informative content. It's a simple and appreciated gesture that keeps the momentum going: Buy Me a Coffee.

References
[1] Dodge, Yadolah. 2008. The Concise Encyclopedia of Statistics. New York, NY: Springer New York.

[2] Salkind, Neil J. 2016. Statistics for People Who (Think They) Hate Statistics. 6th ed. Thousand Oaks, CA: Corwin Press.

[3] Everitt, B. S., and Anders Skrondal. 2010. The Cambridge Dictionary of Statistics: Fourth Edition. Cambridge, England: Cambridge University Press.

[4] “Subnational HDI — Global Data Lab.” n.d. Globaldatalab.org. Accessed December 9, 2022. https://globaldatalab.org/shdi/download/lifexp/.

[5] Kim, Hae-Young. 2018. “Statistical Notes for Clinical Researchers: Covariance and Correlation.” Restorative Dentistry & Endodontics 43 (1). https://doi.org/10.5395/rde.2018.43.e4.

[6] N.d. Utmb.World. Accessed December 8, 2022. https://live.utmb.world/utmb/utmb.

[7] Rasch, Dieter, and Dieter Schott. 2018. Mathematical Statistics. Standards Information Network.

[8] Nefzger, M. D., and James Drasgow. 1957. “The Needless Assumption of Normality in Pearson’s r.” The American Psychologist 12 (10): 623–25. https://doi.org/10.1037/h0048216.

[9] Puth, Marie-Therese, Markus Neuhäuser, and Graeme D. Ruxton. 2014. “Effective Use of Pearson’s Product–Moment Correlation Coefficient.” Animal Behaviour 93: 183–89. https://doi.org/10.1016/j.anbehav.2014.05.003.

[10] Havlicek, Larry L., and Nancy L. Peterson. 1977. “Effect of the Violation of Assumptions upon Significance Levels of the Pearson r.” Psychological Bulletin 84 (2): 373–77. https://doi.org/10.1037/0033-2909.84.2.373.


Anthony Demeusy

Project Manager, Biomedical Engineer, Data science & AI enthusiast