Averaging Correlations — Part I

Replication and wide extension of Corey, Dunlap, & Burke (1998)

Jan Seifert
Apr 5, 2020

It is not possible to average correlation coefficients. During my studies I found that statement in every textbook: do not average correlations. In this series of stories I will explore the performance of several correction methods and extend previously published results.

This first episode introduces the approach and investigates Fisher’s and Hotelling’s z (Fisher, 1921; Hotelling, 1953) with averaged correlations taken from a correlation matrix based on normally distributed samples. Further episodes will extend the findings with other correction methods and non-normal sample distributions.

Introduction

Mathematically, it is true: the mere average of several correlation coefficients makes no sense. The “simple” way to say it is that correlations are not additive. But sometimes the practitioner needs a way out. In my case, I was simply interested in an estimate of the total correlation across several samples. The simple and accurate solution would be to ignore the sample correlations and calculate a new correlation across all samples. In my case, however, the original samples get lost after some time because our data storage capacities make it impossible to keep everything. So what happens when we average correlations anyway?

The problem is that our results will be biased. When we average values from different data sets our purpose is to reduce random fluctuations in the data and get closer to the true value. But with correlation coefficients we will not get closer to the true correlation. An average of correlations will be biased.

The heat map on the left illustrates the issue. The simulation created data sets, computed the correlations, and averaged them. It did that for different numbers of data sets (3–10) and different numbers of samples within each set (10–50). Each combination of data sets and sample size was repeated 50 thousand times. Each square of the heat map shows the difference between the theoretical correlation the simulation put into the data sets (called ρ, the Greek letter rho) and the resulting average of correlations r. The simulation reveals a substantial difference between the observed average r and what it is supposed to be, as large as 0.023. The average clearly and systematically underestimates the true correlation. This bias changes with the number of samples in each data set but is independent of the number of data sets.
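To make the setup concrete, here is a minimal sketch of the idea in Python. It is a reduced bivariate version with far fewer replications than the 50 thousand used above; the actual simulation code lives in the repository linked at the end, and the function name and defaults here are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def naive_average_bias(rho=0.5, n_samples=10, n_datasets=5, reps=5_000):
    """Estimate the bias of naively averaging Pearson correlations.

    For each replication, draw `n_datasets` independent bivariate normal
    samples of size `n_samples` with true correlation `rho`, correlate
    each sample, and average the resulting r values.
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    mean_rs = np.empty(reps)
    for i in range(reps):
        rs = []
        for _ in range(n_datasets):
            x = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
            rs.append(np.corrcoef(x[:, 0], x[:, 1])[0, 1])
        mean_rs[i] = np.mean(rs)
    # positive value = the average underestimates the true correlation
    return rho - mean_rs.mean()

print(naive_average_bias())  # roughly 0.02 for rho = 0.5 and N = 10
```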

There clearly is a bias. Still, I needed to aggregate correlations for a project and I remembered the Fisher z transformation, which is supposed to alleviate it (Fisher, 1921). When we transform the correlations, average them, and then apply the inverse transformation, we get a better estimate. After some additional digging I found a statement that the problem would be negligible in many cases. Even without Fisher z, “the absolute bias becomes negligible (less than .01) for a sample size greater than 20” (Bishara & Hittner, 2015). Thus, if sample sizes are large enough the bias only shows in the third decimal place.
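For readers who want to try the recipe, here is a minimal sketch in Python (my own illustration, not the repository code). Fisher’s z is simply z = atanh(r) = ½·ln((1 + r)/(1 − r)), so numpy’s arctanh and tanh do the whole job.

```python
import numpy as np

def fisher_avg(rs):
    """Average correlations via Fisher z: transform, average, back-transform."""
    z = np.arctanh(np.asarray(rs))  # Fisher z = atanh(r)
    return np.tanh(z.mean())        # inverse transform of the mean z

# Example: three sample correlations from independent data sets
print(fisher_avg([0.42, 0.55, 0.48]))
```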

There are many other correction methods. One by Hotelling (1953) will be part of this first story. Another has been suggested by Olkin & Pratt (1958) and will be the focus of the next issue. I am doing this to shed some more light on the distribution and strength of the bias that remains after correction.

The Fisher z Transformation

Another run of simulations used the Fisher z correction. This simulation replicates Corey et al. (1998) by calculating and averaging correlations while varying three parameters:

  1. A range of correlations ρ from 0.00 to 0.95 in steps of 0.05.
  2. Averaging 3 to 45 correlations, i.e. all pairwise correlations among 3 to 10 intercorrelated data sets.
  3. The number of score pairs (N) making up a single correlation, varied from 10 to 50 in steps of 10.

The bias is shown as the difference between the theoretical correlation ρ (Rho) and the averaged sample correlations.
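One cell of that grid might look like the following Python sketch, again a simplified stand-in for the repository code: D equicorrelated variables with true correlation ρ and N score pairs, with all D(D−1)/2 pairwise correlations averaged, both naively and via Fisher z. The equicorrelation covariance matrix is my assumption about how “intercorrelated data sets” are generated.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def grid_cell_bias(rho, n_datasets, n_scores, reps=2_000):
    """Bias for one grid cell: true rho, D data sets, N score pairs each."""
    cov = np.full((n_datasets, n_datasets), rho)
    np.fill_diagonal(cov, 1.0)
    naive = np.empty(reps)
    fisher = np.empty(reps)
    for i in range(reps):
        x = rng.multivariate_normal(np.zeros(n_datasets), cov, size=n_scores)
        rs = [np.corrcoef(x[:, a], x[:, b])[0, 1]
              for a, b in combinations(range(n_datasets), 2)]
        naive[i] = np.mean(rs)                          # uncorrected average
        fisher[i] = np.tanh(np.mean(np.arctanh(rs)))    # Fisher z average
    return rho - naive.mean(), rho - fisher.mean()

print(grid_cell_bias(rho=0.5, n_datasets=5, n_scores=10))
```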

The line chart is the equivalent of Figure 1 in Corey et al. (1998) and replicates their findings. It compares the uncorrected correlations with the Fisher z transformation. The bias is largest for intermediate values of ρ. After correction, the maximum bias moves down to around ρ = 0.3.

As expected, Fisher z performs a lot better in general. The total range of the bias is only slightly smaller, but the values are distributed more symmetrically around zero. Fisher’s average deviation from zero (zad) is one third of the uncorrected one. It is not surprising that Fisher z returns a result that is closer to zero in 753 of 800 cases (see CtZ in Table 1). Clearly, Fisher z provides better results than the uncorrected averages.

Table 1

The violin plots illustrate that clearly. Fisher z has a stronger body and is more symmetrical. The horizontal lines indicate the 5%, 50%, and 95% quantiles. Clearly, the values of Fisher z lie closer together and closer to zero.

The heat map illustrates the bias that remains with a Fisher z correction. It gets smaller with sample size and changes its pattern with the number of data sets (D). For small D, Fisher z increasingly underestimates ρ as ρ gets larger. The more data sets we have, the more we overestimate ρ, which is most pronounced around ρ = 0.35. But the most interesting message is this: the more samples we have in each data set, the better our results will be using Fisher z.

Hotelling

Hotelling’s correction does not receive much attention in the literature. Alexander, Hanges, & Alliger (1985) concluded that Hotelling’s transformation is not superior to Fisher’s z. Their focus was on the variance of the transformed variable, whereas I am interested in accurate means, and it turned out to be tricky to get those. Another reason why nobody adopted the approach may be the lack of an inverse transformation. The inverse of the Fisher z transformation is documented on Wikipedia; an inverse of the Hotelling transformation was nowhere to be found, and I had to invent my own approach (see the repository).
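For illustration, here is one way such an inverse can be obtained numerically in Python. The formulas for z* and z** follow the form commonly cited in the secondary literature (e.g., Alexander et al., 1985); the repository contains the version actually used in the simulations, and the root-finding inverse below is my hypothetical stand-in, not necessarily the approach taken there.

```python
import numpy as np
from scipy.optimize import brentq

def hotelling_z(r, n, order=1):
    """Hotelling's modified z (1953). order=1 gives z*, order=2 gives z**.
    NOTE: formulas as commonly cited in secondary sources; verify against
    the repository before relying on them."""
    z = np.arctanh(r)
    zs = z - (3.0 * z + r) / (4.0 * (n - 1))
    if order == 2:
        zs -= (23.0 * z + 33.0 * r - 5.0 * r**3) / (96.0 * (n - 1) ** 2)
    return zs

def hotelling_z_inverse(target, n, order=1):
    """Invert Hotelling's z by root-finding, since there is no documented
    closed-form inverse (illustrative stand-in for the repository's approach)."""
    return brentq(lambda r: hotelling_z(r, n, order) - target,
                  -0.999999, 0.999999)

# Round trip: r -> z* -> r
z_star = hotelling_z(0.5, n=20)
print(hotelling_z_inverse(z_star, n=20))  # ~0.5
```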

Let us compare Hotelling with Fisher z. At first glance, it looks like Hotelling only changes the bias without doing much to alleviate it. It is extremely difficult to come to a conclusion visually.

The violin plots show the similarity of the two correction methods. Hotelling does not really add much precision; the numbers do not differ within the first three decimal places. Only when we count the simulations that yielded a result closer to zero (CtZ) do we have to conclude that Fisher is, on average, superior to Hotelling.

The patterns of the bias did not deviate much from Fisher z, which is why I decided not to show any more details. I did give it many chances to prove itself: I tried the two versions of Hotelling’s transformation (z* and z**, see page 224 in Hotelling, 1953), and it took me quite some time to find a solution for the inverse function to get back to correlations after averaging Hotelling z values. I think it is safe to conclude (along with Alexander et al., 1985) that Hotelling’s z does not add accuracy to the Fisher z transformation, neither for the correlation means nor for the variances.

Table 2

Summary

This first episode re-established the decent performance of the well-known Fisher z correction. What has never been shown in this much detail is the pattern of the bias. The heat map above gives an excellent overview of the bias that remains after Fisher z.

Let us also finally close the coffin on the z-transform by Hotelling. It adds complexity without accuracy.

Please note that all these correction methods are based on three assumptions: 1) bivariate normality, 2) independence of observations, and 3) larger sample sizes. This first episode only challenges the third assumption. The first assumption will be the focus of further investigations. Before that, however, the next episode will provide more details on a family of corrections by Olkin & Pratt (1958).

Reference and Further Reading

The code for the simulations can be found in the GitHub repository “Averaging Correlations”. Feel also free to explore the data on Tableau Public.

Alexander, R. A. (1990). A note on averaging correlations. Bulletin of the Psychonomic Society, 28(4), 335–336. https://doi.org/10.3758/BF03334037

Alexander, R. A., Hanges, P. J., & Alliger, G. M. (1985). An empirical examination of two transformations of sample correlations. Educational & Psychological Measurement, 45(4), 797–801.

Bishara, A. J., & Hittner, J. B. (2015). Reducing bias and error in the correlation coefficient due to nonnormality. Educational and Psychological Measurement, 75(5), 785–804. https://doi.org/10.1177/0013164414557639

Corey, D. M., Dunlap, W. P., & Burke, M. J. (1998). Averaging correlations: Expected values and bias in combined Pearson rs and Fisher’s z transformations. The Journal of General Psychology, 125(3), 245–261. https://doi.org/10.1080/00221309809595548

Fisher, R. A. (1921). On the ’probable error’ of a coefficient of correlation deduced from a small sample. Metron, 1, 1–32. Retrieved from https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15169/1/14.pdf

Garcia, E. (2012). The Self-Weighting Model. Communications in Statistics — Theory and Methods, 41(8), 1421–1427. https://doi.org/10.1080/03610926.2011.654037

Hotelling, H. (1953). New light on the correlation coefficient and its transformations. Journal of the Royal Statistical Society, 15(2), 193–225.

Olkin, I., & Pratt, J. (1958). Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics, 29. https://doi.org/10.1214/aoms/1177706717
