Statistical Significance Tests on Correlation Coefficients

Philipp Singer
Oct 25, 2013


Originally published at my old WordPress blog.

Recently, I had to determine whether two calculated correlation coefficients are statistically significantly different from each other. Basically, there are two scenarios: (i) you want to compare two dependent correlations, or (ii) you want to compare two independent correlations. I want to cover both cases and present methods for determining statistical significance.

Two independent correlations

This use case applies when you have two correlations that come from different samples and are independent of each other. An example would be that you want to know whether height and weight are correlated in the same way in two distinct social groups. The following figure illustrates such a case:

[Figure: two independent correlations, X-Y in one sample and A-B in another]

We could think of X and Y as representing height and weight for the first social group, and A and B as representing them for the second. The correlations share no variables, hence they are not dependent, and we can use an independent significance test.

Two dependent (overlapping) correlations

The (in my view) much more interesting case is when you want to determine statistical significance between two dependent correlations. To give you an example, I want to describe a use case I was once confronted with. I was calculating semantic relatedness scores between concepts. However, it is very difficult to find a good way of evaluating the scores you produce. There is, though, one widely used evaluation dataset called WordSimilarity-353. This gold standard consists of 353 word pairs together with relatedness scores judged by humans. Hence, you take the same word pairs, calculate semantic relatedness scores with your own method, and finally compute the correlation coefficient between both score vectors, which represents the accuracy of your method.

However, there exists a large array of well-performing methods. One of the best, for example, is called ESA [1] and reaches a Spearman rank correlation of 0.75. Now, suppose your method achieves a correlation of 0.76 and you want to judge whether this is a statistically significant improvement. This is a legitimate question, especially given the small size of the gold standard dataset. As both methods compute correlation coefficients against the same gold standard, we have to deal with dependent correlations. This is visualized in the following figure:

[Figure: two dependent (overlapping) correlations XY and XZ that share the variable X, plus the correlation YZ between the remaining variables]

We could think of X as being the WordSimilarity-353 gold standard, Y as being our results, and Z as being those of the ESA method. We are interested in whether the correlation XY (representing the accuracy of our method) is statistically significantly different from XZ (representing the accuracy of the ESA method). However, to calculate this, we also need to know the correlation coefficient between Y and Z.
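To make the setup concrete, here is a minimal sketch (with made-up numbers, not real WordSimilarity-353 data) of how one would obtain the three Spearman correlations the dependent test needs: XY, XZ, and YZ.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=353)        # gold-standard human judgments (hypothetical)
y = x + rng.normal(0, 2.0, size=353)    # scores from our method (hypothetical)
z = x + rng.normal(0, 2.1, size=353)    # scores from the competing method (hypothetical)

r_xy = spearmanr(x, y).correlation  # accuracy of our method
r_xz = spearmanr(x, z).correlation  # accuracy of the other method
r_yz = spearmanr(y, z).correlation  # needed because both methods are evaluated on the same gold standard

print(r_xy, r_xz, r_yz)
```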

Methods

In the past, I have come across two ways of calculating statistical significance for the abovementioned cases. The first one is described in detail in the book Statistical Methods for Psychology [2], which presents solutions for both the independent and the dependent case. For the independent case, one basically applies Fisher's z-transformation to the correlation coefficients [3] and then tests the null hypothesis that ρ1 − ρ2 = 0. The dependent case is a bit more complicated. Yet, the book presents a method by Steiger [4] that incorporates a term describing how the two correlations are themselves correlated. A working implementation of both methods exists in the form of an R package.
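To illustrate both tests, here is a minimal Python sketch: Fisher's z-test for two independent correlations and a Williams/Steiger-style t-test for two dependent, overlapping correlations, roughly as presented in Howell's book. The function names and example numbers are my own, and the formulas should be checked against [2] and [4] before serious use.

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def independent_corr_test(r1, n1, r2, n2):
    """Fisher z-test: are two correlations from independent samples different?"""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)          # Fisher z-transformation
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))    # standard error of z1 - z2
    z = (z1 - z2) / se
    p = 2 * norm.sf(abs(z))                          # two-sided p-value
    return z, p

def dependent_corr_test(rxy, rxz, ryz, n):
    """Williams/Steiger t-test: compare rxy and rxz, which share the variable x."""
    d = rxy - rxz
    det = 1 - rxy**2 - rxz**2 - ryz**2 + 2 * rxy * rxz * ryz  # determinant of the 3x3 correlation matrix
    rbar = (rxy + rxz) / 2.0
    t_stat = d * np.sqrt((n - 1) * (1 + ryz) /
                         (2 * ((n - 1) / (n - 3)) * det + rbar**2 * (1 - ryz)**3))
    p = 2 * t_dist.sf(abs(t_stat), n - 3)            # two-sided, n - 3 degrees of freedom
    return t_stat, p

# Hypothetical numbers from the example above: 0.76 vs. 0.75 on 353 word pairs,
# assuming the two methods' score vectors correlate at, say, 0.80.
print(dependent_corr_test(0.76, 0.75, 0.80, 353))
```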

The second method is by G. Zou [5] and also covers both the dependent and the independent case. Its advantage is that it acknowledges the asymmetry of the sampling distribution of a single correlation, and it only requires confidence intervals. The result is a confidence interval for the difference; one can reject the null hypothesis of no difference if the interval does not include zero. R code for this method is available as well.
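Below is a minimal sketch of Zou's approach for the independent case; the dependent (overlapping) case additionally requires the correlation between the two estimates, which I leave to the cited paper and the R code. Function names and example numbers are mine.

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, alpha=0.05):
    """Confidence interval for a single correlation via Fisher's z-transformation."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    zcrit = norm.ppf(1 - alpha / 2)
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)

def zou_independent_ci(r1, n1, r2, n2, alpha=0.05):
    """Zou-style confidence interval for the difference of two independent correlations."""
    l1, u1 = fisher_ci(r1, n1, alpha)
    l2, u2 = fisher_ci(r2, n2, alpha)
    lower = r1 - r2 - np.sqrt((r1 - l1)**2 + (u2 - r2)**2)
    upper = r1 - r2 + np.sqrt((u1 - r1)**2 + (r2 - l2)**2)
    return lower, upper

# If the resulting interval excludes zero, the two correlations differ significantly.
print(zou_independent_ci(0.76, 353, 0.75, 353))
```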

As I mainly work with Python, I tackled the lack of a Python implementation of all these methods. Hence, I put a working Python script online that builds upon the abovementioned citations and R code. I hope this helps someone; if questions come up, please feel free to ask them here or on the GitHub page. Note, though, that being able to compare two correlation coefficients does not necessarily mean it is a good idea; whether it is depends strongly on the use case. For a short discussion of this topic, I want to refer to a blog post.

[1] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness using Wikipedia-based explicit semantic analysis,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007, pp. 1606–1611.
[2] D. C. Howell, Statistical methods for psychology, Cengage Learning, 2011.
[3] R. A. Fisher, “On the probable error of a coefficient of correlation deduced from a small sample,” Metron, vol. 1, pp. 3–32, 1921.
[4] J. H. Steiger, “Tests for comparing elements of a correlation matrix.,” Psychological bulletin, vol. 87, iss. 2, p. 245, 1980.
[5] G. Y. Zou, “Toward using confidence intervals to compare correlations.,” Psychological methods, vol. 12, iss. 4, pp. 399–413, 2007.
