Correlation vs. Regression: A Key Difference That Many Analysts Miss

Correlation and regression have many similarities and can often both be applied to the same data. The key quantities (r and b, respectively) are interpreted differently, but what many analysts miss is that the two actually tell us substantively different pieces of information about our data.

John V. Kane
The Stata Gallery

--

Correlation analysis (specifically, Pearson’s pairwise correlation) and regression analysis (specifically, bivariate ordinary least squares (OLS) regression) have many features in common:

  • Both are regularly applied to two continuous variables (let’s call these X and Y).
  • Both are often introduced to students using the same type of graph: a scatterplot.
  • Both are fundamentally about how deviations (that is, individual values in relation to the mean) in X are associated with deviations in Y.
  • Both assume a linear relationship between X and Y.
  • Both can be used for classical hypothesis testing, each relying on the same underlying distribution (t) and producing identical p-values.

Indeed, the popular R-squared that is obtained in bivariate OLS regression is literally just the Pearson’s correlation coefficient (r) squared.
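This identity is easy to check numerically. The following is an illustrative sketch in Python/NumPy (the article's own code is in Stata; the simulated data here are arbitrary, not the article's): fit a bivariate OLS line, compute R-squared from the residuals, and compare it to the squared Pearson correlation.

```python
import numpy as np

# Arbitrary simulated data (not the article's); any linear-ish x, y pair works
rng = np.random.default_rng(0)
x = rng.normal(50, 1, 200)
y = 0.7 * x + rng.normal(0, 1, 200)

# Pearson's correlation coefficient r
r = np.corrcoef(x, y)[0, 1]

# Bivariate OLS fit (slope b, intercept a), then R-squared from the residuals
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# In bivariate OLS, R-squared equals r squared (up to floating-point error)
print(np.isclose(r ** 2, r_squared))  # → True
```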

It’s therefore not surprising that analysts often use correlation and bivariate regression interchangeably. So what’s the difference?

Differences: Interpretive, Yes, but also Substantive

Analysts know the big difference is how we interpret the key quantities that each analysis produces. The correlation coefficient (r) that we obtain from correlation analysis is a standardized number, falling somewhere on a -1 to +1 scale (where -1 indicates a perfectly negative linear correlation, while +1 indicates a perfectly positive linear correlation) regardless of the variables we’re analyzing.

Regression, on the other hand, produces a beta coefficient (b), which can be any number, and which tells us the average change in Y given a one-unit increase in X. In other words, b is in units of the specific Y variable we are studying. As such, to make any substantive sense of b, we really need to know the details about what X and Y are and how they’re being measured.

Thus it would seem that the only real difference is in how detailed we want to get: r is on a nice, standardized scale but is a bit vague, while b is more of a mouthful to interpret but is more specific with respect to our variables.

But within these different interpretations is a much more important conceptual difference between the two — one that analysts can easily miss:

  • r is about how tightly the observations cluster around a fit line, regardless of how steep that line is.
  • b is about how steep that fit line is, regardless of how tightly the observations cluster around it.

In other words, r is really concerned with the consistency with which higher values of X tend to (linearly) correspond with higher (or lower) values of Y. b, by contrast, is concerned with how much Y is expected to change, on average, given a one-unit increase in X.

In short, the clustering of observations around a slope is not the same as the slope itself.
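The two quantities are connected by a textbook identity: the OLS slope is just r rescaled by the ratio of the two standard deviations, b = r · (sd of Y / sd of X). That rescaling is exactly why they can diverge. A quick check in Python/NumPy (simulated data for illustration only; the article's own code is in Stata):

```python
import numpy as np

# Arbitrary simulated data, purely for illustration
rng = np.random.default_rng(1)
x = rng.normal(0, 2, 500)
y = 3 * x + rng.normal(0, 5, 500)

r = np.corrcoef(x, y)[0, 1]      # Pearson's r
b = np.polyfit(x, y, 1)[0]       # bivariate OLS slope

# Standard identity: b = r * (sd_y / sd_x)
print(np.isclose(b, r * y.std() / x.std()))  # → True
```

Because b multiplies r by sd(Y)/sd(X), the same r can correspond to a shallow or a steep slope depending on how spread out Y is relative to X.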

Why is this important? If we assume, for the moment, a causal effect of X on Y, then X could hypothetically be strongly correlated with Y and yet have only a negligible effect on it.

For example, imagine a very slightly positive slope that has all of our observations tightly clustered around it. Though r and b will both be positively signed, r will be high, yet b will be low. A relatively strong correlation, yet only a weak effect.

And of course the inverse can occur: a very steep slope, but with observations falling very far above and below the fit line. Though they will both be the same sign, r will be low, yet b will be high. A relatively weak correlation, and yet a strong effect.
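Both scenarios are easy to manufacture. The sketch below (Python/NumPy, with made-up numbers; the article's own simulation, in Stata, follows in the next section) builds one tight-but-shallow relationship and one noisy-but-steep one from the same X:

```python
import numpy as np

# Two illustrative cases built from the same X (numbers are arbitrary)
rng = np.random.default_rng(2)
x = rng.normal(50, 1, 300)

# Case 1: shallow slope, very tight clustering -> high r, tiny b
y_tight = 0.1 * x + rng.normal(0, 0.02, 300)

# Case 2: steep slope, wide scatter -> lower r, large b
y_noisy = 5.0 * x + rng.normal(0, 10, 300)

for label, y in [("tight/shallow", y_tight), ("noisy/steep", y_noisy)]:
    r = np.corrcoef(x, y)[0, 1]
    b = np.polyfit(x, y, 1)[0]
    print(f"{label}: r = {r:.2f}, b = {b:.2f}")
```

The first case prints a correlation near 1 with a slope near 0.1; the second prints a noticeably weaker correlation with a slope near 5.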

A Simulated Example Using Stata

To see this a bit more concretely, we will generate some fake data (n=100) using Stata:

* Simulate data
clear
set obs 100
set seed 321

matrix C = (1, .7 \ .7, 1)

corr2data x y, means(50 70) corr(C)

gen z=x

sum x y z

* Modify z variable
replace z=z+5 if y>70
replace z=z-5 if y<70
replace z=z+1 if x>51
replace z=z-1 if x<49
replace z=z-2 if x<51 & x>49 & z<50
replace z=z+2 if x<51 & x>49 & z>50
replace z=z+1 if x<49
replace z=z-3 if x>=50.4 & x<=50.6 & z>50
replace z=z-2 if x<=49.4 & z>50
replace z=z+3 if x>=50.4 & x<=50.6 & z<50
replace z=z+2 if x>=48.5 & x<=49 & z<50
replace z=z+2 if x>=48.1 & x<=48.5 & z<50

We’ve now created three variables: X, Y, and Z. Let’s say we want to know how X relates to both Y and Z. We can look at the correlation and regression coefficients (the resulting values are noted as comments in the code below), and then make a graph:

* Examine correlation and regression coefficients

pwcorr x y z // x&y: r=.70; x&z: r=.61
reg y x // b=.70
reg z x // b=4.04

* Make graph
scatter y x, msize(large) mcolor(stgreen%80) mlcolor(lime) mlwidth(medium) jitter(10) ///
xlab( , glpattern(solid) glcolor(gs14) glwidth(thin)) ///
ylab( , glpattern(solid) glcolor(gs14) glwidth(thin)) ///
|| lfit y x, lwidth(medthick) lcolor(red%80) || ///
scatter z x, msize(large) mcolor(stblue%80) mlcolor(cyan) mlwidth(medium) jitter(10) || ///
lfit z x, lcolor(stc10%90) lwidth(medthick) ///
legend(order(1 "Y Variable" 2 "{&beta}=.70" 3 "Z Variable" 4 "{&beta}=4.04")) ///
legend(region(fcolor(gs15) lcolor(gs10))) ///
xsize(6.5) ysize(4.5) graphregion(margin(vsmall)) ///
title(Correlation vs. Regression, box fcolor(black) color(white) span bexpand) ///
scheme(white_jet) /// note: requires installing "schemepack"
text(75 48.5 "{stSans: {it:r}} = .70", placement(c) justification(left) size(medium) box fcolor(edkblue) color(white)) ///
text(50 48.5 "{stSans: {it:r}} = .61", placement(c) justification(left) size(medium) box fcolor(edkblue) color(white)) ///
xtitle(X Variable)

We then get the following graph:

The (green) dots toward the top of the graph show the relationship between X and Y. The r value is .70. This relationship also has a b value of .70. Thus, a reasonably strong, positive correlation, and an “effect” of .70, meaning that when X increases by 1, we expect Y to increase (on average) by .70. This effect is represented by the (red) fit line going through the data points.

But now look at the (blue) dots in the bottom half. They look quite messy: they do not cluster nearly as neatly around the (orange) fit line as the dots in the top half do. Relative to the top, they are widely dispersed around the fit line, indicating that there are relatively more instances in which going from one value of X to a higher value of X corresponds with a lower value of Z, and vice versa. As a consequence, our correlation (r) is lower, now only .61 compared to the .70 on top.

But now notice the relative steepness of the slope on the bottom. Increasing the value of X carries with it quite a large change, on average, in the expected value of Z. Thus, we see a remarkably large slope: b=4.04. This means that for every one-unit increase in X, we expect, on average, an increase of 4.04 in Z. That is a substantially steeper slope than the b=.70 effect we see on top, and yet the correlation (r) is weaker on the bottom.

Again, how well the data points cluster around a slope is not the same thing as the slope itself.

Implications for Assessing Substantive/Practical “Significance”

This distinction becomes especially important when we have to decide how to communicate the “substantive” (aka, “practical”) significance of our result. We have simple rules of thumb for doing this with correlation: an r value of +/- .60 or larger is considered a “strong” relationship, while an r value of +/- .30 or less is considered a “weak” relationship. So it might be tempting to simply default to r when talking about substantive significance.

Yet as I demonstrated above, r might tell a misleading story about the substantive significance, especially when what we really care about is an effect size. The latter is better communicated by b, not r.

Similarly, analysts often like to compare across groups. As such, we might be tempted to compare r values across different groups to find which group exhibits the stronger relationship (e.g., looking at the relationship between education and income across genders).

Yet here, again, we could come away with a misleading conclusion: to find that group A has a stronger correlation between X and Y than group B, does not necessarily mean that X has a larger effect on Y in group A than in group B. The opposite could well be true.

Conclusion

Correlation and regression analysis are not merely different ways of saying the same thing. The hypothetical examples featured in this article illustrate a key conceptual difference between the two: the difference between the clustering of observations around a slope versus the slope itself.

In real-world data analysis, though, r and b will usually tell us essentially similar stories about our data. That is, most of the time, we probably won’t come away with dramatically different impressions of how X and Y are related, regardless of whether we use correlation or bivariate regression.

Nevertheless, there’s no guarantee: depending on the specific data we’re looking at, r and b may tell us somewhat different stories (e.g., that there’s a strong correlation yet a fairly weak effect). As such, we need to be mindful of this key conceptual difference between r and b.

What to do? In practice, it’s important to think about which quantity — r or b — is more theoretically important; more capable of answering the specific question we are asking. This criterion should be what primarily guides our choice about whether to report r or b.

Of course, if we can’t decide, or if both seem relevant to the question at hand, then simply report both and let the reader decide 😊

About the Author

John V. Kane is Clinical Associate Professor at the Center for Global Affairs and an Affiliated Faculty member of NYU’s Department of Politics. He received his Ph.D. in political science and his primary research interests include public opinion, political psychology, and experimental research methodology. His research has been published in a variety of top-ranking peer-reviewed journals, including the American Political Science Review, American Journal of Political Science, the Journal of Politics, and the Journal of Experimental Political Science. His research has been featured in numerous media outlets, including The New York Times, The Washington Post, and National Public Radio. He has taught graduate courses on political psychology, research methods, statistics and data analysis, and has also received teaching excellence awards from both New York University and Stony Brook University. His website is www.johnvkane.com. You can follow him on X/Twitter, ResearchGate, LinkedIn, and/or BlueSky Social.
