Estimating Correlations through Constant Rate or Monotonicity. But what is the Question?

Freedom Preetham
Mathematical Musings
6 min read · Oct 20, 2022

When you analyze correlation relationships, you often want to know how your independent variable affects your dependent variable. But the trick is to check whether you are looking for answers to the right question.

While many definitions are available for correlation coefficients, I haven't seen many explanations driving the definitions by focusing on “What is the right question to ask?”. Hence, I have taken a question-first approach to help develop the right intuition to choose the correct statistical model.

As a quick refresher, there are two common ways to check correlation relationships: the Pearson coefficient and the Spearman coefficient.

But Spearman and Pearson can give you two different numbers for the same data plotted on a graph, and you may wonder why. This is exactly why you should understand the foundations and intuition behind each statistical tool and the question you are trying to ask.

First, let's build the statistical context needed to understand which question we are asking of the data.

There are three different questions you can ask, and each calls for a different approach. Let me summarize them first.

  1. For each change in the independent variable, what is the average change in the dependent variable?
  2. What proportion of the variation in the dependent variable is attributable to the independent variable?
  3. How strongly is the dependent variable associated with the independent variable, and in which direction is the movement of data points?

Let’s deep dive into answering them.

1. For each change in the independent variable, what is the average change in the dependent variable?

Here you are looking to predict the value of the dependent variable based on the independent variable. Note that the prediction is a conditional mean: the expected value of the dependent variable, given the value of the independent variable.

Simple Linear Regression (NOT correlation) is the statistical tool that answers the above question.

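Your Simple Linear Regression equation is as follows:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

and your slope estimate is computed as follows:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

Once you have the slope, you can get the Y-intercept by plugging in the mean values of X and Y from your dataset: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.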
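The slope and intercept computation can be sketched in plain Python. This is a minimal illustration, not a full regression routine: the function name `fit_line` and the sample data are hypothetical, chosen so the points lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The slope is the answer to the question: each unit increase in X goes with an average change of `slope` units in Y.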

2. What proportion of the variation in the dependent variable is attributable to the independent variable?

This CANNOT be answered directly by either the regression slope or the correlation coefficient.

The R-Squared of your regression is the correct answer to the above question. Here you are looking for the coefficient of determination: the proportion of the variance in the dependent variable that is explained by the independent variable.
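As a rough sketch, R-Squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the data below are hypothetical and only illustrate the arithmetic.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Fit the least-squares line first.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Variation left unexplained by the line versus total variation in y.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data with some scatter around an upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(round(r2, 3))  # 0.6
```

Here 60% of the variation in Y is attributable to X under the fitted line; the remaining 40% is residual scatter.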

3. How strongly is the dependent variable associated with the independent variable, and in which direction is the movement of data points?

This is the only question for which you should use a Correlation Analysis.

Typically you want the statistical model that supports the strongest inferences about the data. Parametric statistical models do this job best, but they require stricter compliance with statistical assumptions. Mainly,

  • The data follow a normal distribution.
  • The data demonstrate homogeneity of variance.
  • The observations are independent, with no autocorrelation.

If your data breach any of these strict requirements, then you resort to non-parametric models.

  • Parametric: Pearson Correlation Coefficient.
  • Non-Parametric: Spearman Correlation Coefficient.

Even if your data meet these statistical assumptions, you cannot use Pearson's correlation if either variable is ordinal rather than continuous. You must use Spearman's here.

The most crucial inference when dealing with correlations.

Apart from the above distinctions, it is essential to keep the following inferences in mind when using correlations.

Pearson: Pearson's model can analyze only the linear relationship between the dependent and independent continuous variables.

This means that a change in one variable is matched by a proportional change in the other, at a constant rate.
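For reference, the standard sample form of Pearson's coefficient can be written as follows. The symbols ($x_i$, $y_i$ for the paired observations, $\bar{x}$, $\bar{y}$ for their means, $n$ for the sample size) are notation chosen for this sketch.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator works with the raw deviations of each value from its mean, which is why $r$ reaches 1 or -1 only when the points fall exactly on a straight line.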

Visual Inference for Pearson.

Spearman: Spearman correlation evaluates a monotonic relationship between the dependent and independent variables which are either continuous or ordinal.

Monotonicity means that a function is either entirely non-increasing or entirely non-decreasing. The rate of change does not have to be constant.

Note that the numerator in Spearman's evaluation uses the difference between the two ranks of each observation. Pearson's evaluation, by contrast, uses the deviations of the raw values from their means.
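A common way to write Spearman's coefficient in terms of rank differences is shown below. It assumes there are no tied values; with ties, the usual approach is to compute Pearson's coefficient on the ranks instead. The symbols $d_i$ and $n$ are notation chosen for this sketch.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)},
\qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)
```

When every observation holds the same rank in both variables, all $d_i$ are zero and $\rho = 1$, regardless of how the raw values are spaced.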

Visual Inference for Spearman:

Revisiting the Graph

In this graph, the Pearson correlation tells us how strongly the dependent variable is linearly associated with the independent variable, while Spearman's coefficient simply tells us that the relationship is strictly monotonically increasing, with no deviation in rank order (that is, a monotonic curve fits the data perfectly).
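The same contrast can be reproduced with a small sketch. The data here are hypothetical (an exponential curve, not the data behind the graph), and the helper names `pearson`, `ranks`, and `spearman` are illustrative; Spearman is computed as Pearson on the ranks, assuming no ties.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) upward; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation computed as Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: strictly increasing but far from linear.
xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]
r = pearson(xs, ys)
rho = spearman(xs, ys)
print(round(r, 3), round(rho, 3))
```

Spearman comes out at 1 because the rank order is perfectly preserved, while Pearson is well below 1 because the rate of change is not constant.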

You now wonder, when does Spearman’s coefficient have a positive value less than 1? And what would it mean?

Here is a visual inference.

Here Spearman's correlation is 0.92, which shows that the data are mostly monotonically increasing, but not perfectly so. This happens when scatter in the data pushes some points out of rank order.
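A small sketch shows how a few out-of-order points pull Spearman's coefficient below 1. The data are hypothetical and do not reproduce the 0.92 from the figure; the function `spearman_no_ties` is an illustrative name and uses the rank-difference form, which assumes no tied values.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_no_ties(xs, ys):
    """Spearman's rho from squared rank differences (valid without ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: increasing overall, with two adjacent pairs swapped.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 30, 20, 40, 60, 50, 70, 80]
rho = spearman_no_ties(xs, ys)
print(round(rho, 3))  # 0.952
```

Each swapped pair contributes a nonzero rank difference, so the coefficient stays positive and high but drops below 1.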

What about outliers?

One of the places where Pearson's and Spearman's differ most is when the data have outliers.

[From Wikipedia]

When the data are roughly elliptically distributed, and there are no prominent outliers, the Spearman correlation and Pearson correlation give similar values.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers that are in the tails of both samples. That is because Spearman’s ρ limits the outlier to the value of its rank.
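The effect can be illustrated with a small sketch. The data are hypothetical: nine weakly related points plus one extreme point in the tails of both variables. The helper names are illustrative, and Spearman is computed as Pearson on the ranks, assuming no ties.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) upward; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


# Hypothetical data: the last point (100, 100) is a strong outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
ys = [5, 3, 8, 1, 9, 2, 7, 4, 6, 100]

r = pearson(xs, ys)
rho = pearson(ranks(xs), ranks(ys))  # Spearman: Pearson on the ranks
print(round(r, 3), round(rho, 3))  # 0.993 0.345
```

The single outlier drives Pearson close to 1, while Spearman only sees it as rank 10 in both variables and stays modest.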

Summary

It is very important to understand what question you are asking so that you can choose the right statistical model.

QUESTION: For each change in the independent variable, what is the average change in the dependent variable?

ANSWER: Linear Regression

QUESTION: What proportion of the variation in the dependent variable is attributable to the independent variable?

ANSWER: R-Squared

QUESTION: How strongly is the dependent variable associated with the independent variable, and in which direction is the movement of data points?

ANSWER: Correlation Analysis. But we should first know:

  • Whether the data pass the statistical assumptions for parametric evaluation.
  • Whether the input variables are continuous or ordinal.
  • Whether we are looking for a rank-order (monotonic) correlation or a constant-rate (linear) one.

I hope this helps.
