
How to Measure Relationships Between Variables: Simplified Methods

May 10, 2024

Hello, I hope you’re doing well. In this article, I aim to provide a simple and intuitive guide to measuring the relationships among variables.

Correlations are crucial for analysis for several reasons:

  1. Pattern Recognition: Correlations reveal underlying patterns and relationships in data.
  2. Prediction: Correlations aid in forecasting future outcomes.
  3. Decision-Making: They inform decision-making processes across various domains.
  4. Risk Management: In finance, correlations help manage portfolio risk.
  5. Research: Correlations are crucial for hypothesis testing and exploring relationships between variables.

Let’s delve into an example to illustrate this concept clearly. Consider a dataset with five features: Date of Birth, Education, Years of Experience, Name of City, and Distance from Office.

Among these features, it’s evident that there exists a relationship between the Name of City and the Distance from Office. After all, the distance implicitly reveals information about the city where the person resides. Consequently, we can eliminate one of these features without significantly compromising the dataset’s integrity.

This reduction in dimensionality proves invaluable, particularly for handling large datasets efficiently. Now, let’s explore a few simple methods for measuring these relationships:

  • Covariance
  • Pearson correlation coefficient (PCC)
  • Spearman rank correlation coefficient (SRCC)

Covariance:

Suppose we have two random variables: the heights and weights of a group of students.

[Image: a sample dataset of student heights and weights]

Looking at this dataset, does height have a relationship with weight? Intuitively, yes: in most cases, as height increases, weight also increases. But how do you measure that? This is where covariance comes in.

Covariance measures the degree to which two variables change together. A positive covariance indicates that as one variable increases, the other tends to increase as well, and vice versa for negative covariance.

Cov(X, Y) = (1 / n) · Σ (Xᵢ − X̄)(Yᵢ − Ȳ)

Here, n is the total number of data points, Xᵢ and Yᵢ are the individual values of the two features, and X̄ and Ȳ are their mean values. Note that, by this formula, Cov(X, X) is simply Var(X).
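To make this concrete, here is a minimal Python sketch (the heights and weights are made-up illustrative values) that computes the covariance directly from the definition and checks it against NumPy:

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg) for five students
heights = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
weights = np.array([50.0, 58.0, 63.0, 70.0, 78.0])

# Covariance from the definition: the mean of (Xi - X̄)(Yi - Ȳ)
manual_cov = np.mean((heights - heights.mean()) * (weights - weights.mean()))

# np.cov with bias=True uses the same 1/n normalisation as the formula above
library_cov = np.cov(heights, weights, bias=True)[0, 1]

print(manual_cov, library_cov)  # both positive, and equal
```

The positive value reflects the fact that taller students in this toy dataset also tend to be heavier.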

[Image: worked examples of positive and negative covariance]

As the examples show, when both X and Y increase or decrease together, the covariance is positive. Conversely, when X increases while Y decreases, or vice versa, the covariance is negative, exactly as the formula suggests.

Covariance is influenced by the choice of units of measurement. For instance, the covariance between X (in cm) and Y (in kg) will differ from that of X (in feet) and Y (in lbs), despite representing the same dataset.
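A quick sketch of this unit sensitivity, again with hypothetical numbers: converting the very same measurements from cm/kg to feet/lbs produces a different covariance value.

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
weights_kg = np.array([50.0, 58.0, 63.0, 70.0, 78.0])

# The same measurements expressed in different units
heights_ft = heights_cm / 30.48    # cm -> feet
weights_lb = weights_kg * 2.20462  # kg -> lbs

cov_metric = np.cov(heights_cm, weights_kg, bias=True)[0, 1]
cov_imperial = np.cov(heights_ft, weights_lb, bias=True)[0, 1]

print(cov_metric, cov_imperial)  # different numbers, same underlying data
```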

While the sign of covariance informs us about the direction of the relationship between variables, it doesn’t quantify the strength of correlation. This aspect is addressed in the next method, known as Pearson Correlation Coefficient (PCC).

PCC:

The Pearson Correlation Coefficient (PCC) formula is used to quantify the linear relationship between two continuous variables, X and Y. It ranges from -1 to +1, where:

  • +1 indicates a perfect positive linear relationship,
  • -1 indicates a perfect negative linear relationship, and
  • 0 indicates no linear relationship.
  • Intermediate values indicate the strength and direction of the linear relationship.

The formula is a slight modification of the covariance formula: the covariance is divided by the product of the standard deviations of X and Y, which normalises the result to the range [−1, +1]:

PCC(X, Y) = Cov(X, Y) / (σX · σY)
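As a sketch, the PCC can be computed by hand from the covariance and standard deviations, then cross-checked against NumPy’s built-in np.corrcoef (the sample values are hypothetical):

```python
import numpy as np

x = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
y = np.array([50.0, 58.0, 63.0, 70.0, 78.0])

# PCC = Cov(X, Y) / (std(X) * std(Y)), all with the same 1/n normalisation
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
pcc = cov_xy / (x.std() * y.std())

# Cross-check against NumPy's correlation matrix
print(pcc, np.corrcoef(x, y)[0, 1])  # the two values agree
```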

Here are the PCC values for a few example sets of data points:

[Image: example datasets with their PCC values]

One drawback of the Pearson Correlation Coefficient (PCC) is that it only measures linear relationships between variables. This means that if the relationship between two variables is not linear, the PCC may not accurately capture the strength or direction of the relationship.
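A minimal illustration of this drawback: for the perfectly deterministic but non-linear relationship y = x², sampled on a symmetric range, the PCC comes out near zero even though y is fully determined by x.

```python
import numpy as np

# A perfect but non-linear relationship: y = x^2 on a symmetric range
x = np.linspace(-5, 5, 101)
y = x ** 2

# The positive and negative halves cancel out in the covariance
print(np.corrcoef(x, y)[0, 1])  # approximately 0.0
```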

[Image: a non-linear relationship that the PCC fails to capture]

Lastly, unlike covariance, the PCC is dimensionless: rescaling the units of measurement (say, from centimetres to feet) leaves it unchanged. Its remaining weakness, shared with covariance, is sensitivity to outliers: a few extreme points can substantially distort the coefficient.

SRCC:

PCC is very effective when the relationship between variables is linear, but it has limitations when it comes to non-linear relationships. These limitations are handled more gracefully by SRCC, or Spearman’s Rank Correlation Coefficient, which assesses the monotonic relationship between variables rather than a strictly linear one. It’s important to note that while SRCC is more robust to non-linearity, it still cannot accurately capture non-monotonic curves, such as sine curves.

A shape that is “monotonically non-decreasing” (always flat or rising), for example, is handled well by SRCC.

Instead of calculating the PCC directly on the data points (X and Y), SRCC first converts each variable into ranks and then calculates the PCC on those ranks, which makes it much less sensitive to the scale of the variables. For example, the values (10, 40, 25) would be replaced by their ranks (1, 3, 2) before the correlation is computed. Equivalently, when all ranks are distinct, SRCC = 1 − 6 Σ dᵢ² / (n(n² − 1)), where dᵢ is the difference between the two ranks of point i.
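A small sketch of this rank-based idea, using a hypothetical monotonic but non-linear dataset: computing the PCC on the ranks reproduces SciPy’s spearmanr exactly.

```python
import numpy as np
from scipy import stats

# A monotonic but non-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)  # strictly increasing, but far from a straight line

# SRCC = PCC computed on the ranks of the data
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
srcc_via_ranks = np.corrcoef(rank_x, rank_y)[0, 1]

pcc, _ = stats.pearsonr(x, y)
srcc, _ = stats.spearmanr(x, y)

print(pcc)                    # noticeably below 1
print(srcc, srcc_via_ranks)   # both exactly 1.0
```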

So it doesn’t matter whether the relationship is linear or not; as long as it is monotonic (consistently increasing or decreasing), SRCC captures it easily. In the example below, excluding a few outliers, the trend is clearly increasing. PCC yields a low value because of those outliers, whereas SRCC handles them well and reports a much higher value.
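Here is an illustrative sketch with synthetic data: a clear increasing trend with two injected outliers, compared under pearsonr and spearmanr.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A clearly increasing trend...
x = np.arange(50, dtype=float)
y = x + rng.normal(0, 1, size=50)

# ...corrupted by two extreme outliers
y[10] = 200.0
y[40] = -150.0

pcc, _ = stats.pearsonr(x, y)
srcc, _ = stats.spearmanr(x, y)

print(pcc)   # pulled down sharply by the outliers
print(srcc)  # stays high: ranks limit each point's influence
```

Because ranks cap how far any single point can sit from its neighbours, the two outliers can shift SRCC only slightly, while they dominate the covariance term inside PCC.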

Another important aspect: correlation refers to the relationship between two variables, where changes in one variable are associated with changes in the other. When two variables are correlated, they tend to move together in a particular direction, either positively or negatively. It is essential to note that correlation does not imply causation; it merely signifies an association between variables.


Causality, conversely, describes a cause-and-effect relationship between two variables: changes in one variable directly induce changes in the other. Validating causality goes beyond recognizing a correlation; it requires showing unequivocally that one variable affects the other, ruling out external factors or mere chance as the explanation for the observed connection.

In essence, correlation and causality are intertwined yet distinct concepts. While correlation elucidates the linkage between two variables, causality establishes a direct cause-and-effect dynamic.

Thank you for taking the time to read this article. Your engagement and interest are greatly appreciated. If you found it useful, please consider sharing it with others who may benefit from it. Your support means a lot.

Feel free to follow me on Medium or connect with me on LinkedIn. I look forward to continuing this journey of learning and growth together. Keep exploring, keep learning, and keep growing. Until next time!


Written by Abraham Vensaslas

Passionate about ML and DataScience
