Correlation in machine learning — All you need to know

Abdallah Ashraf
7 min read · Sep 22, 2023


What is correlation?

Correlation is a key statistical concept that researchers employ to analyze connections within their data. It helps us understand the relationship between variables.

In research, we often study how different factors relate to each other. The connection between two or more variables is known as their correlation. Correlation refers to the degree to which the variables change together or co-vary.

It is not enough to simply examine how one variable increases or decreases independently. Correlation looks at the simultaneous fluctuations in both or all variables measured. A high correlation indicates the variables tend to move in tandem. A low correlation means the variables are not closely associated in their fluctuations.

Knowing the correlation helps uncover important relationships between the elements we are investigating. It provides insight into how changes in one variable may correlate with or predict changes in another. As researchers, we rely on correlation to better understand the links between different phenomena.

The correlation coefficient quantifies the strength and direction of the correlation. Values closer to 1 or -1 represent stronger correlations, while those closer to 0 indicate little connection between the variables.

Why is correlation important for machine learning?

It is important for machine learning engineers to understand the correlation between variables in their models for several key reasons:

1. Feature selection

Feature selection is the process of choosing which variables or features to use in a model. Highly correlated features provide redundant information, so feature selection aims to remove uninformative features and simplify models.

By analyzing correlations, researchers can identify redundant features and select a minimal set of important features that best represent the target variable. This helps prevent overfitting and improves a model’s ability to generalize. Feature selection guided by correlation analysis helps machine learning engineers build more accurate and efficient models by focusing only on the most informative variables correlated with the predicted output.
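As a rough illustration, here is a minimal sketch of one common approach: drop one feature from every pair whose absolute correlation exceeds a chosen cutoff. The column names, the simulated data, and the 0.9 threshold are all hypothetical.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()                                              # pairwise |r| between numeric columns
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep the upper triangle only
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Hypothetical feature matrix: `spend_usd` and `spend_eur` are near-duplicates by construction.
rng = np.random.default_rng(0)
spend = rng.normal(100, 20, 500)
X = pd.DataFrame({
    "spend_usd": spend,
    "spend_eur": spend * 0.92 + rng.normal(0, 0.5, 500),
    "clicks": rng.poisson(30, 500),
})
print(drop_highly_correlated(X).columns.tolist())  # 'spend_eur' is removed as redundant
```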

2. Reduce Bias

Correlation analysis is also important for ensuring model fairness and avoiding bias. When certain features are highly correlated with sensitive attributes like gender or ethnicity, it can inadvertently encode biases into machine learning models if not properly addressed.

For example, features related to a person’s neighborhood may correlate with their race due to historical segregation. If a model relies too heavily on these correlated features, it risks discriminating against or disadvantaging certain groups. By identifying correlations between input features and sensitive attributes, machine learning engineers can evaluate models for potential biases, monitor feature importance, and apply techniques like fair representation learning to mitigate bias. Understanding feature correlations helps create more just, inclusive models that treat all individuals with equality and make predictions based solely on merit rather than attributes outside of a person’s control. This helps fulfill the promise of AI to benefit all of humanity.
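A quick way to screen for such proxy features is to correlate each input with the sensitive attribute and flag anything above a chosen cutoff. This is only a sketch: the helper name, the 0.5 threshold, and the assumption that the sensitive attribute has been numerically encoded (e.g. as a binary column) are all mine, not from the article.

```python
import pandas as pd

def flag_proxy_features(X: pd.DataFrame, sensitive: pd.Series, threshold: float = 0.5) -> list[str]:
    """Flag features whose absolute correlation with a numerically encoded sensitive attribute exceeds `threshold`."""
    corrs = X.corrwith(sensitive).abs()
    return corrs[corrs > threshold].index.tolist()

# Hypothetical usage: `gender_encoded` is an assumed binary-encoded protected attribute.
# proxies = flag_proxy_features(features_df, df["gender_encoded"])
# Flagged columns (e.g. neighborhood-derived features) would then warrant a closer fairness review.
```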

3. Multicollinearity

Another important aspect of analyzing feature correlations is detecting multicollinearity. Multicollinearity occurs when two or more predictor variables in a model are highly linearly correlated with each other. It can negatively impact models by increasing variance and making it difficult to determine the significance and effect of individual predictors.

Variables with high multicollinearity provide redundant information, similar to how correlated features do. However, multicollinearity is more problematic because it inflates standard errors and undermines the reliability of the estimated coefficients. By examining correlation matrices and variance inflation factors, machine learning practitioners can identify cases of multicollinearity between input features. This allows them to address multicollinearity through techniques such as principal component analysis or ridge regression to improve model stability and interpretability. Understanding correlations is crucial for diagnosing and mitigating the adverse effects of multicollinearity on predictive modeling.
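For instance, variance inflation factors can be computed with statsmodels. The sketch below uses made-up column names and simulated data, plus the common rule of thumb that a VIF above roughly 5-10 signals problematic multicollinearity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: `impressions` and `reach` are strongly related by construction.
rng = np.random.default_rng(1)
impressions = rng.normal(1000, 100, 300)
X = pd.DataFrame({
    "impressions": impressions,
    "reach": impressions * 0.8 + rng.normal(0, 10, 300),
    "cpc": rng.normal(1.5, 0.3, 300),
})

X_const = sm.add_constant(X)  # include an intercept column before computing VIFs
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # `impressions` and `reach` show large VIFs; `cpc` stays near 1
```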

4. Interpretability and Debugging

Understanding correlations also aids in interpreting machine learning models. As models become increasingly complex with many interacting variables, it can be difficult to explain why a model makes certain predictions.

By analyzing the correlation between input features and output targets, researchers gain insights into which variables have the strongest impact on the model’s decisions. This helps ensure the model is actually learning meaningful patterns in the data rather than spurious correlations. Knowing feature correlations further assists in debugging models that perform poorly. It allows engineers to identify any features that may be overwhelming the model or causing unintended biases. In summary, correlation analysis provides crucial information for building transparent, robust machine learning systems that practitioners can have confidence in deploying.
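One simple diagnostic along these lines is to rank features by their absolute correlation with the target, which highlights the inputs that dominate the signal. The helper and column names below are hypothetical.

```python
import pandas as pd

def rank_features_by_target_correlation(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank features by the absolute value of their correlation with the target."""
    return X.corrwith(y).abs().sort_values(ascending=False)

# Hypothetical usage:
# ranking = rank_features_by_target_correlation(features_df, target_series)
# print(ranking.head(10))  # the strongest candidates for driving the model's predictions
```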

Measures of Correlation

1. Pearson’s correlation coefficient

To calculate Pearson’s r, a line of best fit is determined for the two variables using linear regression. This regression line represents the linear relationship that best predicts the values of one variable based on the other. The correlation coefficient is then computed based on how far each data point deviates from this regression line. Data points that lie exactly on the line have a deviation of zero, while points farther away have higher deviations. Pearson’s r factors in both the direction and magnitude of all these deviations to produce a measure between -1 and 1, indicating the overall linear association between the variables. A value closer to the extremes represents less deviation and stronger linear correlation, while a value near zero suggests the data are poorly described by a linear relationship.

Put simply, the Pearson correlation coefficient, r, indicates how far all of these data points lie from the line of best fit.

  1. Pearson’s r can correlate variables measured on different scales. For example, it could assess the relationship between average temperature (measured in degrees Celsius) and number of ice cream sales (measured as a daily count).
  2. As a dimensionless index, r is unaffected by the original measurement units or scale. Whether temperature was in Fahrenheit or sales in dollars wouldn’t alter the correlation value.
  3. The correlation computed remains identical regardless of how the variables are labeled (e.g. as independent or dependent). Examining if sales drive temperature changes versus temperature influencing sales would generate the same r result, as the coefficient considers only covariation between paired observations.
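To make the temperature and ice cream example above concrete, scipy computes r directly. The numbers below are simulated for illustration, not real sales data.

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated daily data: warmer days tend to produce more ice cream sales.
rng = np.random.default_rng(42)
temperature_c = rng.uniform(15, 35, 100)                   # degrees Celsius
sales = 20 * temperature_c + rng.normal(0, 50, 100)        # daily sales count (noisy linear trend)

r, p_value = pearsonr(temperature_c, sales)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
# Rescaling either variable (e.g. converting to Fahrenheit) leaves r unchanged,
# and pearsonr(sales, temperature_c) returns the same value.
```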

2. Spearman’s correlation coefficient

Spearman’s coefficient, rs, assesses how well an arbitrary monotonic function describes the relationship between two variables, rather than specifically testing for a linear association. A monotonic relationship is one in which, as one variable increases, the other consistently increases or consistently decreases. This allows Spearman’s correlation to identify nonlinear relationships that may not be evident when using Pearson’s r. It does so by first ranking all data points in each variable from smallest to largest value, then computing the correlation on these ranks rather than on the original measurements. As such, Spearman’s rs is more flexible and can identify more complex monotonic relationships beyond linear trends alone.

Comparing Pearson and Spearman correlations

The main difference between Pearson’s r and Spearman’s rs is that r only considers linear relationships, while rs detects monotonic associations between variables regardless of whether they follow a straight-line pattern. Pearson’s is more appropriate when the variables are expected to have a linear relationship. Spearman’s is preferable when the relationship may be nonlinear but still consistently increasing or decreasing. Another key distinction is that Pearson’s requires continuous variables, whereas Spearman’s can handle ordinal data as well. So Spearman’s correlation offers a more general approach, at the cost of being less sensitive to purely linear trends than Pearson’s r.
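The difference is easy to see on a simulated relationship that is monotonic but not linear; this is an illustrative sketch, not data from the article.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# y grows monotonically with x, but exponentially rather than linearly.
rng = np.random.default_rng(7)
x = rng.uniform(0, 5, 200)
y = np.exp(x) + rng.normal(0, 1, 200)

print(f"Pearson r   = {pearsonr(x, y)[0]:.2f}")   # lower, because the trend is curved rather than linear
print(f"Spearman rs = {spearmanr(x, y)[0]:.2f}")  # near 1, because the ranks move together almost perfectly
```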

The Limitations of Correlation

While correlation analysis is useful for identifying relationships between variables, it is important to note that correlation does not necessarily imply causation. Simply because two factors vary together based on the available data does not mean that one factor causes changes in the other. There could be some third, underlying variable influencing both.

For example, sales of ice cream and number of shark attacks are correlated, but ice cream consumption clearly does not cause shark attacks. The true causal relationship may go in the opposite direction or involve other external drivers like hot weather. Machine learning models based solely on correlational data can suggest spurious causal links. More rigorous experimentation and domain expertise are needed to establish credible causal explanations. Understanding the distinction between correlation and causation helps machine learning practitioners avoid making unfounded causal claims and build more robust models that account for complex real-world phenomena.

Example

Heatmap correlation matrix for Facebook ad campaign

Analysing the performance data from a recent Facebook ad campaign promoting a new product launch, we found that the total conversions, which refer to the number of people who clicked on the ad and visited the product page to inquire more about it, were highly correlated with the approved conversions, or the number of people who ultimately purchased the product after seeing the ad. This indicates that when more people were driven to learn about the product through the ad, it was more likely to translate into actual sales and revenue for the company. The data suggests the campaign was effective at generating interest from viewers and converting that initial interest into real customers.
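A heatmap like the one above can be produced in a few lines with pandas and seaborn. The columns below are hypothetical stand-ins for the campaign fields (in practice you would load the exported campaign data, e.g. with pd.read_csv), so the exact names and numbers are not the real schema.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the campaign export, with correlated metrics by construction.
rng = np.random.default_rng(3)
clicks = rng.poisson(50, 200)
df = pd.DataFrame({
    "impressions": clicks * 120 + rng.normal(0, 500, 200),
    "clicks": clicks,
    "spent": clicks * 0.8 + rng.normal(0, 5, 200),
    "total_conversions": clicks * 0.1 + rng.normal(0, 1, 200),
})
df["approved_conversions"] = df["total_conversions"] * 0.6 + rng.normal(0, 0.5, 200)

sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix for the ad campaign features")
plt.tight_layout()
plt.show()
```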

We initially attempted to include the total number of conversions as a feature during model training. However, we found that this led to overfitting, with the model simply learning to optimize for total conversions rather than learning meaningful patterns from the data. To address this overfitting issue, we dropped the total conversions feature and relied only on the other behavioral and contextual features. This change allowed the model to better generalize without simply optimizing for a single metric like total conversions. While total conversions remained the ultimate business goal, removing it as a direct training feature helped the model learn patterns that drove approved conversions rather than just maximizing a single number.
