Machine Learning

How to measure the “non-linear correlation” between multiple variables?

A presentation of the information and prediction scores

Gabriel de Longeaux
Analytics Vidhya

--

This article presents two ways of calculating the non-linear correlation between any number of discrete variables. The objective, for a data analysis project, is twofold: on the one hand, to know how much information the variables share with each other, and therefore whether the available data contain the information one is looking for; on the other hand, to identify which minimal set of variables contains the largest amount of useful information.

The different types of relationships between variables

Linearity

The best-known relationship between several variables is the linear one. This is the type of relationship measured by the classical correlation coefficient: the closer it is, in absolute value, to 1, the closer the variables are to an exact linear relationship.

[Figure: correlation between X and Y = 90%]

However, there are plenty of other potential relationships between variables, which cannot be captured by the measurement of conventional linear correlation.

[Figure: correlation between X and Y is almost 0%]
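As a quick illustration, here is a minimal Python sketch of this effect (the quadratic relationship Y = X² is just one arbitrary choice):

import numpy as np

# A perfectly deterministic but non-linear relationship: Y = X^2
x = np.linspace(-1, 1, 201)
y = x ** 2

# The Pearson correlation is (numerically) zero, even though Y is entirely
# determined by X, because the relationship is symmetric around x = 0.
print(np.corrcoef(x, y)[0, 1])  # ~0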

To find such non-linear relationships between variables, other correlation measures should be used. The price to pay is to work only with discrete, or discretized, variables.

In addition, having a method for calculating multivariate correlations makes it possible to take into account the two main types of interaction that variables may present: information redundancy and information complementarity.

Redundancy

When two variables (hereafter X and Y) share information in a redundant manner, the amount of information provided by X and Y together to predict a third variable Z is less than the sum of the amounts of information provided by X alone and by Y alone to predict Z.

In the extreme case, X = Y. Then, if the values taken by Z can be correctly predicted 50% of the time from X (and therefore from Y), the values taken by Z still cannot be predicted perfectly (i.e. 100% of the time) from X and Y together.

╔═══╦═══╦═══╗
║ X ║ Y ║ Z ║
╠═══╬═══╬═══╣
║ 0 ║ 0 ║ 0 ║
║ 0 ║ 0 ║ 0 ║
║ 1 ║ 1 ║ 0 ║
║ 1 ║ 1 ║ 1 ║
╚═══╩═══╩═══╝
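To make the redundancy concrete, here is a small Python sketch that estimates the mutual information directly from the table above (the helper functions are my own, written for readability rather than speed):

from collections import Counter
from math import log

def entropy(*columns):
    """Empirical joint entropy, in nats, of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log(c / n) for c in Counter(joint).values())

def mutual_info(descriptors, z):
    """I(descriptors ; Z), where descriptors is a tuple of columns."""
    return entropy(*descriptors) + entropy(z) - entropy(*descriptors, z)

# The redundancy table: X and Y are identical.
X = [0, 0, 1, 1]
Y = [0, 0, 1, 1]
Z = [0, 0, 0, 1]

print(mutual_info((X,), Z))    # ~0.216 nats
print(mutual_info((Y,), Z))    # ~0.216 nats
print(mutual_info((X, Y), Z))  # ~0.216 nats, i.e. less than the 0.432 nats sum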

Complementarity

The complementarity relationship is the exact opposite situation. In the extreme case, X provides no information about Z, and neither does Y, but X and Y together make it possible to predict perfectly the values taken by Z. In such a case, the correlation between X and Z is zero, as is the correlation between Y and Z, but the correlation between X, Y and Z is 100%.

These complementarity relationships only occur with non-linear relationships, and must therefore be taken into account to avoid errors when trying to reduce the dimensionality of a data analysis problem: discarding X and Y because, considered individually, they provide no information about Z would be a bad idea.

╔═══╦═══╦═══╗
║ X ║ Y ║ Z ║
╠═══╬═══╬═══╣
║ 0 ║ 0 ║ 0 ║
║ 0 ║ 1 ║ 1 ║
║ 1 ║ 0 ║ 1 ║
║ 1 ║ 1 ║ 0 ║
╚═══╩═══╩═══╝
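The same check on the XOR-like table above confirms the complementarity. This sketch relies on scikit-learn's mutual_info_score, which returns mutual information in nats (any empirical estimator would do):

import numpy as np
from sklearn.metrics import mutual_info_score

# The complementarity table: Z = X XOR Y
X = np.array([0, 0, 1, 1])
Y = np.array([0, 1, 0, 1])
Z = np.array([0, 1, 1, 0])

# Each descriptor alone tells us nothing about Z...
print(mutual_info_score(X, Z))   # 0.0
print(mutual_info_score(Y, Z))   # 0.0

# ...but the pair (X, Y) determines Z completely.
XY = [f"{x}{y}" for x, y in zip(X, Y)]  # encode the joint variable as labels
print(mutual_info_score(XY, Z))  # ~0.693 nats = ln(2) = H(Z)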

Two possible measures of “multivariate non-linear correlation”

There are many possible measures of (multivariate) non-linear correlation (e.g. multivariate mutual information, the maximal information coefficient (MIC), etc.). I present here two of them whose properties, in my opinion, are exactly what one would expect from such measures. The only caveats are that they require discrete variables and that they are very computationally intensive.

Symmetric measure

The first one is a measure of the information shared by n variables V1, …, Vn, known as “dual total correlation” (among other names).

This measure of the information shared by the different variables can be characterized as:

DTC(V1, …, Vn) = H(V1, …, Vn) − Σ_i H(Vi | V1, …, Vi−1, Vi+1, …, Vn)

where H denotes the entropy: H(V1, …, Vn) is the joint entropy of all the variables, and H(Vi | ·) is the conditional entropy of Vi given all the other variables.

When normalized by H(V1, …, Vn), this “mutual information score” takes values ranging from 0% (meaning that the n variables are not at all similar) to 100% (meaning that the n variables are identical, except for the labels).

This measure is symmetric because the information shared by X and Y is exactly the same as the information shared by Y and X.

[Venn diagram: joint entropy of V1, V2 and V3]

The Venn diagram above shows the “variability” (entropy) of the variables V1, V2 and V3 with circles. The shaded area represents the entropy shared by the three variables: it is the dual total correlation.
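As a concrete illustration, here is a minimal Python sketch of the normalized dual total correlation for discrete variables, following the definition above (the function names and the toy data are my own):

from collections import Counter
from math import log

def entropy(*columns):
    """Empirical joint entropy, in nats, of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log(c / n) for c in Counter(joint).values())

def mutual_information_score(*variables):
    """Dual total correlation normalized by the joint entropy of the variables."""
    h_all = entropy(*variables)
    # H(Vi | all the other variables) = H(all) - H(all except Vi)
    conditional = [h_all - entropy(*(v for j, v in enumerate(variables) if j != i))
                   for i in range(len(variables))]
    # Between 0 (no shared information) and 1 (identical up to labels)
    return (h_all - sum(conditional)) / h_all

V1 = [0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
V2 = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0]
V3 = [0, 1, 1, 0, 0, 0, 1, 1, 0, 1]
print(mutual_information_score(V1, V2, V3))  # a value between 0 and 1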

Asymmetric measure

The symmetry of usual correlation measures is sometimes criticized. Indeed, if I want to predict Y as a function of X, I do not care whether X and Y have little information in common: all I care about is that the variable X contains all the information needed to predict Y, even if Y gives very little information about X. For example, if X takes animal species as values and Y takes animal families, then X makes it easy to know Y, but Y gives little information about X:

╔═════════════════════════════╦══════════════════════════════╗
║ Animal species (variable X) ║ Animal families (variable Y) ║
╠═════════════════════════════╬══════════════════════════════╣
║ Tiger                       ║ Feline                       ║
║ Lynx                        ║ Feline                       ║
║ Serval                      ║ Feline                       ║
║ Cat                         ║ Feline                       ║
║ Jackal                      ║ Canid                        ║
║ Dhole                       ║ Canid                        ║
║ Wild dog                    ║ Canid                        ║
║ Dog                         ║ Canid                        ║
╚═════════════════════════════╩══════════════════════════════╝

The “information score” of X to predict Y should then be 100%, while the “information score” of Y for predicting X will be, for example, only 10%.

In plain terms, if the variables D1, …, Dn are descriptors and the variables T1, …, Tm are target variables (to be predicted from the descriptors), then such an information score is given by the following formula:

where H(V) expresses the entropy of variable V.

This “prediction score” also ranges from 0% (if the descriptors do not predict the target variables) to 100% (if the descriptors perfectly predict the target variables). This score is, to my knowledge, completely new.

[Venn diagram: share of the entropy of D1 and D2 useful to predict T1]

The shaded area in the above diagram represents the entropy shared by the descriptors D1 and D2 with the target variable T1. The difference with the dual total correlation is that the information shared by the descriptors but not related to the target variable is not taken into account.
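One possible way to compute such a score, consistent with the description above, is to take the mutual information between the joint descriptors and the joint targets, normalized by the joint entropy of the targets; the sketch below uses this form (it is one formulation among others, and the helper names are my own). It does equal 0% when the descriptors say nothing about the targets and 100% when they determine them completely:

from collections import Counter
from math import log

def entropy(*columns):
    """Empirical joint entropy, in nats, of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log(c / n) for c in Counter(joint).values())

def prediction_score(descriptors, targets):
    """Assumed form: I(D1..Dn ; T1..Tm) / H(T1..Tm)."""
    h_targets = entropy(*targets)
    shared = entropy(*descriptors) + h_targets - entropy(*descriptors, *targets)
    return shared / h_targets

# The species/families example: X determines Y perfectly, but not the reverse.
species  = ["Tiger", "Lynx", "Serval", "Cat", "Jackal", "Dhole", "Wild dog", "Dog"]
families = ["Feline", "Feline", "Feline", "Feline", "Canid", "Canid", "Canid", "Canid"]

print(prediction_score([species], [families]))  # 1.0: X predicts Y perfectly
print(prediction_score([families], [species]))  # ~0.33 on this toy table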

Computation of the information scores in practice

A direct method to calculate the two scores presented above is based on the estimation of the entropies of the different variables, or groups of variables.

In R, the entropy function of the ‘infotheo’ package gives us exactly what we need. The calculation of the joint entropy of three variables V1, V2 and V3 is very simple:

library(infotheo)

df <- data.frame(V1 = c(0,0,1,1,0,0,1,0,1,1),
                 V2 = c(0,1,0,1,0,1,1,0,1,0),
                 V3 = c(0,1,1,0,0,0,1,1,0,1))

entropy(df)
[1] 1.886697

The computation of the joint entropy of several variables in Python requires some additional work. The BIOLAB contributor, on the blog of the Orange software, suggests the following function:

import numpy as np
import itertools
from functools import reduce

def entropy(*X):
    """Empirical joint entropy (in nats) of the discrete variables passed as arguments."""
    return sum(
        -p * np.log(p) if p > 0 else 0
        # p is the empirical probability of each combination of values (classes)
        for p in (
            np.mean(reduce(np.logical_and,
                           (predictions == c for predictions, c in zip(X, classes))))
            for classes in itertools.product(*[set(x) for x in X])
        )
    )

V1 = np.array([0,0,1,1,0,0,1,0,1,1])
V2 = np.array([0,1,0,1,0,1,1,0,1,0])
V3 = np.array([0,1,1,0,0,0,1,1,0,1])

entropy(V1, V2, V3)
1.8866967846580784

In each case, the entropy is given in nats, the “natural unit of information”.
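If bits are preferred, it suffices to divide by ln(2), for instance with the result of the Python snippet above:

import numpy as np

h_nats = 1.8866967846580784  # joint entropy of V1, V2, V3 computed above
h_bits = h_nats / np.log(2)
print(h_bits)  # ~2.72 bits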

For a large number of dimensions, the information scores are no longer computable in practice, as the entropy calculation becomes too computationally intensive and time-consuming. It is also not desirable to calculate information scores when the number of samples is not large enough compared to the number of dimensions, because the information score then “overfits” the data, just like a classical machine learning model. For instance, if only two samples are available for two variables X and Y, a linear regression will fit them perfectly:

╔════╦═════╗
║ X  ║ Y   ║
╠════╬═════╣
║ 0  ║ 317 ║
║ 10 ║ 40  ║
╚════╩═════╝

Basic example of overfitting

Similarly, let’s imagine that I take temperature measurements over time, making sure to note the time of day for each measurement. I can then try to explore the relationship between time of day and temperature. If the number of samples is too small relative to the number of dimensions of the problem, the chances are high that the information scores overestimate the relationship between the two variables:

╔══════════════════╦════════════════╗
║ Temperature (°C) ║ Hour (0 to 24) ║
╠══════════════════╬════════════════╣
║ 23               ║ 10             ║
║ 27               ║ 15             ║
╚══════════════════╩════════════════╝

In the above example, based on the only two observations available, the two variables appear to be in perfect bijection: the information scores will be 100%.
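A quick check with the entropy-based helper used earlier makes the point (the two rows above are the only input):

from collections import Counter
from math import log

def entropy(*columns):
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log(c / n) for c in Counter(joint).values())

temperature = [23, 27]
hour = [10, 15]

# With two distinct observations, H(temperature) = H(hour) = H(temperature, hour) = ln(2),
# so any score built from these entropies saturates at 100%.
shared = entropy(temperature) + entropy(hour) - entropy(temperature, hour)
print(shared / entropy(hour))  # 1.0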

It should therefore be remembered that information scores, like machine learning models, are capable of “overfitting”, and much more so than linear correlation, since linear models are by nature limited in complexity.

Example of prediction score use

The Titanic dataset contains information about 887 passengers who were on board when the ship collided with an iceberg: the price they paid for boarding (Fare), their class (Pclass), their name (Name), their gender (Sex), their age (Age), the number of their relatives on board (Parents/Children Aboard and Siblings/Spouses Aboard), and whether they survived (Survived).

This dataset is typically used to determine the probability that a person had of surviving, or more simply to “predict” whether the person survived, by means of the individual data available (excluding the Survived variable).

So, for different combinations of the descriptors, I calculated the prediction score with respect to the Survived variable. I removed the nominative data (otherwise the prediction score would be 100% because of overfitting) and discretized the continuous variables. Some results are presented below:

[Table: prediction scores for various combinations of descriptors. Purely illustrative example; results depend on the discretization method]

The first row of the table gives the prediction score when all the descriptors are used to predict the target variable: since this score is above 80%, it is clear that the available data make it possible to predict the target variable Survived with good precision.

Cases of information redundancy can also be observed: the variables Fare, Pclass and Sex together are correlated at 41% with the Survived variable, while the sum of their individual correlations amounts to 43% (11% + 9% + 23%).

There are also cases of complementarity: the variables Age, Fare and Sex are almost 70% correlated with the Survived variable, while the sum of their individual correlations is not even 40% (3% + 11% + 23%).

Finally, if one wishes to reduce the dimensionality of the problem and find a “sufficiently good” model using as few variables as possible, it is better to use the three variables Age, Fare and Sex (prediction score of 69%) than the four variables Fare, Parents/Children Aboard, Pclass and Siblings/Spouses Aboard (prediction score of 33%): this captures twice as much useful information with one variable fewer.

Calculating the prediction score can therefore be very useful in a data analysis project, to ensure that the data available contain sufficient relevant information, and to identify the variables that are most important for the analysis.
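To reproduce this kind of analysis, here is a possible sketch in Python. The file name titanic.csv, the three-bin discretization and the score formulation used below are my own choices; the scores quoted above depend on the discretization actually used.

import pandas as pd
from collections import Counter
from math import log

def entropy(*columns):
    """Empirical joint entropy, in nats, of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log(c / n) for c in Counter(joint).values())

def prediction_score(descriptors, target):
    """Assumed form: I(descriptors ; target) / H(target)."""
    h_target = entropy(target)
    shared = entropy(*descriptors) + h_target - entropy(*descriptors, target)
    return shared / h_target

df = pd.read_csv("titanic.csv")  # hypothetical local copy of the 887-passenger dataset

# Discretize the continuous variables (3 quantile bins here; this choice matters).
age = pd.qcut(df["Age"], 3, labels=False, duplicates="drop")
fare = pd.qcut(df["Fare"], 3, labels=False, duplicates="drop")

score = prediction_score([age, fare, df["Sex"]], df["Survived"])
print(f"Prediction score of (Age, Fare, Sex) for Survived: {score:.0%}")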
