An overview of correlation measures between categorical and continuous variables

Outside Two Standard Deviations
14 min read · Sep 13, 2018

For the last few days I have been thinking a lot about different ways of measuring correlation between variables, and about their pros and cons. Here’s the problem: there are two kinds of variables, continuous and categorical (sometimes called discrete or factor variables), and hence we need either a single metric or several different metrics that can quantify correlation or association between continuous-continuous, categorical-categorical and categorical-continuous variable pairs. Computing correlation can be broken down into two sub-problems: (i) testing whether there is a statistically significant correlation between two variables, and (ii) quantifying the association, or ‘goodness of fit’, between the two variables. Ideally, we also need to be able to compare such goodness-of-fit metrics across variable pair classes on some universal scale.

This problem becomes important if the matrix you are analyzing has a combination of categorical and continuous variables. In these cases, if you want a universal criterion for dropping columns above a certain correlation from further analyses, it is important that all the correlations computed are comparable. There is no single technique that correlates all three kinds of variable pairs, so having such a universal scale for comparing correlations obtained from different methods is tricky and needs some thinking.
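To make the two sub-problems concrete, here is a minimal Python sketch that returns both a significance test (a p-value) and an effect size for each of the three pair types. The specific choices are assumptions made for illustration and are not named in the text above: Pearson's r for continuous-continuous, Cramér's V (from a chi-squared test) for categorical-categorical, and the correlation ratio eta (alongside a one-way ANOVA) for categorical-continuous. The function names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def continuous_continuous(x, y):
    """Pearson's r (absolute value, so it lies in [0, 1]) and its p-value."""
    r, p = stats.pearsonr(x, y)
    return abs(r), p


def categorical_categorical(a, b):
    """Cramér's V in [0, 1] and the chi-squared test p-value."""
    table = pd.crosstab(np.asarray(a), np.asarray(b))
    # correction=False: skip Yates' continuity correction so V can reach 1.
    chi2, p, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)), p


def categorical_continuous(cat, x):
    """Correlation ratio eta in [0, 1] and the one-way ANOVA p-value."""
    df = pd.DataFrame({"cat": np.asarray(cat), "x": np.asarray(x, dtype=float)})
    groups = [g.to_numpy() for _, g in df.groupby("cat")["x"]]
    _, p = stats.f_oneway(*groups)
    grand_mean = df["x"].mean()
    # eta^2 = between-group sum of squares / total sum of squares
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((df["x"] - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total), p


if __name__ == "__main__":
    # Synthetic data, purely illustrative.
    rng = np.random.default_rng(0)
    height = rng.normal(170, 10, size=200)
    weight = 0.5 * height + rng.normal(0, 5, size=200)
    group = np.where(height > 170, "tall", "short")
    label = np.where(weight > 85, "heavy", "light")

    print("continuous-continuous  :", continuous_continuous(height, weight))
    print("categorical-categorical:", categorical_categorical(group, label))
    print("categorical-continuous :", categorical_continuous(group, weight))
```

All three effect sizes fall between 0 and 1, but that alone does not make them interchangeable: a value of 0.6 for Cramér's V and 0.6 for Pearson's r do not necessarily indicate the same strength of association, which is exactly why a single drop-the-column threshold needs care.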

You might be wondering why anyone would ever need to compare correlation metrics between different variable types. In general, knowing if two variables are correlated and…


A blog about things in AI, healthcare and biotechnology. Things outside two standard deviations :)