On Maximal Information Coefficient: A Modern Approach for Finding Associations in Large Data Sets

Rhondene Wint
4 min read · Jan 12, 2019


In this two-part blog post, I will share with you a new way to measure relationships in high-dimensional data.

Part 1

I recently came across David and Yakir Reshef’s presentation on ‘Detecting Novel Associations in Large Data Sets’ (the namesake of their paper) at the 2017 Broad Institute Models, Inference and Algorithms meeting (video playlist), where they presented the intuition behind, and applications of, their association metric, the maximal information coefficient (MIC), which is based on information theory. They gave such an enlightening and convincing presentation that the following day I read as many articles on MIC as I could find. This article provides an overview of association metrics, a high-level explanation of MIC, its advantages and disadvantages, and programming packages that implement MIC.

Measuring Associations:

Identifying the kinds of relationships (aka associations) between variables is necessary for gaining insight into our experiments, or for choosing the best set of features for building an accurate model, since not all features may be sufficiently informative. We compute measures of association to quantify the strength and/or direction (positive or negative) of the relationship between two variables, which lets us infer whether the variables are dependent or independent. A value of zero indicates no statistical association between the variables, as judged by that particular measure.

Types of Relationships (R. Wint)

There are several parametric and non-parametric measures of association, such as Pearson’s R, distance correlation, mutual information, etc. However, some correlation metrics are limited in the kinds of associations they can detect, or they make assumptions about the underlying distributions of the variables. For example, Pearson’s R only detects linear dependencies, so two variables can have a Pearson’s R of 0.0 and still be dependent (see the sketch below). Given that many biological processes are defined by complex patterns that deviate from linear behaviour, we would like to use a more versatile statistical measure of association. Enter the maximal information coefficient.
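To make this concrete, here is a minimal sketch (assuming NumPy and SciPy are installed) contrasting a noisy linear relationship, which Pearson’s R captures well, with a noiseless quadratic one, which it misses entirely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)

# A noisy linear relationship: Pearson's R reports both strength and direction
y_linear = 2 * x + rng.normal(0, 0.3, size=1000)
r_linear, _ = stats.pearsonr(x, y_linear)

# A perfectly deterministic but non-linear relationship: Pearson's R misses it
y_quadratic = x ** 2
r_quadratic, _ = stats.pearsonr(x, y_quadratic)

print(f"linear:    R = {r_linear:.3f}")     # close to +1
print(f"quadratic: R = {r_quadratic:.3f}")  # close to 0, yet y is fully determined by x
```

Even though y is a deterministic function of x in the second case, the symmetry of the parabola makes the linear correlation vanish.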

Maximal Information Coefficient

The maximal information coefficient (Reshef, Reshef, et al. 2011) is an information theory-based measure of association that can capture a wide range of functional and non-functional relationships between variables.

Comparison of MIC to other measures of association (Reshef et al. 2011).

For noisy functional relationships, MIC approximates the coefficient of determination (R²) of the data relative to the underlying noiseless function. MIC takes values between 0 and 1, where 0 indicates statistical independence and 1 indicates a completely noiseless relationship.
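The definition from the 2011 paper can be sketched as follows (in my notation, with n_x and n_y denoting the grid dimensions):

$$\mathrm{MIC}(X, Y) \;=\; \max_{\,n_x n_y \,<\, B(n)} \; \frac{I^{*}(X, Y;\, n_x, n_y)}{\log_2 \min(n_x, n_y)}$$

Here I*(X, Y; n_x, n_y) is the largest mutual information achievable by any grid with n_x columns and n_y rows drawn over the scatterplot of X and Y, n is the sample size, and B(n) (the paper recommends n^0.6) caps the total number of grid cells searched.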

From the formula we can see that MIC(X,Y) is the largest mutual information between random variables X and Y over all grids, normalized by log₂ of the smaller grid dimension, which is the maximum value the mutual information on that grid could attain. I interpret MIC as, roughly, the fraction of a variable Y that can be explained by a variable X. In addition to generalizing well over a range of relationships, another useful property of MIC is its equitability. This means that MIC assigns similar scores to equally noisy relationships, regardless of the type of relationship. This is good because, a lot of the time, we do not know the distribution of our data or the nature of the relationships between variables.

Pros of MIC:

  • Captures a wide range of linear and non-linear relationships (cubic, exponential, sinusoidal, superpositions of functions).
  • Symmetric, because it is based on mutual information.
  • Makes no assumptions about the distributions of the variables.
  • Robust to outliers because of its mutual information foundation.
  • Ranges over [0, 1], which makes it easy to interpret and to compare across variable pairs.

Cons of MIC:

  • Does not report the direction or the type of relationship.
  • Computationally expensive. However, several optimized algorithms for approximating the MIC have been published.
  • Statistical power deficiency: when Simon and Tibshirani compared the performance of MIC with Pearson’s R and distance correlation on simulated data, they reported that MIC had lower power in most cases (Simon and Tibshirani 2011).

Resources for MIC in Python and R

If you would like to try out MIC on your data, here is a Python package and an R package.
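As a quick starting point, here is a minimal sketch using the minepy package (one commonly used Python implementation of the MINE statistics; the package choice and parameters below are my own, not necessarily the ones linked above), comparing a noisy linear and a noisy sinusoidal relationship:

```python
import numpy as np
from minepy import MINE  # pip install minepy

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=500)
noise = rng.normal(0, 0.1, size=500)

# Two noisy relationships of very different shapes
y_linear = x + noise
y_sine = np.sin(4 * np.pi * x) + noise

mine = MINE(alpha=0.6, c=15)  # default parameters suggested by Reshef et al. 2011

for name, y in [("linear", y_linear), ("sinusoidal", y_sine)]:
    mine.compute_score(x, y)
    print(f"{name:>10s}: MIC = {mine.mic():.3f}")
```

Pearson’s R on the sinusoidal pair would be close to zero, while MIC scores both relationships as strong associations.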

I am going to try MIC for myself in Python and share my experience with you in Part 2. Thank you for reading, and I hope this was useful.
