How Much Juice Is There In Your Data?

An Information-Theoretic Answer Applied To Kaggle, With Code

--

Image by the author

97%. That’s the percentage of data that sits unused by organizations according to Gartner, making up so-called “dark data”.

Data has overtaken oil as the world’s most valuable resource, but nearly all of it is still unused by organizations. Gartner estimates that 87% of organizations have “low BI and analytics maturity”.

A possible explanation for this paradox is that not all data are created equal. The value of data can vary drastically from one organization to another, and within the same organization from one project to another.

To justify the required investment in business intelligence and analytics, organizations should be able to anticipate, with reasonable accuracy, the business impact of doing so, and the investment should be expected to yield a high enough ROI.

Data Cannot Become the New Oil Without The Machine Learning Equivalent of Exploration Geophysics

Assuming that data is always valuable and attempting to extract that value by trial and error as part of an AI project, whether or not it is driven by an AutoML platform, is wasteful at best and can have a negative ROI in the worst case.

To stick with the oil analogy, this would be equivalent to assuming that there is always oil in the ground wherever one looks, and that the only factor driving the amount of oil extracted is the extraction technique one uses.

Over the years, an entire field of study at the intersection of geophysics and economics, namely exploration geophysics, has been devoted to reducing the business risk in oil production. Exploration geophysics relies on inductive reasoning (as opposed to deductive reasoning) to detect the presence of and to estimate the amount of valuable geological deposits at a given location, without incurring the upfront cost and risk of building an exploitation site.

Similarly, in order to reduce the business risk in investing in AI projects, it is crucial to develop inductive reasoning methods for quantifying the value of data, prior to and independently from doing any predictive modeling, a phase we refer to as pre-learning.

Whereas the deductive approach for valuing a dataset would consist of first allocating resources to put the dataset to use and then monitoring the business impact for a while, the inductive approach consists of using mathematical reasoning to infer from the dataset of interest the highest performance that can be achieved by any predictive model, inexpensively, and without having to train any predictive model.

In this article, we summarize the theoretical foundations that enable pre-learning, and we illustrate how to quantify the juice in various datasets using the open-source kxy python package.

What is juice to start with?

The juice in data refers to the amount of information in data that is useful for solving a particular problem at hand.

(Left) Photo by Rinck Content Studio on Unsplash | (Right) Illustration by Instagram user @skwoodlekids

The same way the juice in an orange (or the oil in the ground) exists in its own right, whether or not and however it is extracted, it is important to realize that every dataset can conceptually be thought of as made up of a (possibly empty) part that is useful for solving the problem at hand, and a useless remainder.

In this respect, two points are worth stressing. First, what’s useful (resp. useless) is problem-specific. The part of a dataset that is useless for solving a specific problem can be useful for solving another problem.

Second, what’s useful in a dataset for solving a given problem is not tied to a specific method for solving the problem. The same way the total amount of juice contained in a given orange is the maximum amount of liquid that can be extracted out of the orange, irrespective of how it is squeezed, the total amount of juice in a dataset is the maximum amount of utility that can be extracted out of the dataset to solve a specific problem, no matter what machine learning model is used to solve the problem.

To lay out the foundations of pre-learning, we need to formalize what we mean by ‘problem’ and ‘useful’.

The problems we are concerned with are classification and regression problems, without restriction on the type of inputs or outputs. Specifically, we consider predicting a business outcome y using a vector of inputs x. The choice of y is intrinsic to the business problem we are interested in solving, while inputs x represent the dataset we are considering using to solve our problem.

As usual, we represent our uncertainty about the values of y and x by modeling them as random variables.¹ Saying that our dataset is useful for solving the problem of interest is equivalent to saying that inputs x are informative about the label/output y.

Fortunately, notions of informativeness and association are fully formalized by Information Theory. You’ll find a gentle primer on information theory here. For the purpose of this article, if we denote h(z) the entropy of a random variable z (differential entropy for continuous variables, Shannon entropy for categorical variables), it suffices to recall that the canonical measure of how informative x is about y is their mutual information, defined as

I(y; x) := h(y) + h(x) - h(y, x) = h(y) - h(y|x).
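
To make the definition concrete, here is a small, self-contained sketch (plain NumPy, unrelated to the kxy package) that computes the mutual information of a toy discrete joint distribution directly from the entropy identity above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Toy joint distribution of (y, x): rows index values of y, columns index values of x.
p_yx = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
p_y = p_yx.sum(axis=1)  # marginal distribution of y
p_x = p_yx.sum(axis=0)  # marginal distribution of x

# I(y; x) = h(y) + h(x) - h(y, x)
mi = entropy(p_y) + entropy(p_x) - entropy(p_yx.flatten())
print(f"I(y; x) = {mi:.3f} nats")  # positive, because y and x are not independent
```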

Some Key Properties of Mutual Information

The mutual information I(y; x) is well defined whether the output is categorical or continuous and whether inputs are continuous, categorical, or a combination of both. For some background reading on why this is the case, check out this primer and references therein.

It is always non-negative and it is equal to 0 if and only if y and x are statistically independent (i.e. there is no relationship whatsoever between y and x).

Additionally, the mutual information is invariant under lossless feature transformations. Indeed, if f and g are two one-to-one maps, then

I(f(y); g(x)) = I(y; x).

A more general result, known as the Data Processing Inequality, states that transformations applied to x can only reduce its mutual information with y. Specifically, for any function f,

I(y; f(x)) \leq I(y; x),

and the equality holds when f is either a one-to-one map or, more generally, when y and x are statistically independent given f(x), i.e.

I(y; x | f(x)) = 0,

meaning that all the information about y that is contained in x is fully reflected in f(x) or, said differently, the transformation f preserves all the juice despite being lossy.

Thus, when mutual information is used as a proxy for quantifying the amount of juice in a dataset, effective feature engineering neither decreases nor increases the amount of juice, which is fairly intuitive. Feature engineering merely turns inputs and/or outputs into a representation that makes it easier to train a specific machine learning model.
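
The invariance property is easy to check empirically. The sketch below is an illustration only; it uses scikit-learn's generic nearest-neighbor mutual information estimator (not the estimator discussed later in this article) to show that a lossless transformation of the input, such as tanh, leaves the estimated mutual information essentially unchanged up to estimation noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
n = 10_000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)  # y depends on x through a noisy linear relationship

# Mutual information estimated on the raw input...
mi_raw = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
# ... and on a one-to-one (lossless) transformation of the input.
mi_tanh = mutual_info_regression(np.tanh(x).reshape(-1, 1), y, random_state=0)[0]

print(f"I(y; x)       ~ {mi_raw:.3f} nats")
print(f"I(y; tanh(x)) ~ {mi_tanh:.3f} nats")  # approximately equal, as the theory predicts
```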

From Mutual Information to Maximum Achievable Performance

Although mutual information captures the essence of the amount of juice in a dataset, its values, typically expressed in bits or nats, hardly speak to a business analyst or decision-maker.

Fortunately, it can be used to calculate the highest performance that can be achieved when using x to predict y, for a variety of performance metrics (R², RMSE, classification accuracy, log-likelihood per sample, etc.), which in turn can be translated into business outcomes. We provide a brief summary below, but you can find out more here.

We consider a predictive model M with predictive probability distribution

p_M(y | x),

where f(x) is the model’s prediction for the output associated with inputs x. The model is a classifier when y is categorical and a regression model when y is continuous.

Maximum Achievable R²

In an additive regression model

y = f(x) + \epsilon,

the population version of the R² is defined as

R^2 := 1 - \frac{Var(\epsilon)}{Var(y)}.

The ratio in the formula above represents the fraction of variance of the output that cannot be explained using inputs, under our model.

Although variance is as good a measure of uncertainty as it gets for Gaussian distributions, it is a weak measure of uncertainty for other distributions, unlike the entropy.

Considering that the entropy (in nats) has the same unit/scale as the (natural) logarithm of the standard deviation, we generalize the R² as follows:

\bar{R}^2(M) := 1 - e^{-2 I(y; f(x))}.

Note that when (y, f(x)) is jointly Gaussian (e.g. Gaussian Process regression, including linear regression with Gaussian additive noise), the information-adjusted R² above is identical to the original R².

More generally, this information-adjusted R² applies to both regression and classification, with continuous inputs, categorical inputs, or a combination of both.

A direct application of the data processing inequality gives us the maximum R² any model predicting y with x can achieve:

\bar{R}^2(M) \leq 1 - e^{-2 I(y; x)}.

It is important to stress that this optimal R² is not just an upper bound; it is achieved by any model whose predictive distribution is the true (data generating) conditional distribution p(y|x).
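
As a sanity check of the formula above, the following sketch considers the jointly Gaussian case, where everything is available in closed form, and verifies that 1 - e^{-2 I(y; x)} coincides with the R² of the best possible predictor (an illustration of the formula, not kxy code).

```python
import numpy as np

rho = 0.8  # correlation between x and y in a bivariate Gaussian

# Closed-form mutual information (in nats) of a bivariate Gaussian with correlation rho.
mi = -0.5 * np.log(1.0 - rho ** 2)

# Maximum achievable R-squared implied by the mutual information.
r2_from_mi = 1.0 - np.exp(-2.0 * mi)

# For jointly Gaussian (y, x), the best predictor is E[y|x], whose R-squared is rho^2.
print(r2_from_mi, rho ** 2)  # both equal 0.64
```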

Minimum Achievable RMSE

The population version of the Root Mean Square Error of the model above reads

RMSE(M) := \sqrt{E\left[(y - f(x))^2\right]}.

In the same vein, we may define its information-adjusted generalization as

\overline{RMSE}(M) := \frac{e^{h(y) - I(y; f(x))}}{\sqrt{2 \pi e}}.

A direct application of the data processing inequality gives us the smallest RMSE any model using x to predict y can achieve:

\overline{RMSE}(M) \geq \frac{e^{h(y) - I(y; x)}}{\sqrt{2 \pi e}}.

Maximum Achievable True Log-Likelihood

Similarly, the sample log-likelihood per observation of our model can be defined as

\hat{LL}(M) := \frac{1}{n} \sum_{i=1}^{n} \log p_M(y_i | x_i).

Its population equivalent, which we refer to as the true log-likelihood per observation, namely

LL(M) := E\left[\log p_M(y | x)\right],

satisfies the inequality

LL(M) \leq -h(y) + I(y; x).

This inequality stems from the Gibbs and data processing inequalities. See here for more details.

Note that the inequality above holds true for both regression and classification problems, and that the upper bound is achieved by a model using as its predictive distribution the true conditional p(y|x).

The term -h(y) represents the best true log-likelihood per observation that one can achieve without any data, and can be regarded as a naive log-likelihood benchmark, while the mutual information term I(y; x) represents the boost that can be attributed to our data.
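
As a quick numerical illustration (a made-up example, not taken from any dataset used later in this article), the sketch below evaluates both terms of the bound for a binary outcome.

```python
import numpy as np

# Binary outcome with P(y = 1) = 0.3.
p = 0.3
h_y = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # entropy of y in nats, about 0.61

# Suppose the inputs carry I(y; x) = 0.2 nats of information about y.
mi = 0.2

naive_ll = -h_y      # best true log-likelihood per observation achievable without any data
best_ll = -h_y + mi  # best true log-likelihood per observation achievable with the data

print(f"naive: {naive_ll:.3f}, best achievable: {best_ll:.3f}")
```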

Maximum Achievable Classification Accuracy

In a classification problem where the output y can take up to q distinct values, it is also possible to express the highest classification accuracy that can be achieved by a model using x to predict y.

Let us consider the function

h_q(a) := -a \log a - (1-a) \log \frac{1-a}{q-1}, \quad a \in \left[\frac{1}{q}, 1\right],

which is the entropy of the distribution over q outcomes whose most likely outcome has probability a, the remaining probability mass being spread evenly over the other q-1 outcomes. For a given entropy value h, the best accuracy that can be achieved by predicting the outcome of any discrete distribution taking q distinct values and that has entropy h is given by

\bar{A}_q(h) := h_q^{-1}(h),

where h_q^{-1} denotes the inverse of the (decreasing) function h_q on [1/q, 1], and is easily evaluated numerically. You can find more details here.

The figure below provides an illustration of the function above for various q.

Fig. 1: Accuracy achievable in predicting the outcome of a discrete distribution with q possible outcomes.

More generally, the accuracy that can be achieved by a classification model M using x to predict a categorical output y taking q distinct values satisfies the inequality:

A(M) \leq \bar{A}_q\left(h(y) - I(y; x)\right).

The entropy term h(y) reflects the accuracy of the naive strategy consisting of always predicting the most frequent outcome, namely

\bar{A}_q\left(h(y)\right),

whereas the mutual information term I(y; x) accounts for the maximum amount of additional insights we can derive from our data.
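
Since the inverse involved in the accuracy bound is evaluated numerically, here is one way to do it with an off-the-shelf root finder (an illustration only, not the kxy implementation; the function names are mine).

```python
import numpy as np
from scipy.optimize import brentq

def h_q(a, q):
    """Entropy (in nats) of the distribution over q outcomes whose mode has probability a,
    the remaining mass being spread evenly over the other q - 1 outcomes."""
    if a >= 1.0:
        return 0.0
    return -a * np.log(a) - (1.0 - a) * np.log((1.0 - a) / (q - 1))

def best_accuracy(h, q):
    """Highest accuracy achievable when predicting the outcome of a discrete distribution
    over q outcomes whose entropy is h (in nats)."""
    if h <= 0.0:
        return 1.0
    # h_q is decreasing on [1/q, 1]; invert it by root finding.
    return brentq(lambda a: h_q(a, q) - h, 1.0 / q, 1.0 - 1e-12)

# Example: a 3-class problem where the conditional entropy h(y) - I(y; x) is 0.5 nats.
print(best_accuracy(0.5, q=3))  # about 0.86
```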

To sum up, the highest value that can be achieved by virtually any population-based performance metric in a classification or regression problem can be expressed as a function of the mutual information I(y; x) under the true data generating distribution and, whenever the naive (input-less) predictive strategy does not have a null performance, a measure of the variability of the output (such as its entropy h(y), variance, or standard deviation).

Model-Free Mutual Information Estimation

Finally, we can address the elephant in the room. Clearly it all boils down to estimating the mutual information I(y; x) under the true data generating distribution. However, we do not know the true joint distribution (y, x). If we knew it, we would have access to the best possible predictive model — the model with predictive distribution the true conditional p(y|x)!

Fortunately, we do not need to know or learn the true joint distribution (y, x); this is where the inductive reasoning approach we previously mentioned comes into play.

The inductive approach we adopt consists of measuring a wide enough range of properties of the data that indirectly reveal the structure/patterns therein, and inferring the mutual information that is consistent with the observed properties, without making any additional arbitrary assumptions. The more flexible the properties we observe empirically, the more structure we capture in the data, and the closer our estimate will be to the true mutual information.

To do so effectively, we rely on a few tricks.

Trick I: Working in the copula-uniform dual space.

First, we recall that the mutual information between y and x is invariant under one-to-one maps and, in particular, when y and x are ordinal, the mutual information between y and x is equal to the mutual information between their copula-uniform dual representations, defined by applying the probability integral transform to each coordinate.

Instead of estimating the mutual information between y and x directly, we estimate the mutual information between their copula-uniform dual representations; we refer to this as working in the copula-uniform dual space.

This allows us to completely bypass marginal distributions, and perform inference in a unit/scale/representation-free fashion — in the dual space, all marginal distributions are uniform on [0,1]!
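
In practice, the empirical copula-uniform representation of a sample can be obtained by rank-transforming each coordinate. The sketch below (generic SciPy code, not the kxy implementation) shows that the transformed marginals are approximately uniform on [0, 1], regardless of the original marginal distributions.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.RandomState(0)
n = 5_000
sample = np.column_stack([
    rng.lognormal(size=n),         # a heavily skewed marginal
    rng.standard_t(df=3, size=n),  # a fat-tailed marginal
])

# Empirical probability integral transform: map each observation to its normalized rank.
dual = rankdata(sample, axis=0) / (n + 1.0)

# Whatever the original marginals, the dual marginals are approximately uniform on [0, 1].
print(dual.mean(axis=0))  # close to 1/2
print(dual.var(axis=0))   # close to 1/12
```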

Trick II: Revealing patterns through pairwise Spearman rank correlations.

We reveal structure in our data by estimating all pairwise Spearman rank correlations in the primal space, defined for n observations of two ordinal scalars x and y as

\hat{\rho}_s(x, y) := 1 - \frac{6 \sum_{i=1}^{n} \left(rg(x_i) - rg(y_i)\right)^2}{n(n^2-1)},

where rg(x_i) denotes the rank of x_i among x_1, ..., x_n.

It measures the propensity of two variables to be monotonically related. Its population version is solely a function of the copula-uniform dual representations u and v of x and y, and reads

\rho_s(x, y) = 12\, E[u v] - 3 = 12\, \mathrm{Cov}(u, v).

In other words, using Spearman’s rank correlation, we may work in the dual space (a.k.a. trick I) while efficiently estimating properties of interest in the primal space.

Without this trick, we would need to estimate marginal CDFs and explicitly apply the probability integral transform to inputs to be able to work in the dual space, which would defeat the purpose of the first trick.

After all, given that the mutual information between two variables does not depend on their marginal distributions, it would be a shame to have to estimate marginal distributions to calculate it.
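
The sketch below (generic SciPy code) illustrates both points: the Spearman rank correlation computed in the primal space coincides with the Pearson correlation of the empirical copula-uniform representations, and it is unchanged by monotonic transformations of the marginals.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.RandomState(0)
n = 5_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Spearman rank correlation in the primal space...
rho_s, _ = spearmanr(x, y)
# ... equals the Pearson correlation in the (empirical) copula-uniform dual space...
rho_dual, _ = pearsonr(rankdata(x) / (n + 1.0), rankdata(y) / (n + 1.0))
# ... and is invariant under monotonic transformations of the marginals.
rho_transformed, _ = spearmanr(np.exp(x), y ** 3)

print(round(rho_s, 4), round(rho_dual, 4), round(rho_transformed, 4))  # all identical
```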

Trick III: Expanding the inputs space to capture non-monotonic patterns.

Pairwise Spearman rank correlations fully capture patterns of the type ‘the output decreases/increases with certain inputs’ for regression problems, and ‘we can tell whether a bit of the encoded output is 0 or 1 based on whether certain inputs take large or small values’ for classification problems.

To capture patterns beyond these types of monotonic associations, we need to resort to another trick. We note that, for any function f, injective or not, we have

I(y; x, f(x)) = I(y; x),

since the map from x to (x, f(x)) is one-to-one.

Thus, instead of estimating I(y; x), we may equivalently estimate I(y; x, f(x)), for any function f of our choosing.

While pairwise Spearman rank correlations between y and x reveal monotonic relationships between y and x, f can be chosen so that pairwise Spearman correlations between y and f(x) capture a range of additional non-monotonic relationships between y and x.

A good example of f is the function

f(x) = |x - m|,

where m can be chosen to be the sample mean, median, or mode.

Indeed, if y = x² for some random variable x that is symmetrically distributed about 0, then the Spearman rank correlation between y and x, which can be found to be 0 by symmetry, fails to reveal the structure in the data. On the other hand, the Spearman rank correlation between y and f(x) (with m=0), which is 1, better reflects how informative x is about y.

This choice of f captures patterns of the type ‘the output tends to decrease/increase as an input departs from a canonical value’. Many more types of patterns can be captured by using the same trick, including periodicity/seasonality, etc.
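
The y = x² example above is easy to verify numerically (a quick illustration with SciPy):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.RandomState(0)
x = rng.normal(size=10_000)  # symmetric about 0
y = x ** 2

rho_raw, _ = spearmanr(x, y)           # close to 0: no monotonic relationship between y and x
rho_feat, _ = spearmanr(np.abs(x), y)  # exactly 1: y is an increasing function of |x - 0|

print(round(rho_raw, 3), round(rho_feat, 3))
```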

Trick IV: Putting everything together using the maximum entropy principle as a way of avoiding arbitrary assumptions.

To summarize, we define z=(x, f(x)), and we estimate the Spearman rank auto-correlation matrix of the vector (y, z), namely S(y, z).

We then use as the density of the copula-uniform representation of (y, z) the copula density that, among all copula densities matching the patterns observed through the Spearman rank autocorrelation matrix S(y, z), has the highest entropy, i.e. the density that is maximally non-committal about any pattern other than those observed through S(y, z).

Assuming (y, z) is d-dimensional, the resulting variational optimization problem reads:

\max_{c} \; -\int_{[0,1]^d} c(u) \log c(u) \, du \quad \text{subject to} \quad 12 \int_{[0,1]^d} u_i u_j \, c(u) \, du - 3 = S_{ij}(y, z) \; \text{ for all } i \neq j,

where the maximization is over copula densities c on [0,1]^d, i.e. densities whose marginals are all uniform on [0, 1].

We then use the learned joint pdf to estimate the needed mutual information as

\hat{I}(y; x) = \hat{I}(y; z) = h(u_z) - h(u_y, u_z),

where u_y and u_z denote the copula-uniform representations of y and z, the entropies are evaluated under the learned maximum-entropy copula density, and we use the fact that h(u_y) = 0 because u_y is uniform on [0, 1].

Making Sense of The Maximum-Entropy Variational Problem

In the absence of the Spearman rank correlation constraints, the solution to the maximum-entropy problem above is the pdf of the standard uniform distribution, which corresponds to assuming that y and x are statistically independent, and have 0 mutual information. This makes intuitive sense, as we have no reason to believe x is informative about y until we gather empirical evidence.

When we observe S(y, z), the new solution to the variational maximum-entropy problem deviates from the uniform distribution just enough to reflect the patterns captured by S(y, z). Hence, our approach should not be expected to overestimate the true mutual information I(y; x).

Additionally, so long as S(y, z) is expressive enough, which we can control through the choice of the function f, all types of patterns will be reflected in S(y, z), and our estimated mutual information should not be expected to underestimate the true mutual information I(y; x).
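
To convey the spirit of the ‘from rank correlations to mutual information’ pipeline in a few lines, here is a drastically simplified stand-in (my own illustration, not the kxy estimator): in the bivariate case, posit a Gaussian copula, calibrate its parameter to the observed Spearman rank correlation using the classical relation between the two, and read off the implied mutual information. The full approach solves the variational problem above instead, over all inputs and engineered features jointly.

```python
import numpy as np
from scipy.stats import spearmanr

def gaussian_copula_mi_from_spearman(rho_s):
    """Mutual information (in nats) of a bivariate Gaussian copula whose parameter is
    calibrated to the observed Spearman rank correlation rho_s."""
    r = 2.0 * np.sin(np.pi * rho_s / 6.0)  # classical Spearman-to-Pearson relation for Gaussian copulas
    return -0.5 * np.log(1.0 - r ** 2)

rng = np.random.RandomState(0)
n = 10_000
x = rng.normal(size=n)
y = np.exp(x) + 0.5 * rng.normal(size=n)  # nonlinear but monotonic relationship

rho_s, _ = spearmanr(x, y)
print(f"Spearman rho: {rho_s:.3f}, implied I(y; x): {gaussian_copula_mi_from_spearman(rho_s):.3f} nats")
```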

Application To An Ongoing Kaggle Competition

We built the KxY platform, and the accompanying open-source python package to help organizations of all sizes cut down the risk and cost of their AI projects by focusing on high ROI projects and experiments. Of particular interest is the implementation of the approach described in this post to quantify the amount of juice in your data.

The kxy package can be installed from PyPI (pip install kxy) or GitHub, and is also accessible through our pre-configured Docker image on DockerHub — for more details on how to get started, read this. Once installed, the kxy package requires an API key to run. You can request one by filling out our contact form, or by emailing us at demo@kxy.ai.

We use the House Price Advanced Regression Techniques Kaggle competition as an example. The problem consists of predicting house sale prices using a comprehensive list of 79 explanatory variables of which 43 are categorical and 36 are ordinal.

We find that when the one-hot encoding method is used to represent categorical variables, a near-perfect prediction can be achieved.

Code snippet and output of the achievable performance analysis on the house price advanced regression Kaggle competition.

In fact, when we select explanatory variables one at a time in a greedy fashion, always choosing the variable that generates the highest amount of incremental juice among the ones that haven’t yet been selected, we find that with only 17 out of the 79 variables, we can achieve near-perfect prediction — in the R² sense.

Code snippet for the greedy variable selection analysis.
Screenshot of the Kaggle leaderboard at the time of writing this post.

The figure below illustrates the result of the full variable selection analysis. Interestingly, the top of the current Kaggle leaderboard has managed to generate an RMSE of 0.00044, which is somewhere between the best that can be done using the top 15 variables, and the best that can be done using the top 16 variables.

Tab. 1: Result of the KxY model-free variable selection analysis on the Kaggle House Prices Advanced Regression Techniques competition.

You’ll find the code used to generate the results above as a Jupyter notebook here.
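
For readers who want to experiment with the greedy, incremental-information idea without the kxy package, here is a rough generic sketch. It uses a Gaussian-copula approximation calibrated to the Spearman rank correlation matrix as a stand-in for the model-free estimator described in this article, so the numbers it produces will differ from the ones above; the file path and column names in the commented usage are those of the Kaggle competition.

```python
import numpy as np
import pandas as pd

def gaussian_copula_mi(df, target, features):
    """Rough estimate of I(target; features) in nats, under a Gaussian copula calibrated
    to the empirical Spearman rank correlation matrix."""
    if not features:
        return 0.0
    rho_s = df[[target] + features].corr(method='spearman').values
    c = 2.0 * np.sin(np.pi * rho_s / 6.0)  # implied Pearson correlation matrix
    np.fill_diagonal(c, 1.0)
    _, logdet_full = np.linalg.slogdet(c)
    _, logdet_inputs = np.linalg.slogdet(c[1:, 1:])
    return -0.5 * (logdet_full - logdet_inputs)

def greedy_selection(df, target, candidates, k=10):
    """Greedily add, at each step, the candidate variable with the largest incremental
    estimated mutual information with the target."""
    selected = []
    for _ in range(min(k, len(candidates))):
        current = gaussian_copula_mi(df, target, selected)
        gains = {col: gaussian_copula_mi(df, target, selected + [col]) - current
                 for col in candidates if col not in selected}
        best = max(gains, key=gains.get)
        selected.append(best)
        print(f"+ {best}: incremental MI {gains[best]:.3f} nats")
    return selected

# Hypothetical usage on the competition's training file (numeric columns only, for simplicity):
# df = pd.read_csv('train.csv')
# numeric = [c for c in df.columns if c not in ('Id', 'SalePrice') and df[c].dtype != object]
# greedy_selection(df, 'SalePrice', numeric, k=10)
```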

Footnotes:

[¹] It is assumed that observations (inputs, output) can be regarded as independent draws from the same random variable (y, x). In particular, the problem should either not exhibit any temporal dependency, or observations (inputs, output) should be regarded as samples from a stationary and ergodic time series, in which case the sample should be long enough to span several multiples of the system’s memory.

[²] The requirements of the previous footnote have very practical implications. The whole inference pipeline relies on the accurate estimation of the Spearman rank correlation matrix S(y, z). When observations can be regarded as i.i.d. samples, reliable estimation of S(y, z) only requires a small sample size, and the estimate is indicative of the true latent phenomenon. On the other hand, when observations exhibit temporal dependency and estimating S(y, z) on disjoint subsets of the data yields very different values, then either the time series is not stationary and ergodic, or it is stationary and ergodic but we do not have a long enough history to characterize S(y, z); either way, the estimated S(y, z) will not accurately characterize the true latent phenomenon, and the analysis should not be applied.
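
A minimal sketch of the stability check described above (generic pandas code; df stands for your own observation matrix):

```python
import numpy as np
import pandas as pd  # df below is assumed to be a pandas DataFrame

def spearman_stability(df, n_chunks=3):
    """Estimate the Spearman rank correlation matrix on disjoint, contiguous chunks of the
    data and return the largest entry-wise discrepancy between any two chunks."""
    size = len(df) // n_chunks
    mats = [df.iloc[i * size:(i + 1) * size].corr(method='spearman').values
            for i in range(n_chunks)]
    return max(np.nanmax(np.abs(a - b)) for a in mats for b in mats)

# Large values (relative to estimation noise) suggest the analysis should not be applied.
# print(spearman_stability(df))
```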

--


Yves-Laurent Kom Samo, PhD
The Principled Machine Learning Researcher

Founder & CEO @kxytechnologies | Prev: PhD Fellow in ML @GoogleAI | @ycombinator Alumn | PhD in ML @UniofOxford | Quant @GS. New Blog: https://blog.kxy.ai