The Data Processing Inequality

If you look at the wikipedia article for the data processing inequality, it’s really just a stub (as of the time this article was published). The inequality is given, but there is little context. The data processing inequality is fundamental to data science, machine learning, and social science.

Lately, I’ve been blogging almost exclusively about causality. I’m about to go deep into causal inference, but needed to lay one more piece of groundwork before I do. There is a deep problem with how we encode the “real world” into a form that a computer can understand. The implications go far beyond limiting the predictive power of a machine learning model. Our representations of data can limit our ability to infer causal relationships. To understand this fully, you need to understand the data processing inequality.

The context

Suppose you have some event, X, that takes place in the real world. Say, X = the entire experience of you going to the store to buy groceries. It’s your state of mind, your grocery list, and the full context of the world around you. X is an event, but it’s not an event that has been encoded to do machine learning with. We haven’t even necessarily recorded any data at all about X. There are many things we could record — your heart rate; what you buy; the number of cars you pass on the street; whether you say “hi” to your neighbor.

At this point, X isn’t encoded at all. We don’t even have data for X. It’s just a thing that happened. We can try to represent it in words, but words fail to represent the whole reality. Suppose now that we keep a diary, and maybe tend to mention when we go out for groceries in our diary. We lose an immeasureable amount of information about our day by summarizing it into a diary entry, Y, but we can hope that it contains some information about going to the store. We now have some data about X, but it isn’t encoded very well for machine learning.

Now, suppose we want to predict something about the world. For example, we may want to predict our total spending on that day, Z=”the amount I spent today”. We can only do this using the data that we’ve collected, Y. We can encode Y as a binary variable, Y’ (“y-prime”), recording whether or not we mention going to the store. We can look at it every day, and see how much it says about our spending.

If we try to do this prediction problem, we’ll find that Y’ might do a decent job at predicting Z. Maybe it explains 50% of my spending on any given day! Still, it’s missing a lot of information that it could contain. What was on sale? Had you gone shopping earlier that week? Were you hungry when you went to the store? All of these can help explain how much I spent on that trip (not just whether there was a trip), and even help explain how much I spent at other stores on the same day.

The fact that we can enumerate these factors is promising. It says the more effort we put into recording and encoding our data, the better we can do at predicting Z. As we put more and more effort in, hopefully we can asymptote to the correct value of Z.

The inequality

In this sequence, we have some event in the real world, X, which we encode by summarizing and writing down, Y. We’ve lost a lot of information about X. Next, we encode this summary for machine-learning purposes as Y’, again losing a lot of information about Y (and therefore X). Finally, we hope that Y’ contains some information for predicting Z. You can write the data processing part from X to Y’ as a graphical model,

Data processing

where Y contains all of the information about X relevant for determining Y’. Y’ can’t contain any more information about X than Y does. You can write this nicely. If I(X,Y) is the information in common between X and Y, then you can write the data processing inequality as (from Elements of Information Theory, a great book!)

Elements of Information Theory, 2nd Ed., p. 34

What this is saying is that you can really only lose information by processing data. The information that Y’, our machine learning encoding, contains about X is less than the information that our raw data, Y, contains about X. You could extend this back and say that our raw data has also lost information about X. There is much more that we could have written down.

In the best case, you keep the same amount of information (you have equality). There’s no clever feature engineering you can do to get around this:This is true for any data processing process that doesn’t involve incorporating external information (so there aren’t other variables pointing into the Y or Z, and you indeed have a Markov chain).

Implications for causal inference

I mentioned that this problem is a fundamental limitation for causal inference. As we’ll learn deeply in the next few posts, causal inference relies on testing independence relationships [Edit: to be very precise, I’m talking about observational causal inference of causal graphs]. In particular, it needs to test conditional independence relationships. That means that if you have a Markov chain like A → B → C, you need to be able to condition on B and show that A and C become independent. Intuitively, what this means is that you need to be able to show that B contains all of the information about A that is relevant for determining the value of C. This should ring some bells from the previous section.

What happens if, when encoding B to B’, you’ve lost some information? This will almost surely happen in a typical machine learning problem. In that case, B’ no longer “blocks” A and C, and you can no longer determine that they’re conditionally independent of each other! If B’ contained all of the information about B relevant for determining C, then you could still block B and C. This can happen, for example, if B’ is calculated from some invertible mapping, B’ = f(B).

A causal chain, A->B->C, and a machine learning encoding B -> B’

Is there a way we can make B’ contain as much information as possible about B? Can we do this in a way that lets us do a good job of measuring these probability distributions? (we’d really like B’ to be relatively low-dimensional!)

If we can’t get B’ to contain all of the information about B relevant for determining C, then we can’t show A and C are independent given B. Unfortunately, inferring causal graphs depends on testing this kind of independence relationships. Even more unfortunately, you can’t test independence relationships without encoding your data!

Optimal representations?

It might sound like the situation is hopeless, but I’m not convinced that that’s true. Consider when the reality is limited to a specific, very measurable context. In video games, for example, the entire set of user actions can be used to learn about the user. What information might it contain about their thought processes? It could be enough information to predict their response to future stimulus within the context of the game. The distance between the data and the encoding is very small in this context.

There are still some real problems with measuring independence relationships even in this context. Consider that there are a large number of actions you can perform, and these define your “action history,” H. The problem amounts to estimating the odds you’ll perform a new action, A, in a new context, C, given your history, H, or P(A|C,H). The action could be binary, so coding that is simple. The context could be binary as well, so that might also be simple. This history, unfortunately, can be extremely high-dimsional. Even if it consists of only N binary past actions, it can be as many as 2^N-dimensional. You can’t estimate this density for any reasonably large history.

Fortunately, we have new technology that might handle this problem. Histories might be “close” to each other, and so it might make sense to embed them in some vector space. Is there a reasonable way to do this? Neural networks do this exactly, by mapping extremely high-dimensional inputs into lower-dimensional vector spaces. Tensorflow has a really nice introduction to vector space embeddings, focusing on embedding words. Personally, I’ve had some good success embedding sentences using RNNs. Googling around, you can find some really nice resources for more on embeddings.

I’m too much of an amateur to know if some work has been done to see if these embeddings are optimal, but what you would hope is that the embedding of the history contains all of the information about the history that’s relevant for determining the future action.

Psychometrics

This also has some implications for the way we develop psychometrics. In classical test theory, we’re satisfied if a psychometric test (e.g. an IQ test) has low expected variation in different instances of a person taking the test, and if the test score is a reasonable predictor of other variables (e.g. future income).

How do we know that we’ve developed a test that captures as much relevant information as possible? If we want to measure the effect of intelligence on future earnings, it’s clear that the problem is something like “real intelligence” → “IQ” → “earnings”. The data processing inequality applies. Without optimizing the IQ measurement for the prediction problem, we’ve almost certainly lost some information about intelligence that’s relevant for determining future income.

Could there be a way to use “blocking” to develop psychometric instruments? I’m not a psychologist — maybe there already is! If any of you know, I’d love to hear about it in the comments!! [Edit: Stephen commented with some nice resources! Naturally, there is some literature on the subject, including his dissertation!]