A primer on causal emergence

This post is inspired by the physicist and blogger Scott Aaronson, who recently blogged his criticisms of a theory I’ve been working on, called causal emergence. To see the nature of his error, skip down to “Isn’t causal emergence just an issue of normalization?”, although that section assumes you are familiar with some of the theory’s terminology. Since Scott’s criticisms seemed to reflect a misunderstanding of the theory, they prompted me to write this generalized explainer. Please note this explainer is purposefully designed not to be technical, formalized, or comprehensive. Its goal is to give interested parties a conceptual grasp of the theory, using relatively basic notions of causation and information.

What’s causal emergence?
It’s when the higher scale of a system has more information associated with its causal structure than the underlying lower scale. Causal structure just refers to a set of causal relationships between some variables, such as states or mechanisms. Measuring causal emergence is like looking at the causal structure of a system through a camera (the theory): as you focus the camera (look at different scales), the causal structure snaps into focus. Notably, it doesn’t have to be “in focus” at the lowest possible scale, the microscale. Why is this? In something approaching plain English: macrostates can be strongly coupled even while their underlying microstates are only weakly coupled. The goal of the theory is to search across scales until the scale at which variables (like elements or states) are most strongly causally coupled pops out.

Isn’t this against science, reductionism, or physicalism?
Nope. The theory adheres to something called supervenience. That’s a technical term that means: if everything is fixed about the lower scale of a system, everything about the higher scale must follow. So there’s nothing spooky, supernatural, or anti-physicalist about the results. Rather, the theory provides a toolkit to identify appropriate instances of causal emergence or reduction depending on the properties of the system under consideration. It just means that, when thinking about causation, reductionism isn’t always best. The higher scales the theory considers are things like coarse-grains (groupings of states or mechanisms) or leaving states or elements out of the system, among others. These aren’t supernatural, just different levels of description, some of which capture the real causal structure (the coupling between variables) better. In this sense, the theory says that causal interpretations are not relative or arbitrary but are instead constrained by how strongly the variables actually constrain one another.

How do you analyze causal structure?
Causation has long been considered a philosophical subject, even though it’s at the heart of science in the form of experiments and separating correlation from causation. Causation, much like information, can actually be formalized abstractly and mathematically. For instance, in the 90s and 2000s, a researcher named Judea Pearl introduced something called the do(x) operator. The idea is to formalize causation by modeling the interventions an experimenter makes on a system. Let’s say I want to check if there is a causal relationship between a light switch and a light bulb in a room. Formally, I would do(light switch = up) at some time t, and observe the effects on the bulb at some time t+1.

One of the fundamental ways of analyzing causation is what’s called an A/B test, or a randomized trial. For two variables A and B, you randomize those variables and observe the outcome of your experiment. Think of it like injecting noise into the experiment, which then tells you which of those two variables is more effective at producing the outcome. For example, let’s say the light bulb flickers into the {off} state while the light switch is in the {up} state 20% of the time. If you do(light switch = up) and do(light switch = down) in a 50/50 manner, it reveals the effects of the states. From this, you can construct something (using Bayes’ theorem) called a transition table:
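As a rough sketch, the table for the flickering switch/bulb can be written out like this. The {up} row comes straight from the example (on 80% of the time, off 20%). The {down} row is an assumption on my part for this illustration: I take it that a switch that is {down} always leaves the bulb {off}, which is consistent with the 0.61 bits quoted for this system further down. The names `transition_table` and `format_table` are purely illustrative.

```python
# Hypothetical reconstruction of the flickering switch/bulb transition table.
# Rows: the intervention do(switch = state) at time t.
# Columns: the bulb state observed at time t+1.
# The "up" row is from the text; the "down" row is an assumption.
transition_table = {
    "up":   {"on": 0.8, "off": 0.2},
    "down": {"on": 0.0, "off": 1.0},
}


def format_table(table):
    """Return the table as aligned plain-text rows."""
    effects = list(next(iter(table.values())))
    lines = ["switch  " + "  ".join(f"{e:>5}" for e in effects)]
    for cause, row in table.items():
        cells = "  ".join(f"{row[e]:5.2f}" for e in effects)
        lines.append(f"{cause:<6}  {cells}")
    return "\n".join(lines)


print(format_table(transition_table))
```

Each row is a conditional probability distribution, so each row sums to 1.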

Note that this reflects the actual causal structure. Flipping the switch {up} really does cause the bulb to turn {on} 80% of the time. Doing the A/B test appropriately screened out everything except the conditional probabilities between the states. For example, it screened out how often you happened to flip the switch. And, ultimately, causal relationships are conditional. They aren’t about the probabilities of the states themselves, but about “if x then y” classes of statements.
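To make the “screening out” concrete, here is a small simulation sketch. It assumes a hypothetical bulb that turns {on} 80% of the time when the switch is {up} and never when it is {down}. The function `run_ab_test` and its parameters are made up for illustration. It estimates the conditional probabilities by simple counting, once with a 50/50 randomization and once with a lopsided 90/10 one.

```python
import random


def run_ab_test(p_up, trials, rng):
    """Randomly set the switch, record the bulb, and return the estimated
    conditional frequencies P(bulb | do(switch))."""
    counts = {"up": {"on": 0, "off": 0}, "down": {"on": 0, "off": 0}}
    for _ in range(trials):
        switch = "up" if rng.random() < p_up else "down"
        # Hypothetical bulb: on 80% of the time when up, always off when down.
        if switch == "up" and rng.random() < 0.8:
            bulb = "on"
        else:
            bulb = "off"
        counts[switch][bulb] += 1
    return {
        s: {b: c / sum(row.values()) for b, c in row.items()}
        for s, row in counts.items()
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
fair = run_ab_test(0.5, 20000, rng)
skewed = run_ab_test(0.9, 20000, rng)
print(fair)
print(skewed)
```

Both runs should land close to the same table, because the rows are conditioned on the intervention and so do not depend on how often each switch position was tried.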

Of course, we can also do A/B/C tests, and so on. What matters is randomizing over everything (creating an independent noise source) so that the result exposes the conditional probabilities between the states. The theory of causal emergence formalizes this as applying an intervention distribution: a probability distribution of do(x) operators. The intervention distribution that corresponds to an A/B test would be [1/2 1/2], and if A/B/C, then [1/3 1/3 1/3]. This is called a maximum entropy, or uniform, distribution.
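A minimal sketch of an intervention distribution, under the assumption that it can be represented as a plain mapping from each do(x) to its probability. The helper names are hypothetical.

```python
import math


def uniform_intervention_distribution(states):
    """Give every do(x) the same probability, 1/n."""
    n = len(states)
    return {x: 1.0 / n for x in states}


def entropy_bits(dist):
    """Shannon entropy, in bits, of a mapping from state to probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


ab = uniform_intervention_distribution(["A", "B"])
abc = uniform_intervention_distribution(["A", "B", "C"])
print(ab, entropy_bits(ab))
print(abc, entropy_bits(abc))
```

The entropy of a uniform distribution over n interventions is log2(n) bits, the largest possible for n options, which is why it is called maximum entropy.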

How does information theory relate to causal structure?
Consider two variables, X and Y. We want to assess the causal influence X has over Y, X → Y. Assume, for now, there are no other effects on Y. If every change in the state of X is followed by a change in the state of Y, then the state of X contains a lot of causal information about Y. So if Y is very sensitive to the state of X, a metric of causal influence should be high. Note this is different from predictive information. You might be able to predict Y given X even if changes in X don’t lead to changes in Y (like how sales of swimwear in June could predict sales of air conditioners in July).

To be more formal about assessing X → Y, we inject noise into X and observe the effects on Y. Effective information, or EI, is the mutual information I(X;Y) between X and Y while intervening to set X to maximum entropy (inject noise). Note that this is the same as applying a uniform intervention distribution over the states of X.
This is a measure of what can be called influence, or coupling, or constraint. Beyond capturing in an intuitive way how Y is causally coupled to the state of X, here are a few additional reasons that the metric is appropriate: i) setting X to maximum entropy screens off everything but the conditional probabilities, ii) it’s like doing an experimental randomized trial, or A/B test, without prior knowledge of the effects on Y of X’s states, iii) it doesn’t leave anything out, so if a great many states of X don’t impact Y, this will be reflected in the causal influence of X on Y, iv) it’s like injecting the maximum amount of experimental information into X, Hmax(X), in order to see how much of that information is reflected in Y, v) the metric can be derived from the cause/effect information of each state y → X or x → Y, such as the expected number of Y/N questions it takes to identify the intervention on X at t given some y at t+1, vi) it isolates the information solely contained in the transition table (the actual causal structure), and vii) the metric is provably grounded in traditional notions of causal influence.
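The definition can be sketched in a few lines. This is an illustration rather than a reference implementation. It assumes the transition table is given as a list of rows, where row i is the distribution over Y after do(X = x_i). It computes EI as H(Y) minus H(Y|X) with X set uniformly. The example table is hypothetical.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def effective_information(tpm):
    """EI in bits: I(X;Y) when X is set to the uniform intervention distribution.

    tpm[i][j] is P(Y = y_j | do(X = x_i)).
    """
    n = len(tpm)
    # Distribution of effects on Y when every do(x) is applied equally often.
    effect = [sum(row[j] for row in tpm) / n for j in range(len(tpm[0]))]
    # Average leftover uncertainty about Y once the intervention is known.
    noise = sum(entropy_bits(row) for row in tpm) / n
    return entropy_bits(effect) - noise


# Hypothetical X with three states, two of which have the same effect on Y.
converging = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]
ei = effective_information(converging)
print(round(ei, 3), "bits out of a possible", round(math.log2(3), 3))
```

Although 1.585 bits of noise are injected into X, only about 0.918 bits show up in Y, because two states of X are interchangeable from the perspective of Y. That is point iii) above in action.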

Ultimately, the metric is using information theory to track the counterfactual dependence of Y on X. In traditional causal terminology this is putting a bit value on notions like how necessary and sufficient the state of X is for the state of Y. The theory generalizes these properties as determinism (the lack of noise) and degeneracy (amount of convergence) over the state transitions, and proves that EI actually decomposes into these properties. EI is low if states of X only weakly determine the states of Y, or if many states of X determine the same states of Y (as those states are unnecessary from the perspective of Y). It is maximal only if all the causal relationships in X → Y are composed of biconditional logical relationships (iff x then y).
Another way to think about it is that EI captures how much difference each possible do(x) makes. In causal relationships where all states transition to the same state, no state makes a difference, so the EI is zero. If all interventions lead to completely random effects, the measure is also zero. The measure is maximal (equal to the logarithm of the number of states) if each intervention has a unique effect (i.e., interventions on X make the maximal difference to Y).
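Here is a sketch of the decomposition and of the three extreme cases just described. One assumption is worth flagging: I write determinism and degeneracy as coefficients between 0 and 1, normalized by log2(n), so that EI = (determinism - degeneracy) × log2(n). Treat that normalization as one convenient way to express the idea. The function name `decompose` and the example tables are hypothetical.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def decompose(tpm):
    """Return (determinism, degeneracy, EI) for a square n-by-n transition table."""
    n = len(tpm)
    size = math.log2(n)
    effect = [sum(row[j] for row in tpm) / n for j in range(n)]
    # Determinism: 1 when every intervention has a certain effect, 0 when pure noise.
    determinism = 1.0 - (sum(entropy_bits(row) for row in tpm) / n) / size
    # Degeneracy: 1 when all interventions converge on one effect, 0 when none do.
    degeneracy = 1.0 - entropy_bits(effect) / size
    return determinism, degeneracy, (determinism - degeneracy) * size


unique = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
converge = [[1.0, 0.0, 0.0, 0.0]] * 4   # every state transitions to the same state
noisy = [[0.25, 0.25, 0.25, 0.25]] * 4  # every intervention has a random effect

for name, table in [("unique", unique), ("converge", converge), ("noisy", noisy)]:
    det, deg, ei = decompose(table)
    print(name, round(det, 3), round(deg, 3), round(ei, 3))
```

The unique-effects table reaches the maximum of log2(4) = 2 bits, while the other two give zero for opposite reasons: one is fully degenerate, the other fully indeterminate.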

How is the metric applied to systems?
Consider a simple switch/bulb system where the light doesn’t flicker. The relationship between switch and bulb can be represented by the transition table:

In this system the causal structure is totally deterministic (there’s no noise). It’s also non-degenerate (all states transition to unique states). So the switch being {up} is both sufficient and necessary for the bulb being {on}. Correspondingly, the EI is 1 bit. However, for the previous case where the light flickered, the EI would be lower, at 0.61 bits.

Effective information even captures things traditional, non-information-theoretic measures of causation don’t capture. For instance, let’s say that we instead analyze a system of a light dial (with 256 states) and a light bulb with 256 states of luminance. Both the determinism (1) and degeneracy (0) are identical to the original binary switch/bulb system. But the causal structure overall contains a lot more information: each state leads to exactly one other state and there are hundreds of states. EI in this system is 8 bits, instead of 1 bit.
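The three numbers above can be checked with a short sketch. The deterministic switch and the dial are written as identity tables (each state leads to exactly one matching state). For the flickering switch I assume that {down} always leaves the bulb {off}, which the text does not state outright.

```python
import math


def effective_information(tpm):
    """EI in bits: mutual information between X and Y under uniform do(x)."""
    n = len(tpm)
    effect = [sum(row[j] for row in tpm) / n for j in range(len(tpm[0]))]
    ei = 0.0
    for row in tpm:
        for p, q in zip(row, effect):
            if p > 0:
                ei += (1.0 / n) * p * math.log2(p / q)
    return ei


switch = [[1.0, 0.0], [0.0, 1.0]]   # up -> on, down -> off, no noise
flicker = [[0.8, 0.2], [0.0, 1.0]]  # up -> on 80%; the down row is assumed
dial = [[1.0 if i == j else 0.0 for j in range(256)] for i in range(256)]

print(round(effective_information(switch), 2))   # 1.0
print(round(effective_information(flicker), 2))  # 0.61
print(round(effective_information(dial), 2))     # 8.0
```

The dial is exactly as deterministic and non-degenerate as the binary switch. The extra bits come purely from the size of its repertoire of states.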

The key is that instead of thinking about causal structure in terms of just X → Y, we can instead ask about the causal structure of the system as a whole, S → S. This is like thinking of the entire system as a channel that is transforming the past into the future.

Couldn’t you use some other measure or numerical value?
The goal of EI is to capture how much information is associated with the causal structure of the system. But doing so doesn’t prove it’s the one and only true measure. There could be a family of similar metrics, although my guess is that most break down into the EI, or close variants. Regardless, given its relationship to the mutual information as well as important causal properties, this isn’t some arbitrary metric that’s swappable with something entirely different.

For instance, EI fits well with other fundamental information theory concepts, like the channel capacity. The channel capacity is the upper bound of information that can be reliably sent over a channel. In the context of the theory, the channel capacity is how much information someone can possibly send over the X → Y relationship by varying the state of X according to any intervention distribution. The channel capacity does end up having an important connection to causal structure. However, it’s not the same as a direct metric of X’s causal influence on Y. For instance, knowing the channel capacity doesn’t tell you about the determinism and degeneracy of the causal relationships of the states, nor does it tell you if interventions on X will produce reliable effects on Y, nor how sensitive the states of Y are to the states of X. With that said, one of the interesting conclusions of the research is that by looking at higher scales EI can approach or be equal to the channel capacity.
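To see the difference between EI and channel capacity in a toy case, here is a sketch that brute-forces the capacity of the flickering switch/bulb relationship by trying many intervention distributions. The grid search is a crude stand-in for a proper capacity algorithm and only works for two inputs. As before, the {down} row of the table is an assumption.

```python
import math


def mutual_information(input_dist, tpm):
    """I(X;Y) in bits when X is set according to input_dist."""
    n_out = len(tpm[0])
    effect = [sum(px * row[j] for px, row in zip(input_dist, tpm))
              for j in range(n_out)]
    mi = 0.0
    for px, row in zip(input_dist, tpm):
        for p, q in zip(row, effect):
            if px > 0 and p > 0:
                mi += px * p * math.log2(p / q)
    return mi


flicker = [[0.8, 0.2], [0.0, 1.0]]  # the down row is assumed

# EI: the uniform intervention distribution.
ei = mutual_information([0.5, 0.5], flicker)

# Capacity: the best any intervention distribution can do (grid of 1001 candidates).
capacity = max(
    mutual_information([k / 1000, 1 - k / 1000], flicker) for k in range(1001)
)
print(round(ei, 3), round(capacity, 3))
```

In this toy case the capacity comes out slightly above the EI: a skewed input distribution squeezes a little more information through the channel than the uniform one does.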

What does any of this have to do with emergence?
It’s about the emergence of higher-scale causal structure. To see if this is happening in a system, we do causal analysis across scales and measure the effective information at those different scales. What counts as a macroscale? Broadly, any description of a system that’s not the most detailed microscale. Leaving some states exogenous, coarse-grains (grouping states/elements), black boxes (having states/elements be exogenous when they are downstream of interventions), setting some initial state or boundary condition: all these are macroscales in the broad sense. Moving from the microscale to a macroscale might look something like this:

Interestingly, macroscales can have higher EI than the microscale. Basically, in some systems, doing a full series of A/B tests at the macroscale gives you more information than doing a corresponding full series of A/B tests at the microscale. More generally, you can think about it as how informative a transition probability matrix (TPM, another name for the transition table) of the system is, and how that TPM gets more informative at higher scales.
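A toy illustration, with made-up numbers: a hypothetical four-state microscale in which three states hop randomly among themselves and the fourth stays put. Grouping the three noisy states into one macrostate yields a two-state macroscale. The `coarse_grain` helper assumes microstates inside a group are weighted equally, which is an assumption of this sketch.

```python
import math


def effective_information(tpm):
    """EI in bits: mutual information between X and Y under uniform do(x)."""
    n = len(tpm)
    effect = [sum(row[j] for row in tpm) / n for j in range(len(tpm[0]))]
    return sum(
        (1.0 / n) * p * math.log2(p / q)
        for row in tpm
        for p, q in zip(row, effect)
        if p > 0
    )


def coarse_grain(tpm, groups):
    """Build a macroscale table by grouping microstates (equal weight within a group)."""
    macro = []
    for source in groups:
        row = []
        for target in groups:
            total = sum(tpm[i][j] for i in source for j in target)
            row.append(total / len(source))
        macro.append(row)
    return macro


third = 1.0 / 3.0
micro = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
macro = coarse_grain(micro, [[0, 1, 2], [3]])

ei_micro = effective_information(micro)
ei_macro = effective_information(macro)
print(round(ei_micro, 3), round(ei_macro, 3))
```

The microscale comes out at roughly 0.81 bits and the macroscale at 1 bit, so in this toy system the coarser description carries more information about the causal structure.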

Wait. How is that even possible?
There are multiple answers. In a general sense, causal structure is scale-variant. Microscale mechanisms (like NOR gates in a computer) can form a different macroscale mechanism (like a COPY gate). This is because the conditional probabilities of state-transitions change across scales. Consequently, the determinism can increase and the degeneracy can decrease at the higher scale (the causal relationships can be stronger).

Another answer is from information theory. Higher-scale relationships can have more information because they are performing error-correction. As Shannon demonstrated, you can increase how much information is transmitted across a channel by changing the input distribution. The analogy is that intervening on a system at different scales is like trying different inputs into a channel. From the perspective of the microscale, some higher-scale distributions will transmit more information. This is because doing a series of A/B tests to capture the effects of the macroscale states doesn’t correspond to doing a series of A/B tests to capture the effects of microscale states. A randomized trial at the macroscale of medical treatments to see their effect on tumors won’t correspond to an underlying set of microscale randomized trials, because many different microstates make up the macrostates.
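The channel analogy can be sketched with a hypothetical four-state system in which three states hop randomly among themselves and one stays put. Assume the macroscale groups the three noisy states into one macrostate, and that a macro-level A/B test spreads each macro-intervention evenly over the microstates inside it. Seen from the microscale, that is just a different, non-uniform input distribution into the same channel.

```python
import math


def mutual_information(input_dist, tpm):
    """I(X;Y) in bits when X is set according to input_dist."""
    n_out = len(tpm[0])
    effect = [sum(px * row[j] for px, row in zip(input_dist, tpm))
              for j in range(n_out)]
    return sum(
        px * p * math.log2(p / q)
        for px, row in zip(input_dist, tpm)
        for p, q in zip(row, effect)
        if px > 0 and p > 0
    )


third = 1.0 / 3.0
micro = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# A microscale A/B/C/D test: every microstate equally often.
micro_uniform = [0.25, 0.25, 0.25, 0.25]
# A macroscale A/B test, expressed at the microscale: each macrostate gets 1/2,
# split evenly among its microstates (an assumption of this sketch).
macro_as_micro = [1 / 6, 1 / 6, 1 / 6, 1 / 2]

mi_uniform = mutual_information(micro_uniform, micro)
mi_macro = mutual_information(macro_as_micro, micro)
print(round(mi_uniform, 3), round(mi_macro, 3))
```

The second input distribution pushes more information through the very same microscale channel, which is the sense in which the higher scale behaves like a better encoding.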

Isn’t causal emergence just an issue of normalization, as Scott claimed in his blog post?
Not at all. Scott’s criticism was about how the macro is compared to the micro. That is, the comparison between the fully-detailed and most fine-scaled description of a system (the micro) and some reduced description (the macro). His criticism was that the macroscale intervention distribution is different from the microscale intervention distribution. His point was that, if you use the same intervention distribution in both cases, you get an equivalent bit value. Of course, this is tautological, as you are just doing the same thing in both cases. Mathematically, using the same intervention distribution would mean that the macroscale EI is weighted by the number of microscale states within each macrostate. His further claim was that this should be done to compare the microscale and macroscale fairly.
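The tautology is easy to see in a toy case. The sketch below uses a hypothetical four-state microscale (three states hopping randomly among themselves, one staying put) and the two-state macroscale obtained by grouping the three noisy states. Weighting the macro-interventions by the number of microstates they contain, [3/4 1/4], is my reading of the proposal for this example.

```python
import math


def mutual_information(input_dist, tpm):
    """I(X;Y) in bits when X is set according to input_dist."""
    n_out = len(tpm[0])
    effect = [sum(px * row[j] for px, row in zip(input_dist, tpm))
              for j in range(n_out)]
    return sum(
        px * p * math.log2(p / q)
        for px, row in zip(input_dist, tpm)
        for p, q in zip(row, effect)
        if px > 0 and p > 0
    )


third = 1.0 / 3.0
micro = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
macro = [[1.0, 0.0], [0.0, 1.0]]  # macrostates {0, 1, 2} and {3}

micro_ei = mutual_information([0.25] * 4, micro)
macro_ei = mutual_information([0.5, 0.5], macro)             # a macroscale A/B test
macro_weighted = mutual_information([0.75, 0.25], macro)     # weighted by microstate counts
print(round(micro_ei, 3), round(macro_ei, 3), round(macro_weighted, 3))
```

The weighted macroscale value matches the microscale value, as it must: it is the same experiment described in different words, and it is no longer a uniform randomization over the macrostates.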

There are strong reasons to reject the proposal. The first is that EI at a macroscale would no longer be a measure of causal structure. For instance, EI would no longer be able to be broken down into the determinism, degeneracy, and size of a system. It wouldn’t tell you anything about the control, noise, or coupling between macro-variables, and would no longer be a function of the macroscale conditional probabilities, no longer be an A/B test, etc.

Another strong reason to reject this proposal is that using the intervention distribution of the microscale at the macroscale implicitly assumes knowledge of the microscale (knowing the constituents that make up the macrostates). And since a macroscale is a dimensionality reduction, this violates the very definition of a macroscale. We would be comparing apples and oranges.

And even worse, the EI wouldn’t even be calculable at the macroscale by itself. Consider again a simple switch/bulb system where the light doesn’t flicker:

Normally, the EI would be 1 bit. But the {up} state of a switch is actually a macrostate. There are many atomic configurations that the switch could be in (say, slightly different temperatures) while still being {up}. Same with the bulb’s {on} state. According to Scott’s proposal, the EI isn’t 1 bit, but some value weighted by how many microstates make up each macrostate. So to actually calculate the information in the causal relationship, you must first know the number of atoms that make up the light switch and light bulb (i.e., abandon your macroscale model). This proposal effectively makes the notion of a do(x) operator at anything but the ultimate microscale of physics nonsensical. This is both unattractive on the theory side and also contrary to how people operate when actually assessing causal relationships.
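A small sketch of what that dependence looks like. The microstate counts are entirely made up, and the function name is hypothetical. For a noiseless switch/bulb relationship, the information carried is just the entropy of the intervention distribution, so weighting do(up) and do(down) by microstate counts changes the answer.

```python
import math


def weighted_switch_ei(n_up, n_down):
    """Bits carried by a noiseless switch -> bulb relationship when do(up) and
    do(down) are weighted by hypothetical microstate counts."""
    total = n_up + n_down
    return -sum(
        (n / total) * math.log2(n / total) for n in (n_up, n_down) if n > 0
    )


for n_up, n_down in [(1, 1), (3, 1), (99, 1)]:
    print(n_up, n_down, round(weighted_switch_ei(n_up, n_down), 3))
```

Only the equal-count case recovers the full 1 bit. Every other answer depends on a count that the macroscale model, by definition, does not contain.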

So maybe the specific proposal Scott gave to use the same intervention distributions has little to recommend it. Still, isn’t using different intervention distributions at different scales an issue of normalization? There are, again, good reasons not to think so. My paper shows that a change in the intervention distribution across scales is analogous to changes in an information channel’s input distribution. It would be a mistake to call Shannon’s description of how to increase information transmission over a noisy channel “an issue of normalization” merely because the input distribution changes depending on the encoding of some signal. Since the point of my paper is that these two things are mathematically equivalent, I’d say either a) the channel capacity itself is merely an issue of normalization, or b) the term normalization is not appropriate to input/intervention changes.

Why does causal emergence matter?
The theory does imply that universal reductionism is false when it comes to thinking about causation, and that sometimes higher scales really do have more causal influence (and associated information) than whatever underlies them. This is common sense in our day-to-day lives, but in the intellectual world it’s very controversial. More importantly, the theory provides a toolkit for judging cases of emergence or reduction with regard to causation. It also provides some insight about the structure of science itself, and why it’s hierarchical (biology above chemistry, chemistry above physics). One reason the theory provides is that scientists naturally gravitate to where the information about causal structure is greatest, which is where their experiments are most rewarded in terms of information, and this won’t always be the ultimate microscale. There are also specific applications of the theory, some of which are already underway. These are things like figuring out what scale of a biological or nervous system is most informative to do experiments on, or the causal influence of one part of a system over another part, or whether macrostates or microstates matter more to the behavior of a system.

As time goes on, I’m sure I’ll criticize various ideas in my work. But I’ll make sure that I always keep an open mind when I do and never rush to judgment.