Book Review: The Book of Why

Dan Saunders
Dec 31, 2018 · 4 min read

Author: Judea Pearl

Published: May 2018


Note: I wrote most of this blog post in the summer of 2018, and then forgot about it. I recently remembered it, so it’s time to finish it!


The Book of Why is perhaps my favorite book I’ve read so far in 2018. It’s a popular science book, but it seems geared towards those with at least some understanding of probability and statistics. The book is about causal inference: what it is, when and why it was developed, and why it’s “necessary”. The main concept throughout the book is the “ladder of causation”, which has three rungs:

  1. Association: Having to do with seeing / observing events, and questions of the form “What if I see X?”, or “How are the variables X and Y related?”
  2. Interventions: Doing, intervening; with questions like “What would Y be if I do X?”, or “What needs to happen in order for Y to occur?”
  3. Counterfactuals: Imagining, retrospection; with questions like “What if X had not occurred? Did X cause Y?”
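To make the gap between the first two rungs concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the hidden common cause Z, the probabilities, the function names); none of it comes from the book. Y never depends on X, yet seeing X = 1 raises the probability of Y to roughly 0.74, while doing X = 1 leaves it at roughly 0.5. The third rung is not covered here.

```python
import random


def sample(rng, do_x=None):
    """Draw one (x, y) pair. Passing do_x overrides X's own mechanism."""
    z = rng.random() < 0.5                      # hidden common cause (assumed)
    if do_x is None:
        x = rng.random() < (0.9 if z else 0.1)  # seeing: X listens to Z
    else:
        x = do_x                                # doing: X is set by fiat
    y = rng.random() < (0.8 if z else 0.2)      # Y depends on Z only, never on X
    return x, y


def p_y_given_see_x(n=100_000, seed=0):
    """Rung 1: estimate P(Y=1 | X=1) from passively observed samples."""
    rng = random.Random(seed)
    ys = [y for x, y in (sample(rng) for _ in range(n)) if x]
    return sum(ys) / len(ys)


def p_y_given_do_x(n=100_000, seed=0):
    """Rung 2: estimate P(Y=1 | do(X=1)) by forcing X=1 in every sample."""
    rng = random.Random(seed)
    ys = [y for _, y in (sample(rng, do_x=True) for _ in range(n))]
    return sum(ys) / len(ys)


if __name__ == "__main__":
    print("seeing X=1:", p_y_given_see_x())
    print("doing  X=1:", p_y_given_do_x())
```

The two numbers differ only because Z influences both X and Y, which is exactly the situation a randomized trial is designed to break.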

A central argument in the book is that classical statistics only deals with questions on the first rung of the ladder of causation; i.e., causation is all but forbidden in the discipline. In particular, statistics is mostly concerned with the reduction or summarization of data, while causal inference is further concerned with discovering the strength and directionality of relationships between variables. Statistics has embraced the randomized controlled trial (RCT), an important tool for untangling causal effects (say, in studies of drug effectiveness), but much more sophisticated methods have been developed since. Such causal inference methods are intended for cases where RCTs are infeasible or unethical; e.g., when a study would require randomly assigning a potentially harmful treatment or lifestyle modification (say, smoking).

An important tool of practitioners of causal inference is the structural causal model, from which a causal graph can be drawn. Every structural causal model determines a causal graph. However, it’s typically easier for humans to think about graphs (and for computers to work with equations!), so I’ll focus on graphs here. A causal graph is given by a set of nodes representing random variables and directed edges between them representing causal links. The variables are either endogenous or exogenous: the former are deterministic functions of some number of other variables, and the latter are sampled from some distribution. An important point is that the functional relationships between variables are not restricted to any particular class (but may be assumed to be linear, parametric, etc.). Causal graphs are typically acyclic.
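As a rough sketch of how the equations and the graph relate, the snippet below writes a tiny structural causal model as a dictionary of structural equations and reads the graph’s edges off it. The variables (rain, sprinkler, wet grass), the probabilities, and the helper names are all assumptions chosen for illustration, not something taken from the book.

```python
import random

# Structural equations: name -> (parent names, function of parent values and noise u).
# Parents are listed before their children, so evaluating in order is safe.
EQUATIONS = {
    "rain": ((), lambda u: u < 0.3),
    "sprinkler": (("rain",), lambda rain, u: (not rain) and u < 0.5),
    "wet_grass": (
        ("rain", "sprinkler"),
        lambda rain, sprinkler, u: (rain or sprinkler) and u < 0.95,
    ),
}


def causal_graph(equations):
    """Read the directed edges (parent, child) off the structural equations."""
    return sorted(
        (parent, child)
        for child, (parents, _) in equations.items()
        for parent in parents
    )


def sample(equations, rng):
    """Draw one exogenous noise term per variable, then evaluate each equation."""
    values = {}
    for name, (parents, f) in equations.items():
        u = rng.random()  # exogenous variable, assumed Uniform(0, 1)
        values[name] = f(*(values[p] for p in parents), u)
    return values


if __name__ == "__main__":
    print(causal_graph(EQUATIONS))
    print(sample(EQUATIONS, random.Random(0)))
```

Each endogenous variable is a deterministic function of its parents and its own noise term, and the graph is nothing more than the “who appears in whose equation” relation.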

Causal graphs are a useful tool for incorporating scientific knowledge into a model of a particular real-world process. For example, it is fairly obvious that changing atmospheric pressure affects the reading on a barometer, and not the other way around. Supposedly, there are ways to discover causal relationships from raw data, a process called causal discovery, but the book does not go into these methods. Once we have posited a graph, we can use collected data on the nodes in it to provide evidence for independencies implied by the graph. A graph criterion called d-separation tells us when two sets of variables are independent, or conditionally independent given some other set of variables.
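For the curious, d-separation can be checked mechanically. The sketch below uses one standard approach (keep only the ancestors of the variables involved, “marry” parents that share a child, drop edge directions, delete the conditioning set, then test connectivity). The function names and the extra “storm” node are my own assumptions for illustration, and the three sets are assumed to be disjoint.

```python
from collections import deque


def ancestors(parents, nodes):
    """Return `nodes` plus all of their ancestors; `parents` maps node -> parents."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, xs, ys, zs):
    """True if zs blocks every path between xs and ys in the DAG."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    # Moralize: undirected parent-child edges, plus edges between co-parents.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            nbrs[p].add(child)
            nbrs[child].add(p)
        for a in ps:
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    # Breadth-first search from xs that never steps onto a conditioned node.
    frontier = deque(xs - zs)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in ys:
            return False
        for w in nbrs[v] - zs - seen:
            seen.add(w)
            frontier.append(w)
    return True


# Pressure drives both the barometer reading and (hypothetically) storms.
FORK = {"barometer": {"pressure"}, "storm": {"pressure"}}
# Rain and a sprinkler both wet the grass (a "collider").
COLLIDER = {"wet_grass": {"rain", "sprinkler"}}

if __name__ == "__main__":
    print(d_separated(FORK, {"barometer"}, {"storm"}, set()))
    print(d_separated(FORK, {"barometer"}, {"storm"}, {"pressure"}))
```

In the fork, the barometer and the storm are dependent until we condition on pressure; in the collider, rain and the sprinkler are independent until we condition on the wet grass.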


Pearl is something of a zealot for the nascent field, and certainly has a personal stake in its success, so the excitement of the text should be taken with a grain of salt. Nevertheless, this excitement is part of why I liked the book so much. Many of the ideas are statistical, but re-cast in the light of causality. So, these aren’t necessarily new ideas, but perhaps better thought-out and, importantly, theoretically justified with causal terminology.

I think it would be interesting to see how concepts from causal inference might inform reinforcement learning (RL). In particular, should RL agents build causal models of their environments? The RL “perception-action” cycle (pictured to the left) contains directed edges, and can be made into a DAG via time-ordering; i.e., the environment at time t affects the interpreter at time t through an observation, which affects the agent at time t through the reward signal and state (processed observation), whose action(s) affect the state of the environment at time t+1.
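The time-ordering argument can be checked with a small sketch. The node names are my own, and the edge from the environment at one step to the environment at the next is an added assumption (the environment carries its own state forward). Without time indices the loop is a cycle; once unrolled it is a DAG.

```python
from graphlib import CycleError, TopologicalSorter


def unrolled_cycle(steps):
    """Map each time-indexed node to the set of its direct causes."""
    preds = {}
    for t in range(steps):
        if t == 0:
            preds["env_0"] = set()
        else:
            # The previous action changes the environment; the env_{t-1} edge
            # is an assumption that the environment also has its own dynamics.
            preds[f"env_{t}"] = {f"agent_{t - 1}", f"env_{t - 1}"}
        preds[f"interpreter_{t}"] = {f"env_{t}"}      # observation
        preds[f"agent_{t}"] = {f"interpreter_{t}"}    # reward and state
    return preds


def is_dag(preds):
    """True if the graph admits a topological order, i.e. has no directed cycle."""
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


# The same loop without time indices: env -> interpreter -> agent -> env.
COLLAPSED = {
    "env": {"agent"},
    "interpreter": {"env"},
    "agent": {"interpreter"},
}

if __name__ == "__main__":
    print(is_dag(COLLAPSED))
    print(is_dag(unrolled_cycle(3)))
```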

Building a detailed causal model of action -> environment state -> reward (where any node could be expanded into a set of random variables) may be a way to improve RL methods. Indeed, model-based RL builds a model of the environment’s transition dynamics and reward signal. Perhaps theoretical ideas from causal inference could spur progress in this endeavor.

Written by Dan Saunders

MSc student in computer science at UMass Amherst. Likes machine learning and brain analogies. https://djsaunde.github.io