The Trouble with Experiments, and How to Take Advantage of Them.
This post follows up on an earlier one about the perils of selection bias.
An endemic problem with observational studies is that the assignment of the “treatment” is not controlled. In an ideal universe, the effect of something on a set of “subjects” can be evaluated only if the subjects who are “treated” and those who are not can be considered effectively the same. If the subjects in both groups were exactly the same before the “treatment” took place, whatever difference crops up afterward can only be attributed to the “treatment.” At least that is the idea.
In practice, two problems intervene. Even in “real” experiments, the subjects are rarely exactly the same on both sides; it’s simply impossible. Usually, the hocus pocus of randomization is invoked to claim that the samples are “effectively” the same, but randomization is a very dangerous thing. Suppose that subjects come in two flavors that occur with equal probability, H and T. If we were to randomly (i.e. the draw for each group is independent of the other) select N subjects each for the control and experimental groups, what are the odds that the two groups are the “same”? (And how similar do the samples need to be?) Trivially, if N=1, there is a 50% chance that they will be different: the probability that both are H is 1/2 × 1/2 = 1/4, and the probability that both are T is the same, so they will be different with probability 1 - 1/4 - 1/4 = 1/2. For larger samples, it gets tricky: the probability that the two groups have exactly the same composition becomes vanishingly small as N increases, while the probability that they are “close” in proportion approaches 1. If N=100, the probability that a sample will contain fewer than 45 T is about 14%, and the same holds for more than 55 T, so each sample has roughly a 73% chance of containing between 45 and 55 T’s. The probability that both samples fall in that range is therefore only about 53%. Is this close enough? Maybe. Maybe not. But this is assuming that we know the distribution of how the subjects vary. The trick is that, in many cases, we do not know. So we are taking it on faith, almost literally, even when we randomly assign subjects, that the probability gods are working in our favor to keep the samples more or less the same.
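The H/T arithmetic can be checked directly with exact binomial sums. The following is a minimal sketch, assuming each subject is independently H or T with probability 1/2; the helper names `binom_pmf` and `prob_between` are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k T's among n independent subjects."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_between(lo, hi, n, p=0.5):
    """Probability that the number of T's lies in [lo, hi], inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

n = 100

# N=1: the two single-subject groups differ unless both are H or both are T.
p_differ_n1 = 1 - sum(binom_pmf(k, 1) ** 2 for k in range(2))

# One sample of 100: fewer than 45 T's, and between 45 and 55 T's.
p_fewer_than_45 = prob_between(0, 44, n)
p_one_in_band = prob_between(45, 55, n)

# Two independent samples: both in the 45..55 band, and exactly equal counts.
p_both_in_band = p_one_in_band ** 2
p_identical = sum(binom_pmf(k, n) ** 2 for k in range(n + 1))

print(f"N=1, groups differ:        {p_differ_n1:.3f}")
print(f"fewer than 45 T (N=100):   {p_fewer_than_45:.3f}")
print(f"one sample in 45..55:      {p_one_in_band:.3f}")
print(f"both samples in 45..55:    {p_both_in_band:.3f}")
print(f"identical counts (N=100):  {p_identical:.3f}")
```

Under these assumptions the tail probability comes out near 0.14 and the probability that both samples fall in the band near 0.53, while the chance of exactly matching counts is already below 6% at N=100.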
Of course, we use statistics because we can’t randomly assign subjects in the first place. We only know the results. We know that the treated group is, say, taller than the untreated group. For all we know, heads are naturally taller than tails. If, somehow, heads are treated far more often than tails, then it should not be shocking that the treated are taller: heads are treated and heads are taller, and perhaps the height has nothing to do with the treatment. Again, the statistical reconstruction of an experiment is, conceptually, trivial: we simply wish to compare (treated|head) vs. (not treated|head) and (treated|tail) vs. (not treated|tail). Assuming that, in the beginning, all heads were the same and all tails were the same, (treated|head) minus (not treated|head) will give us the effect of (treatment|head), and (treated|tail) minus (not treated|tail) the effect of (treatment|tail). Simple enough in principle, but not very easy in practice.
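A small simulation makes the heads-and-tails comparison concrete. This is only a sketch under invented numbers: the heights, the treatment probabilities (80% for heads, 20% for tails), and the true effect of 1.0 are all assumptions chosen for illustration, not values from any real study.

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 1.0  # assumed effect of treatment on height, same for H and T

rows = []
for _ in range(20000):
    kind = random.choice("HT")
    # Assumption: heads are treated far more often than tails.
    treated = random.random() < (0.8 if kind == "H" else 0.2)
    # Assumption: heads are naturally taller than tails.
    base = 175.0 if kind == "H" else 165.0
    height = base + TRUE_EFFECT * treated + random.gauss(0, 5)
    rows.append((kind, treated, height))

def diff(subset):
    """Mean height of treated minus mean height of untreated."""
    treated = [h for _, t, h in subset if t]
    control = [h for _, t, h in subset if not t]
    return mean(treated) - mean(control)

# Naive comparison ignores the H/T split and absorbs the natural height gap.
naive = diff(rows)
# Stratified comparison: (treated|head) minus (not treated|head), same for tail.
by_kind = {k: diff([r for r in rows if r[0] == k]) for k in "HT"}

print(f"naive difference:       {naive:.2f}")
print(f"difference within H:    {by_kind['H']:.2f}")
print(f"difference within T:    {by_kind['T']:.2f}")
```

The naive difference is several times larger than the assumed effect, while the within-group differences land close to it. That only works here because the simulation has plenty of observations in every cell and no hidden variation within heads or tails.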
Two problems intervene. First, very often, we do not have many observations, or, indeed, any observations, of (treated|tail) and (not treated|head). Second, not all heads are alike and not all tails are alike, and the heterogeneity within the subsets often conceals variables that are likely associated with why those that are “treated” are treated. A classic illustration of this shows up in voting in Congress. Democrats vote as they do partly because Democrats are more liberal and liberals vote together because they agree on things, and partly because the Democratic Party, as a political organization, enables a complex set of deals among its members that causes them to vote together whether or not they agree on things. It takes a heroic set of assumptions to draw estimates of legislator “ideology” from the voting records. While both forces are doubtlessly at work, the problem on the whole is underidentified, and there is no way in heck that one could conclude it is one or the other without knowledge specific to how parties work, i.e. the introduction of “instruments,” as conceptually defined, even if not necessarily following the formulaic definition thereof.
The concept of instruments is a peculiar one, with a rather longer history than the development of the formulas and with a very close tie-in to the idea of experiments. In the formal experiment, we artificially force the control and experimental groups to be the same, more or less. In the use of instruments, we recognize that the control and experimental groups are not the same, but also that there are factors that cause the difference to manifest itself in a different fashion. In the case of radio and the New Deal, there is a recognition that those who are more politically active might both exhibit greater policy-election linkage and buy radios. The presence of the geological condition would dampen their inclination to buy a radio, even if they are interested, although those who are politically interested would still exhibit policy-election linkage more than those who are not, even without radio. The additional effect from the radio would show up in areas where radio is more widespread, though, again, we are taking it on faith that the same sort of political activism exists regardless of geological conditions (or, at least, can be accounted for through the variables that are known). This is not quite the same as radio “causing” the linkage, in the literal sense at least, but it is equivalent to saying that, where there is radio, political activism goes further in causing policy-election linkage. Much the same thing has been done with regard to Congressional votes: different types of votes yield different DW-NOMINATE scores. As one might say, on some votes, you vote party, on others, you vote district, on others, you vote something else, or whatever.
The Rossman article linked in the previous piece, however, points to something far more important than just setting up a statistical analogue to an “experiment.” The statistics of causation do not actually capture causality per se, but conditional distributions, or, in the language of logic, “necessary” conditions, except that, in this case, it’s a matter of degree rather than a dichotomous phenomenon. The insight about radio in the radio-New Deal article is not so much that radio “causes” the policy-election linkage, but that the policy-election linkage is magnified conditional on radio, or rather, on the geological conditions that enable radio. (In the dichotomous necessary conditions lingo, it would be equivalent to saying that you see the linkage only in the presence of favorable geological conditions, but this is not quite the case.) The geological condition hindering adoption of radio dampens this linkage, much the way the Republican Party condition dampens the linkage between support for pot and support for greater redistribution. Both operate by restrictions on the “latent” sample: if the populations in the limited radio areas had access to the information disseminated via radio, they would have exhibited the same electoral behavior as those where radio reception was good, much the way the GOP would show the same linkage as the general population had the informal ideological restrictions on GOP affiliation not been present. The interesting information is much less about the “causality” than about the consequences of these restrictions; we still don’t know the causality.
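The idea of a linkage that is “magnified conditional on” something can be sketched as a slope that differs between subsets. The numbers below are entirely invented: `linkage=0.8` for good-reception areas and `linkage=0.3` for poor-reception areas are assumptions for illustration, as are the variable names, and nothing here is taken from the radio-New Deal article.

```python
import random

random.seed(1)

def slope(pairs):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def simulate(n, linkage):
    """Hypothetical areas: electoral response = linkage * policy + noise."""
    out = []
    for _ in range(n):
        policy = random.gauss(0, 1)
        vote = linkage * policy + random.gauss(0, 1)
        out.append((policy, vote))
    return out

# Assumed: the same underlying process, with the linkage dampened where
# geology hinders reception.
good_reception = simulate(5000, linkage=0.8)
poor_reception = simulate(5000, linkage=0.3)

slope_good = slope(good_reception)
slope_poor = slope(poor_reception)
print(f"policy-election slope, good reception: {slope_good:.2f}")
print(f"policy-election slope, poor reception: {slope_poor:.2f}")
```

The comparison of the two slopes describes how the conditional distribution shifts across subsets. It says nothing by itself about why the slopes differ, which is the point of the paragraph above.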
This implies a subtle shift from the focus of an “experiment.” The conceit of the experiment is that the experimental and control groups are exactly the same, except for the treatment. Therefore, whatever difference there is must be due to the treatment. The implication above is somewhat the opposite. Different subsets of data that are formed “naturally,” and are decidedly not equivalent, exhibit different patterns. How are they different and why? This way of thinking is, in a way, analogous to analytical chemistry. (Trying to identify mystery chemicals is something that I was extremely fond of when I was still a science student.) Say you have some mystery salt. What can you do to identify what’s in it? You’d react it with known chemicals and see what pops up, of course, after going through its physical characteristics. You know properties of, say, silver salts not dissolving in water, except nitrates, or something. After enough rounds and a process of elimination, you’d arrive at a narrow list of plausible candidates. The process is analogous: the means of investigation is a series of “interactions,” or conditional distributions (or refractions, as I occasionally call them; all the same thing). The “experiment,” simulated via statistical means, is more like the tests run by an analytical chemist in the sleuthing process (or the artificial collisions induced in an atom smasher): an investigation of conditional effects that reveal something about the underlying process, not something that is necessarily “causal” in nature, although some probably are. But this leap requires relaxing the conceit that we are necessarily investigating “causality.”