Survival bias in genealogical materials
I’ve been working on lineage for a while, particularly on trying to bridge the divide between what we know about Song-Yuan developments and what we know about Ming-Qing developments. I did this by putting together a large corpus of genealogical essays, primarily prefaces (puxu 譜序), for both distant and close reading.
Recently, I have been doing due diligence, by comparing my corpus to the known corpus of genealogical essays by Ming intellectuals (or at least those published in the Siku quanshu 四庫全書). And I have found some problems with my existing understanding. Not my readings of specific essays, but some of my assumptions about their representativeness. I think it is important for scholars to be open about this type of error, so here goes: an inquiry into the biases in my data set and the mistakes they led me to make.
Let’s start by comparing my corpus to the known corpus of genealogical essays. So far I have only made it through scholars active up to the 1570s, but this is enough to see the problem:
Writers: 84 (my dataset) of 256 (known)
Essays: 510 (my dataset) of 1257 (known)
I found about 40% of the essays and 33% of the writers. Depending on which essays made it into my dataset, that might or might not be a big deal. It turns out to be a big deal: the discrepancy grows when you split the dataset into two periods.
1350s-1450s
Writers: 38/77 = 49%
Essays: 335/644 = 52%
1460s-1570s
Writers: 46/175 = 26%
Essays: 175/597 = 29%
In other words, I found about half of all known early Ming genealogists, but only about one quarter of mid-Ming genealogists.
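As a sanity check on the arithmetic, the coverage rates can be recomputed directly from the counts listed above. This is only a minimal sketch in Python; the dictionary layout and the function name `coverage` are illustrative choices, not part of any existing tool.

```python
# Counts copied from the figures quoted above: (found in my corpus, known).
counts = {
    "1350s-1450s": {"writers": (38, 77), "essays": (335, 644)},
    "1460s-1570s": {"writers": (46, 175), "essays": (175, 597)},
}


def coverage(found, known):
    """Share of known items that were found, as a whole-number percentage."""
    return round(100 * found / known)


# Coverage rate for each period and each kind of item.
rates = {
    period: {kind: coverage(*pair) for kind, pair in kinds.items()}
    for period, kinds in counts.items()
}

for period, kinds in rates.items():
    print(period, kinds)
```

Keeping the raw counts next to the computed rates makes it easy to rerun the comparison as more decades of the known corpus are checked.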
I assumed that my sampling was consistent over time, when in fact I was missing far more from the mid-Ming than from the early Ming. This had major consequences for my conclusions. I want to talk about one in particular:
I convinced myself that there was a peak era of public genealogy, approximately 1330–1470, when leading intellectuals were especially interested in promoting genealogy as a tool for lineage formation. I have even presented two conference papers built, in part, around this idea. Here is the point visually:
The first thing I noticed was the number of genealogical essays written by the most prodigious genealogists of each era (green line). In the early 1400s, there were several writers who produced more than 40 essays each, but after 1460 no writer did so (at least that we know of). This was my first piece of evidence for a peaking of genealogical interest in the early Ming.
The median number of essays-per-writer appears to confirm this (blue line). Note that the median number of essays-per-writer basically tracks the peaks: it goes up substantially in the 1360s, and again between about 1410 and 1440. The overall median number of essays-per-writer for the entire period is 2, but for the early 1400s, the decadal medians are between 5 and 7. Given that the median is not particularly affected by outliers, I was confident that I had confirmation of my conception of the early 1400s as a period of intense genealogical interest.
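For anyone who wants to build the same two summary lines from their own data, here is a minimal sketch of the per-decade maximum and median. The records are hypothetical toy values, not figures from my corpus, and the tuple layout and the function name `decade_summary` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (writer label, decade of activity, number of essays).
records = [
    ("A", 1410, 42), ("B", 1410, 6), ("C", 1410, 5),
    ("D", 1500, 12), ("E", 1500, 2), ("F", 1500, 1), ("G", 1500, 1),
]


def decade_summary(records):
    """Group essay counts by decade and report the maximum and the median."""
    by_decade = defaultdict(list)
    for _writer, decade, n_essays in records:
        by_decade[decade].append(n_essays)
    return {
        decade: {"max": max(essay_counts), "median": median(essay_counts)}
        for decade, essay_counts in sorted(by_decade.items())
    }


summary = decade_summary(records)
for decade, stats in summary.items():
    print(decade, stats)
```

The median resists a single prolific outlier, but it says nothing about writers who are missing from the sample altogether.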
But wait a minute. Notice that my “era of public genealogy” essentially coincides with the period when I found a large percentage of extant genealogical essays. It seems like this apparent phenomenon may be the product of bias in my evidence.
Is it possible I merely missed the leading genealogists of the mid-Ming — the period when I only found about 1/4 of known writers?
It turns out that I did miss some fairly important mid-Ming genealogists, including Zou Shouyi 鄒守益 (1491–1562), Ouyang De 歐陽德 (1495–1554), and Luo Hongxian 羅洪先 (1504–1564). That is a problem.
There is a bigger problem though, and it has to do not with my sampling but with the Siku quanshu’s sampling. In other words, it has to do with survival bias.
Here we see that until the late 1400s, there are about 6–10 writers per decade with genealogical materials surviving in the published works. But after 1500, the number of surviving works appears to begin a secular increase, even beyond the decadal variation. In other words, it seems likely that for the mid-Ming I missed a lot of material simply because there is more material to miss.
More importantly this makes it quite likely that there is greater bias in the materials that survived from the early Ming than in the materials that survived from the mid-Ming onward. And this can be seen in another figure:
The historical norm is that a small number of writers produced a very large number of essays, and a much larger number of writers produced a very small number of essays. But we see here that there are three decades in which every surviving work that contains genealogical essays contains more than one of them; that is, no surviving writer from those decades left just a single essay. This is not the distribution we see in the real world. It means that the essays surviving from the early Ming are far more likely to be by prodigious writers. In other words, the works of minor writers did not survive.
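The warning sign described here, a decade with no single-essay writers at all, is easy to test for mechanically. The following is a minimal sketch on hypothetical toy records (not my actual data); the function name `decades_without_single_essay_writers` is an illustrative assumption.

```python
from collections import defaultdict

# Hypothetical toy records: (writer label, decade of activity, number of essays).
records = [
    ("A", 1410, 42), ("B", 1410, 6), ("C", 1410, 5),
    ("D", 1500, 12), ("E", 1500, 2), ("F", 1500, 1), ("G", 1500, 1),
]


def decades_without_single_essay_writers(records):
    """Return the decades in which every surviving writer has more than one essay."""
    by_decade = defaultdict(list)
    for _writer, decade, n_essays in records:
        by_decade[decade].append(n_essays)
    # A decade is suspicious when even its least prolific writer has 2+ essays.
    return [
        decade
        for decade, essay_counts in sorted(by_decade.items())
        if min(essay_counts) > 1
    ]


suspicious = decades_without_single_essay_writers(records)
print(suspicious)
```

A flagged decade does not prove that minor writers were lost, but it marks where the surviving sample departs from the long-tailed pattern seen elsewhere.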
In other words, the “era of public genealogy” I identified is almost entirely the product of survival bias: the only writers whose works survived from the early Ming were significant writers who tended to produce lots of significant essays.
This is a long-winded way of saying that I was wrong. And we as historians need to be careful about this effect, as it appears all over the place.
One final note. Another effect I observed does appear to hold in the larger sample: writers affiliated with Ji’an and Fuzhou in Jiangxi, and Jinhua in Zhejiang produced and published far more genealogical essays on average than people from anywhere else. This phenomenon is especially notable in the early Ming, but it holds in general. For example, note that all three significant mid-Ming genealogists that I previously missed are from Ji’an!