Response to Scott Winship

Lyman Stone
In a State of Migration
May 11, 2021

My colleague at AEI, Scott Winship, and I have been having a debate recently about whether there have been meaningful changes in the rate at which women achieve their fertility preferences. This debate is a branch of a wider debate between the two of us over whether child allowances would be a good thing or not. I think yes, Scott thinks no. The relevance of this debate about preference-completion is simple: if preference-attainment has declined, it could be suggestive of a deteriorating environment for family formation, and thus might help justify a child allowance. If it hasn't declined, that might suggest the environment for family formation has not deteriorated, and so no child allowance is necessary (or at least it's no more necessary than in the past). That's the broad context. So let's dive in.

Edit: Originally published at about 2:30 AM on May 11; the next day, rather more awake, I made substantial edits to several parts of this. I have tried to mark edits where they are significant.

The Events

Scott wrote an article for The Dispatch arguing that two cohorts of the National Longitudinal Survey of Youth show no meaningful change in fertility undershooting, and thus that there is not good evidence of worsening fertility conditions. Note: The Dispatch is also AEI-affiliated, so this is basically an in-house debate. The only non-elected official mentioned by name in the article is me. "Fertility is below the level women say they want" is very much my schtick; I've made the argument in the New York Times, so even if I weren't named, it's clearly an article aimed at my line of research. But in case that wasn't obvious, Scott started his thread by saying "Shots fired at everyone…" and then proceeded to mention only one person at whom shots were fired: me. Finally, a quick note for students of media criticism: the article came out on a Friday.

So, Friday morning, Scott tags me on Twitter for an article about me and my work, published in an outlet for a think tank we both work at, and the theme of the article is that the most prominent part of my research is wrong. Obviously I was going to respond on Twitter, the medium on which Scott engaged me. So I took a few minutes and downloaded a few NLSY extracts; they didn't seem to match what Scott had been sharing, so I tweeted in criticism of his post. He got mad, accused me of making ad hominem attacks, refused to make full responses on Twitter, and suggested we both post our data online, and that was that. He posted his on Saturday. I take weekends off, so I posted mine Monday. Before posting it, I looked over his, and found several errors (edit: initial phrasing here was ambiguous; I found errors in my work). In the process of cleaning those errors I made a new file, because that was easier than just cleaning the old file. In the process of making that file (in about an hour on a Monday morning), I made some other errors. His post correctly notes those errors. Now I am responding.

That’s the sequence of events for those keeping track. Now comments.

Comments

I’ll divide my comments into three categories:

  1. Meta-Commentary
  2. NLSY
  3. Fertility Completion Generally

First, meta:

Meta Commentary

  1. Twitter- Scott criticizes me for engaging on twitter. This is irritating because I engaged on twitter because that is where Scott engaged me via tagging me and sharing links to my work. Scott says he wants to “cast a light on Lyman’s Twitter-optimized style of argumentation and why it falls short of the rigor we need in policy debates.” But quite literally folks, Scott is the one who tagged me into his thread for the debate.
  2. Collegiality- Scott says his article originated in an email exchange we had two years ago. The article is explicitly a response to my work and research. I am named in the article and the thread, my work is specifically identified; folks, it is indisputable that this is an article about my work. Scott links it to quotes from e.g. Josh Hawley about affordability, but as I'll note below, the NLSY data doesn't say anything about affordability one way or another. The specific question of completion vs. preferences, the entire framing of the post, is aimed at me. In a normal environment of collegiality, when one person is going to directly attack the work of a colleague in the same institution, it is considered at least polite to warn them. I have told Scott in advance on occasions when I was going to write something specifically responsive to him, even in cases where he was not named and his work was not linked. He is aware of articles I have waiting which have not yet been published. The reason I have done this is that I do not merely say that I respect Scott and his work; I do, in fact, respect him and his work. That is, if I'm going to make a go at tackling his work, I'm at least going to offer the minimum amount of politeness and tell him. At the very least, I'm not going to do a Friday dump on him.
  3. Public Debate- Which is not to say colleagues should not have fierce public debates! We should! And the debate between Scott and me (and more broadly between AEI's more economics-minded tax policy team + poverty studies team vs. religious conservatives writ large) has, I think, been good for conservatism generally. We should have a public debate. But that can be a debate about ideas, where we all say clearly what we are for and against and have an argument of ideas, or it can be subtweets.
  4. Bad Faith- Scott says I accuse him of arguing in bad faith and that I "impugn" him as a researcher. I regret that he heard me that way, as that was not intentional. Furthermore, hearing me that way was an error. I did not accuse Scott of arguing in bad faith. The relevant tweets are here; I accuse Scott of a motte-and-bailey, which is not a form of bad faith but a logical fallacy. Motte-and-bailey arguments have a narrow, defensible argument (the motte, the fortified keep) and an expansive, harder-to-defend argument (the bailey, the open ground around it). In such an argument, the arguer conflates the two different arguments. This can be duplicitous, of course, if the person does so with the intention to deceive; as best I can tell, I nowhere suggested Scott had an intention to deceive. More to the point, I specifically posited a non-duplicitous explanation, which as I understand it Scott agrees with: that the argument is simply part of Scott's effort to assemble a broad case against the child allowance. That's exactly what I said he's doing. I am not willing to grant that his article exists in perfect isolation and so should be criticized only on its own; it is part of (as Scott explicitly admits in our Twitter discussion) Scott's wider argument against child allowances. The "bailey" is "child allowances are bad." Scott's "motte" is "The NLSY shows that fulfillment of expectations has not changed over time, which indicates that affordability issues have not changed over time." These arguments can both be honestly made, and Scott can honestly think they're closely related. He is offering you the motte and claiming it is true and closely related to the bailey; my contention is that it is not closely related to the bailey, and perhaps not even true (but we'll get to that below).
  5. Publication and Scholarship- Scott wants to talk about the terms on which public policy debate occurs. But here I think he makes a mistake. Public policy writing is not scholarly research. It is not held to the same standards. And moreover, Scott's own article in The Dispatch is thinner on methodological statements than my Twitter thread is! Using a longer form to write up your results doesn't make replication easier, and Scott did not publish his data for replication until challenged. If Scott wants to argue that all public policy writing should be pre-registered with public data files, that's perfectly cogent, but that's not the standard he set for his own work. Scott wants to criticize people on Twitter, but have them respond with blog posts. It's a double standard. Note that I am not saying that Scott is acting in bad faith; people can have double standards in good faith, it's merely part of being human. None of us live up to our own standards. That's why public, adversarial processes are so important: they harness our worst impulses to create a collective good. So when I say that Scott demands different standards of others than he demands of himself, I am not accusing him of anything embarrassing or immoral; I'm merely saying that the things he sees as damning in others are damning of pretty much everyone, including himself, and a more charitable standard is therefore probably appropriate. When someone says they think you did something wrong in your code, they are not "impugning" you: they are doing their job as your peer, and the correct response is "thank you."
  6. Nonexistent Problems- I'm gradually getting more into the content of the debate, so here we'll get to what may simply be a lexical ambiguity. Scott's article is clearly about child allowances. You can know this because the first paragraph is links to proposals for child allowances. Any reasonable reader will interpret it primarily not as scholarly research, but as political speech. Again, this is not impugning Scott. It is not an insult to suggest that The Dispatch (for which I have also written!) is a political publication or that AEI (where I also work!) is a political organization. These are not ad hominems. These are statements of fact. Thus, statements made in such articles can reasonably be seen as intended to refer to political circumstances. And when the introduction to an article is about child allowances, the meat of the article is about arguments for child allowances, and the only other research cited is research in favor of child allowances, it is reasonable to suppose the article is about child allowances. This should help us understand the last sentence: "While child allowances and parent tax credits may have sensible rationales, policymakers should not offer them as solutions to problems that do not exist." This sentence is ambiguous. What problem is it that does not exist? I interpreted it as saying: "@swinshi 's arguments that below-desired fertility is a 'problem which does not exist,' or his related argument that marriage is not deterred for financial reasons, or his argument that financial concerns are not a major force suppressing fertility, are all wrong." But Scott says he meant "My entire piece is addressing the claim that women are less successful at achieving their desired fertility than in the past — that is the problem that does not exist." But there's the motte-and-bailey: what exactly is the problem that doesn't exist? Because the problem child allowances are intended to solve, from a pro-natal perspective, is "people having fewer children than is desirable." That's the argument pro-natalists are making. Scott wants to attack an extremely limited version of that argument: namely, the claim that there was not a problem for the women in the NLSY79 cohort, and that if there were a problem today, it would show up in the NLSY97 cohort. But nowhere does he explicitly say that is the problem which does not exist. It's ambiguous. Scott says I'm treating him unfairly by interpreting the "does not exist" problem as applying to "the things that motivate child allowances for pronatalists;" but I think what's actually happening is that he ended his piece with a punchy line that also happened to be extremely ambiguous, and in the interpretation Scott seems to prefer, it has little relevance to the actual question of child allowances, despite child allowances clearly being the point of the article. I mean, the mere fact that incomes have risen in real terms is ipso facto proof that on some sufficiently absolute level fertility has not "gotten more expensive," but that's an empty statement. Scott is attacking an argument which pronatalists are not making ("child allowances are primarily justifiable because fertility completion rates have fallen sharply between the cohort of women who were at peak reproductive years in the late 1980s vs. those in the late 2000s") in order to advance a different position which is only tenuously related ("child allowances are not a good policy").
  7. My Response- Before we get into the NLSY, I want to emphasize again what happened: Scott and I had an email exchange two years ago, and he got around to this article just recently; he gave me no heads-up that I was about to get "shots fired" at me (not everyone: literally just me) by a colleague; I responded on Twitter with about an hour's work while I was also doing other stuff (remember, folks: think tank stuff is Scott's full-time job; for me it is approximately 20% of my taxable income and under 30% of my working hours, so in terms of who can buy ink by the bucket here, I'm outgunned; I don't get paid to argue about politics, Scott does); Scott got upset at the Twitter format (despite having started the debate on Twitter) and suggested we swap data; I said fine; he shared the data he'd been working on for some period of time (2 years????); I shared the data I'd been working on for…. 2 hours? If my response was hurried and sloppy, that is because it was hurried and sloppy, because I was unfortunately not afforded much time to prepare a defense when an article essentially about me and my work was being published by my colleague.

NLSY: Who’s Right?

I do not have time to go through the NLSY yet again to figure out how much my results would change if I used slightly different weights and a handful of other things. There were errors in the data I shared on Friday, and more in the data I shared Monday (lumping in men in the Monday data: doh! That was a gaffe! Good catch, Scott!). It sometimes turns out that when somebody else has been preparing their argument for 2 years and you have been catching up for 2 hours, you make mistakes. I am happy to concede I made many. It's very plausible that, for what he is calculating, Scott is entirely correct, and that my later critiques, which failed to replicate his numbers, were wrong. But my main critique was not "Scott did the math wrong"; it was "the NLSY just doesn't say what Scott wants it to say." So it's important to think about what the NLSY is actually measuring.

NLSY Sample and Weights

I maintain that these surveys cannot tell us much about the question Scott wants to answer. The reason is that, as a matter of principle, the NLSY surveys cannot tell us anything about anyone other than the respondents in the survey. They have no general validity for the wider population. Let’s talk about why. I know many users of NLSY data will bristle at this, but I don’t think there’s any way to argue for general representativeness of this data.

When you take a survey, you try to get a random sample of the population, because a random sample, if it's big enough, is likely on average to represent the population, at least for certain mean estimates. If I have a jar of 10,000 jelly beans of 5 colors that's been shaken for an hour (and all jelly beans have the same density and size, so as to avoid the Brazil nut effect), I can probably draw out 200 or so to get a decent estimate of what the color distribution is. Simple enough.
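Here's the jelly-bean intuition as a toy simulation (the shares are my own made-up numbers): draw 200 beans from a well-mixed jar of 10,000 and compare the estimated color shares to the truth.

```python
# Toy sketch: a simple random sample of 200 from a well-mixed jar of 10,000
# recovers the color distribution reasonably well. Shares are invented.
import random
from collections import Counter

random.seed(42)
true_shares = {"red": 0.30, "green": 0.25, "yellow": 0.20, "orange": 0.15, "purple": 0.10}
jar = [color for color, share in true_shares.items() for _ in range(int(share * 10_000))]

sample = random.sample(jar, 200)  # simple random sample, without replacement
est = Counter(sample)
for color in true_shares:
    print(f"{color:>7}: true {true_shares[color]:.2f}, estimated {est[color] / 200:.2f}")
```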

No survey ever actually achieves a random sample of the population, because some people are harder to sample than others. There are various ways we correct for this, such as oversampling certain groups, specifying quotas, exerting greater effort for certain groups, etc. But at the end of the day, our survey sample is almost never quite representative. Still, if our recruitment procedure was good, we think it's probably good enough. If we are willing to provisionally accept that our sample is close enough to random (or that deviations can be accounted for analytically or via interpretation; for example, CPS is primarily a survey of the "household population," which is not in fact the entire population), then there's another step: we assign weights. Suppose we are interested in the sex ratio of a population on the whole, and we think that sex ratios might vary by race; suppose further that we know the racial mix of the population on the whole, but not its sex ratio. We'd take our sample and compare its racial distribution to the population on the whole. To the extent the racial mix differs from the population on the whole, we would hope we can attribute the difference to either 1) known and accountable sampling procedures or issues of some kind, or 2) small variations due to random chance. If we see a big deviation in our racial mix not related to known issues in our sampling strategy for which we can specifically account, we have a problem.
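For readers who want the mechanics, here's a minimal sketch of that weighting step, with entirely made-up numbers: each respondent is weighted by (population share of their group) divided by (sample share of their group), so over-sampled groups get down-weighted.

```python
# Minimal post-stratification sketch. Groups "A"/"B"/"C", counts, and the
# population shares are hypothetical; group C is deliberately over-sampled.
from collections import Counter

population_shares = {"A": 0.60, "B": 0.25, "C": 0.15}  # known from, say, a census

# Hypothetical respondents: (group, is_female).
sample = [("A", 1)] * 250 + [("A", 0)] * 250 + \
         [("B", 1)] * 70 + [("B", 0)] * 80 + \
         [("C", 1)] * 180 + [("C", 0)] * 170

n = len(sample)
sample_shares = {g: c / n for g, c in Counter(g for g, _ in sample).items()}
weights = {g: population_shares[g] / sample_shares[g] for g in population_shares}

raw = sum(f for _, f in sample) / n
weighted = sum(weights[g] * f for g, f in sample) / sum(weights[g] for g, _ in sample)
print(f"unweighted share female: {raw:.3f}, weighted: {weighted:.3f}")
```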

A classic example of this issue is in polling. "Dewey Defeats Truman" wasn't a problem of sample weighting, but of non-weightable bias: sample selection was non-random, and the people sampled were systematically unlike the non-sampled people. A more recent example is the polling errors of the last few election cycles; we happen to know that low-trust people are uniquely unlikely to respond to polls, and furthermore that low-trust people are increasingly sorting into populist-right-leaning politics. Ergo, there's a response bias specifically related to the dependent variable. No amount of weighting can correct for non-random sample selection. This is really important to grasp. It doesn't matter how much you weight on the observables if the unobservables driving selection into your sample are meaningfully correlated with your dependent variable.
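A small simulation makes the point concrete. Below, an unobserved trait ("trust") drives both survey response and the outcome; weighting on the observable (education, in this toy setup) barely moves the estimate. All numbers are invented.

```python
# Toy simulation: selection on an unobservable ("trust") correlated with the
# outcome cannot be fixed by weighting on an observable (education).
import random

random.seed(0)
pop = []
for _ in range(200_000):
    educ = random.random() < 0.35                          # observable
    trust = random.random() < (0.6 if educ else 0.5)       # unobservable
    outcome = random.random() < (0.3 if trust else 0.6)    # correlated with trust
    responds = random.random() < (0.8 if trust else 0.3)   # low-trust people dodge polls
    pop.append((educ, trust, outcome, responds))

true_mean = sum(o for _, _, o, _ in pop) / len(pop)
resp = [(e, o) for e, _, o, r in pop if r]

# Post-stratify on education only, since trust is unobserved.
pop_col = sum(e for e, *_ in pop) / len(pop)
resp_col = sum(e for e, _ in resp) / len(resp)
w = {True: pop_col / resp_col, False: (1 - pop_col) / (1 - resp_col)}
weighted = sum(w[e] * o for e, o in resp) / sum(w[e] for e, _ in resp)
print(f"true: {true_mean:.3f}, weighted survey estimate: {weighted:.3f}")
```

The weighted estimate lands near the raw respondent mean rather than the truth, because education carries almost no information about trust.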

So with that, let’s turn to the NLSY. We’ll focus on women for simplicity. The graph below compares NLSY women in the 1979 and 1997 cohorts, breaking out the main vs. supplemental samples (the supplemental sample is one of those oversamples I mentioned above). The share I’m showing is “what percent of respondents whom the NLSY identified to begin the survey and who are not dead as of the survey wave in question are refusing to respond or unlocatable.” I could not locate the share of screened-in people who were dropped at first wave by sex, so I assumed the share among women was the same as for the sample overall for each wave and sample group. Here’s what we get:

What you can see is that the NLSY 1997 main and supplemental samples had lost about 1 in 10 respondents before they even got started, and the 1979 supplemental had lost over a quarter of its respondents. Yikes!

But actually, I’m not worried about that initial rate of nonresponse. Nonresponse to some degree is normal, and that response rate isn’t actually that low.

The bigger issue is what happens over time. Those big NLSY1979 jumps are intentional drops of sample components, but the other changes you see are all related to respondents refusing the survey or being impossible to locate. This matters, because it means that we’re seeing sampling attrition over time.

Now, if that attrition is random with respect to variables of interest, then maybe it’s not a problem. And indeed, prior research suggests a lot of the attrition really is random across many economically interesting variables like employment. But it’s not random across others: sampling attrition is related to AFQT score, a measure of intelligence, for example.

So is attrition related to fertility preferences or fertility outcomes? Well, this turns out to be a bit tricky! On the one hand, it looks like the answer is yes: attrition rates are systematically higher for women with zero-child expectations, and systematically lower for women with high-parity expectations, at least through the mid-30s that Scott is analyzing. Note that a few of these women already had children and I’m not accounting for that. Also, I’m only using the main sample since in NLSY1979 a very large share of the supplemental sample was intentionally dropped at certain intervals.

While attrition is much higher for those low-expectations women, the trend over time is about the same. And by the way I’m only measuring among women who responded in 1979.

Even if the trend over time was wildly different, that wouldn’t necessarily tell us anything, because that correlation might be related to some other variable and not specifically selective on expectations.

If we assume that this is attrition selective on expectations, then we would conclude that women with high expectations are selectively more likely to remain in the NLSY1979 sample, at least through their mid-30s, where Scott measures. This is important, because women with high expectations would tend to undershoot, whereas women with low expectations would tend to overshoot, if fertility behavior has any regression to the mean (spoiler: it does). In other words, in NLSY1979, women likely to undershoot are non-randomly overrepresented due to attrition.
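Here's a stylized simulation of that mechanism, with invented parameters rather than NLSY data: completed fertility regresses toward the mean, attrition is lowest for high-expectation women (the NLSY79-style pattern), and the retained sample ends up overstating undershooting.

```python
# Stylized sketch, not NLSY data: expectation-correlated attrition plus
# regression to the mean biases the measured undershoot rate.
import random

random.seed(1)
women = []
for _ in range(100_000):
    expected = random.choice([0, 1, 2, 2, 2, 3, 3, 4])     # skewed toward 2-3
    mean_actual = 0.5 * expected + 0.5 * 2.0               # regression toward ~2
    actual = max(0, round(random.gauss(mean_actual, 1.0)))
    # NLSY79-style attrition: zero-expectation women attrit the most
    p_stay = {0: 0.60, 1: 0.70, 2: 0.78, 3: 0.84, 4: 0.88}[expected]
    women.append((expected, actual, random.random() < p_stay))

def undershoot_rate(group):
    return sum(e > a for e, a, *_ in group) / len(group)

retained = [w for w in women if w[2]]
print(f"full cohort undershoot rate: {undershoot_rate(women):.3f}")
print(f"retained sample undershoot:  {undershoot_rate(retained):.3f}")
```

Flip the p_stay gradient and the retained sample understates undershooting instead, which is the NLSY97-style pattern described below.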

So how does NLSY1997 compare?

Surprise, the relationship flips!

In NLSY 1997, attrition is higher among high expectations women, and lower among low expectations women. With more high-expectations women vanishing from the sample (i.e. with more “high likelihood of undershooting” women selecting themselves out of the sample), undershooting in NLSY will tend to be understated. The women most likely to undershoot had higher attrition, and the women most likely to overshoot (women with 0-child expectations) had lower attrition.

In other words, both samples reveal differences in who gets resampled based on fertility expectations, but the difference changes between the two surveys. Crucially, nobody is weighting based on surveyed expectations (not least because there aren't readily available population shares of expectations among young women). In other words, no amount of re-weighting on the usual observables like race or education will actually result in a final sample which is representative. Simply re-weighting the final sample to match the initial distribution doesn't cut it, because the issue isn't that you just got a different result due to random variation; it's that you plausibly had respondents selecting into or out of response status in a way fundamentally related to their childbearing expectations, and you don't actually know what that attrition function is.

These two samples have different attrition patterns vs. fertility expectations, and in both cases those patterns appear non-randomly associated with expectations. And conveniently, the net effect of that attrition is to generate an apparent reduction in undershooting across cohorts: an effect that is obviously relevant to Scott's thesis that completion rates have not declined.

More broadly, these kinds of relationships are very common. Attrition from longitudinal surveys is almost never random. Trying to generalize about a birth cohort after 20 years of attrition is just not reasonable. It is extremely unwise to try to make inferences from the NLSY population to the general population. For one extremely obvious reason, consider immigration: the NLSY does not add in new respondents to account for immigration, yet large shares of the birth cohorts of NLSY respondents at later years will be made up of immigrants (and non-trivial shares of the cohorts who were present when the NLSY was sampling may have emigrated!). For all these reasons, it’s best to simply say, “We observe X among NLSY respondents,” rather than, “The NLSY suggests X happened in society generally.”

The Questions

NLSY1979 asked three questions about fertility preferences in 1979: how many children women thought ideal, how many they desired to have, and how many more they expected to have. The ideals and desires questions were asked in 1979 and 1982. The expectations question was asked in virtually every survey wave from 1982–2012. For now we will ignore the ideals and desires questions and focus on expectations. In 1979, the question wording was:

“ALTOGETHER, HOW MANY (MORE) CHILDREN DO YOU EXPECT TO HAVE?”

And that wording remained roughly stable over time.

But question wording isn't all that matters. Priming effects are very real, as anybody who does survey work knows! So, in the 1979 survey, expectations were surveyed as one of a set of fertility questions (go to page 19 here). Respondents were first asked ideals, then personal desires, then fertility history, then expectations. In other words, by the time they got to the expectations question, these roughly 16- or 17-year-old women had already been asked quite a lot about their fertility preferences, and they'd been asked in an order well known to anybody who does family surveys: ideals, desires, intentions. This order dates back to the old KAP surveys and the 1955 US GAF, but to my knowledge nobody has formally tested the extent to which priming people with ideals may cause them to automatically downgrade their expectations. It's on my list of things to test in the fertility surveys I run, but I haven't gotten to it yet.

But regardless, what we want to know is if the 1997 survey asked this question in a similar way.

And of course, it did not. In fact, the 1997 round didn’t ask it at all. The expectations question was not asked until the 2001 survey, and even then only of an experimental subgroup of respondents. And why was it only asked of some groups? Quoting from NLSY’s website:

In round 5, respondents were randomly divided into four experimental groups of roughly equal size that were used to study response rate for this type of question [LRS: questions about expectations of the future]. Instead of answering expectations questions in a single section (as in previous rounds), they encountered them separately in each appropriate topical section. Expectations about education were asked in the schooling section, expectations about pregnancy were asked in the fertility part of the self-administered section, etc. The variable SYMBOL!TXGROUP indicates the experimental group to which each respondent was assigned. A non-experimental group of respondents skipped these questions in the topical sections and instead were guided through the expectations section as in previous rounds.

In the twitter discussions I had with Scott, I did not do a very good job explaining my concern with this question. As noted, I was blindsided by the whole line of attack, and was trying to remember why I thought this comparison was a bad idea when I first read over the codebooks several years ago.

So let’s note several differences here:

  1. NLSY1997 respondents were not primed with ideals and desires (i.e. they were not nudged to rationalize expectations downwards)
  2. Not all NLSY1997 respondents were asked the question (i.e. the sample size is going to be smaller and thus noisier)
  3. They answered the question at a different age (and with different prior childbearing)
  4. The specific reason NLSY1997 asked about fertility expectations was to test whether the question's placement would alter responses.

It was literally an experimental study in how question placement might alter responses. To me, “Small subsample that the survey designers fielded as an experiment in survey design” is not a winning mix. The objective of the experiment was to see if adding expectations questions outside of the regular expectations module would impact responses to the expectations module or questions in the module with newly-added questions, so it is reasonable to suppose that this question placement may not be directly comparable.

I do not know what the result of the experiment was, but I do know that it was not repeated. NLSY1997 would not ask another fertility expectations question until 2009. When they did, they would ask it of all respondents, and they would ask it only after a battery of questions asking about the “percent chance” respondents would do different things, a question battery present in 2001 in the expectations module. That is to say, when NLSY1997 returned to the fertility expectations topic in 2009, they had discontinued use of the arrangement they tested out in 2001, and instead had dumped a bunch of probabilistic questions in front of the final expectations question. In other words, they’d decided that before you ask a person their count-based expectations, you’ve just gotta prime them somehow. Instead of priming with ideals/desires, they primed with “percent chance you have at least one more kid” and stuff like that. Asking out-of-the-blue parity expectations questions evidently didn’t go well. You can contest this approach of course: maybe totally non-primed responses are better! But the presentation of the question did change.

In general, the NLSY1979 and NLSY1997 questions have some important commonalities: both ask similarly aged women about their expectations. In a large survey database, it is reasonable to include them both as noisy estimates of expectations among their initially-sampled populations. But comparing them directly, head-to-head, as identical questions is a mistake. The NLSY1979 question format is much closer to what is "standard" in family surveys, while the NLSY1997 Round 5 question was an experimental format which was not retained when the expectations module was reworked for fertility in Round 13 in 2009. If your goal is to say, "Yeah, NLSY1979 women expected about this many kids and NLSY1997 women expected about that many kids," fine. But for a horserace, they are too different to enable precise comparison, and they quite literally started the race at different times (i.e. different ages and parities).

Other Sampling Issues

And of course, we have the issue of selective sampling. Scott seems to think I made an egregious error here but I’m not quite following. Here’s what I get for the share of women who gave a valid response to the 2001 expectations question vs. their number of children in 2001:

Scott says that because this is statistically insignificant, it is meaningless. But that's a laughable claim: statistical significance is subjective, for one thing (i.e. what threshold for false positives we deem acceptable is a subjective choice; God did not hand down the 0.05 standard from on high), and secondly, it is absolutely false to claim that statistically insignificant effects cannot significantly impact downstream results. Throwing an individually statistically insignificant covariate into a regression can dramatically alter the significance of other variables. The difference observed here is not trivial. While the confidence intervals do overlap, the mean estimate for moms is well outside the interval for non-moms, and the mean estimate for non-moms is just barely within the interval for moms. Given the very small sample sizes we're working with here, ignoring that seems silly. The 133 "extra" responses from the non-moms (vs. what we'd see if they responded at the same rate as moms) could easily skew the sample, since they are about 16% of the responding sample. In power terms, that difference is vastly larger than you'd need to skew overall means.
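To see why a "statistically insignificant" response-rate gap can still matter, here's a back-of-the-envelope sketch; the counts below are hypothetical, chosen only to mimic a roughly 16-point gap, not the actual NLSY tallies.

```python
# Hypothetical counts: differential response shifts the mom share of the
# responding sample away from the mom share of the eligible sample.
n_moms, n_nonmoms = 400, 800           # eligible women, made-up split
rr_moms, rr_nonmoms = 0.56, 0.72       # response rates ~16 points apart

resp_moms = n_moms * rr_moms
resp_nonmoms = n_nonmoms * rr_nonmoms
extra_nonmoms = n_nonmoms * (rr_nonmoms - rr_moms)  # "extra" non-mom responses

print(f"mom share, eligible sample:   {n_moms / (n_moms + n_nonmoms):.3f}")
print(f"mom share, responding sample: {resp_moms / (resp_moms + resp_nonmoms):.3f}")
print(f"'extra' non-mom responses:    {extra_nonmoms:.0f}")
```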

It also seems like Scott thinks we should be using weights for these kinds of calculations, but this is clearly wrong. When calculating within-sample odds of question response, you don't want weights calculated from out-of-sample population characteristics; your population of interest is the sample itself, so you don't need weights, because the raw counts are the appropriate weights.

For 1982, the gap is smaller, but much more precisely estimated, and it runs in the same direction. Scott flagged an error in what I previously had for this. He's probably right that there was an error. As noted many times above, the graphs I previously shared were quite rushed. But this graph is correct:

So what we see is the same direction of bias. But whereas response rates in 1982 are only about 1% higher for kid-havers than for others, which I think is small enough that it wouldn't have the power to drive any significant differences, response rates in 2001 are 16% different, because no effort was made to ensure that the experimental groups were representative. Randomization with small sample sizes does not guarantee representativeness; randomization only approximates the true population mix as n increases. Since the NLSY1997 was randomizing across 4 groups, and Scott and I are interested in just half of each group (women), that ends up being a pretty restrictive sample, and so the randomness → representativeness relationship breaks down. Scott says, "These sample sizes are normal in social science research!" To which I say: yes. They are normal in social science research. Let the reader understand.
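Here's a quick simulation of that randomization point; the pool size and shares are illustrative, not the NLSY's actual numbers. Splitting a modest sample into 4 random groups and then keeping only the women leaves small cells, and the mom share of those cells bounces around the pool's true share.

```python
# Toy sketch: small randomized cells are not guaranteed to be representative.
import random

random.seed(7)
N = 1800
# (is_female, is_mom): a fixed pool of respondents to be randomized
people = [(random.random() < 0.5, random.random() < 0.3) for _ in range(N)]
pool_mom_share = sum(m for f, m in people if f) / sum(f for f, _ in people)

mom_shares = []
for _ in range(2000):
    random.shuffle(people)
    group = people[: N // 4]              # one of four random experimental groups
    women = [p for p in group if p[0]]    # keep only the women, as in our debate
    mom_shares.append(sum(m for _, m in women) / len(women))

mom_shares.sort()
print(f"mom share among all women in the pool: {pool_mom_share:.3f}")
print(f"middle 95% across random groups: {mom_shares[50]:.3f} to {mom_shares[-51]:.3f}")
```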

Fertility in NLSY

Finally, NLSY1979 surveyed women through the end of their reproductive careers, but NLSY1997 has not yet done so. The tempo of childbearing has shifted by a very large amount over that time. It is far from clear what the best way to deal with this is. Comparing women in their mid-30s is not a remotely meaningful fertility comparison; those women have a decade or more of childbearing left. Without some formal forecast of remaining births, the exercise is pointless.

But let’s use Scott’s data anyways and see where it leads us. We can use the files he posted (specifically his tabulations of differences vs. expectations in 1979 and 2001) to estimate the average gap vs. expectations. We can then compare that to the graph I shared of preference gaps:

What's interesting is that Scott identifies a somewhat larger gap vs. expectations than I observe when comparing birth cohorts to numerous surveys of expectations or intentions. But the trend is similar: the gap he estimates vs. expectations in the late 1980s is similar to what I show, and again for the late 2000s. But because he has only two points of comparison, he misses tons of the variation in between, and so misses what the trend is now, today, in the world we actually live in. And of course, comparing vs. expectations may be an invalid approach anyways, as I'll show below.
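For anyone who wants to replicate this kind of calculation, here's a minimal sketch; the tabulation below uses placeholder counts, not Scott's actual files.

```python
# Minimal sketch: given a tabulation of (children born minus children
# expected) with respondent counts, the average gap vs. expectations is
# just the count-weighted mean. Counts are placeholders.
tabulation = {-3: 40, -2: 120, -1: 310, 0: 520, 1: 240, 2: 90, 3: 30}

total = sum(tabulation.values())
mean_gap = sum(diff * n for diff, n in tabulation.items()) / total
undershoot_share = sum(n for diff, n in tabulation.items() if diff < 0) / total
print(f"mean gap vs. expectations: {mean_gap:+.2f} children")
print(f"share undershooting expectations: {undershoot_share:.1%}")
```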

Edit: I wrote this originally at 2 AM, so in my verbal description I compared to the wrong graph line. I woke up and corrected it in the morning.

But this brings us to our third major point…

Fertility Preferences and Completion Generally

Fertility Preferences for the US

Scott raises some questions about my estimates of preferences and completion. He says we don’t know how I make my estimates. We don’t know where they come from. They’re inconsistent across sources.

Man, really gotta do your lit review!

Here's the actual description of the source. And it also has a nice note at the bottom saying I'll share the database on request. But you don't even need to do that; here's a (now somewhat dated) public Google Sheet I've circulated links to many times. It isn't published anywhere more formal yet because compiling all the surveys is a ton of work, I'm constantly adding new surveys, and eventually I hope to make a real academic publication of it, and I dislike having my work stolen. But much as it is polite to notify one's colleagues before publicly attacking them, it is also polite to request that data be shared before one does the "just asking questions, why won't you share the data" routine. The data was online, described, and even downloadable in an older form when Scott wrote his critique.

To be honest though, I had hoped to have a publication giving this database its “big debut” last year. Alas, that fell through (for reasons unrelated to database quality: apparently you have to have novel theoretical contributions to publish things academically, not just “I have advanced the state of our knowledge about the geography and history of fertility preference by a gigantic leap”). And I’m slowly trying to migrate to OSF for more stuff. So here’s the link to the current working copy. And in a new file I’ve added to the arguing-with-Scott OSF page (link at bottom), I’ve included how I weighted each result. But look, let’s just look at the raw estimates!

What you can see is that these estimates have some noise to them, but intentions/expectations were very similar to ideals/desires/wants through, say, 1985 or maybe 1990. But then desires rose during the late 1990s and early 2000s, while intentions did not. And since COVID or just before, desires have begun to fall, but intentions have been falling further, and since earlier. Comparing fertility vs. intentions will necessarily get you a smaller estimate of the gap than comparing vs. other metrics.

Scott worries about how these preferences may vary: are there systematic biases in type of preference?

Great question! That’s a key part of my dissertation that I’m working on! But here’s just one example of how this stuff shakes out from a survey that I ran in March/April:

What I want to point out is that “happiest to have” question. Here’s how it looks in the survey:

And what I’ve done is weighted the responses by how many stars women clicked for each one.

The key point is that when you account for the fact that women might have non-linear happiness expectations for different numbers of children (and some women are apparently even bimodal!), and weight all their options by how happy they say each one would make them, American women say about 2.3 kids would make them happiest, even during the COVID-19 pandemic, which has suppressed birth rates. This is actually higher than their "general ideals," a framing Scott criticizes. Nor is my finding unique. A new paper presented at PAA this year by Jason Thomas and John Casterline found that if you estimate fertility preferences in Sub-Saharan Africa by accounting for wantedness of prior births in a creative way to adjust for upward rationalization, and if you then add on parity-specific odds of intending an additional birth, you actually get slightly higher fertility preferences than from the standard DHS ideals question (which I also duplicated above).
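For those curious about the mechanics, here's a minimal sketch of one plausible way to compute that star-weighted "happiest to have" number. The ratings below are invented, and the weighting formula is my reconstruction of the described computation, not the exact code behind the figure.

```python
# Sketch of star-weighting: each woman rates (1-5 stars) how happy each
# number of children would make her; her "happiest-to-have" number is the
# star-weighted average across the options. Ratings are invented.
ratings = [
    {0: 1, 1: 2, 2: 5, 3: 4, 4: 2, 5: 1},
    {0: 4, 1: 2, 2: 1, 3: 1, 4: 2, 5: 4},   # a "bimodal" respondent
    {0: 1, 1: 3, 2: 5, 3: 5, 4: 3, 5: 2},
]

def star_weighted_parity(stars):
    return sum(k * s for k, s in stars.items()) / sum(stars.values())

per_woman = [star_weighted_parity(r) for r in ratings]
print([round(x, 2) for x in per_woman])
print(f"sample mean: {sum(per_woman) / len(per_woman):.2f}")
```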

Note that my measure of intentions is extremely low. This is probably due to multiple factors: 1) response quality: you can see that I surveyed intentions using two different methods and got wildly different results; 2) COVID has wreaked havoc on near-term plans, which disproportionately influences intentions vs. desires. Once the next NSFG round comes out, I should get more clarity on what's happening in my sample.

But the point is that there is simply no reason to think that intentions or expectations are a cleaner estimate of preferences than ideals; indeed, ideals seem like a better proxy for what women say would make them happiest.

Rationalization and What Preferences Matter

Edit: This section substantially expanded the morning of the 11th

Finally, a crucial critique. Not only do I think that intentions are no better an indicator of preferences than ideals, I also think that intentions are actually a systematically unreliable indicator of "the baseline against which subjective wellbeing is constructed."

To start with, the entire use of "expectations" questions is probably invalid when thinking about welfare losses due to fertility undershooting (or overshooting). And notably, "expectations" questions have gradually been phased out of many new surveys because interpreting them is formally very difficult: they reflect neither what women desire nor even what they plan. In a paper I'll present at the Canadian Population Society's conference next week, I show that unexpected childbirth is not associated with any decline in subjective wellbeing below pre-conception levels, even though expected childbirth is associated with a significant increase above baseline, and that increase is durable and may even grow over the ensuing decade. I demonstrate this using data from the British Household Panel Survey. My theory on what's happening here is that some unexpected children are not undesired, since there is prior research showing that undesired children are associated with considerable declines in subjective wellbeing. In other words: the expectational status of a birth may not predict the actual utility cost of childbearing (or lack thereof), because it incorporates rationalizations about the facts on the ground, which may already reflect agents who have made their peace with diminished expected lifetime utility. Here are some of my slides showing some results:

Of course, there are two kinds of rationalization. One is adjusting ideals to fit reproductive outcomes; i.e., women who get to a certain age having had only 2 kids, when they said they desired 4, in many cases adjust their desire to 2 or 3. That is what is usually referred to as "rationalization." But a second, much less studied kind of rationalization relates to the interaction between kinds of preferences: women who get to a certain age, have 2 kids, continue stating a desire for 4, but reduce their intentions to 2 or 3. In this case, desires are not rationalized, but plans are. These two kinds of rationalization may be quite different.

As you saw above, rationalization of only intentions but not ideals is quite common: stated general or personal ideals as well as stated happiest-outcomes are much higher than stated intentions. This logically implies some amount of this kind of rationalization, especially since general and personal ideals are fairly similar to each other, and to happiest-outcomes. In other words, women’s responses to these three desire-related questions don’t evince huge divergences. It’s only when we switch to intentions that there’s a big numeric drop.

But can we predict rationalizations? One theory is that ideals/desires represent one kind of preference, but that preference also competes with other preferences, and women might rationally trade off fertility success for something else. There is some very strong suggestive evidence out there that one trade-off would be work: as women value the workplace more, maybe they trade off their fertility, and so their intentions fall below their ideals. They continue to express high ideals, but their lower intentions reflect the fact that they actually want other stuff.

I can test this. In two of my survey waves (so n~2,600 reproductive-age women), I gave women 10 life priorities they could rank. One of them was a “meaningful career.” Here’s the effect on personal ideals, total intentions, and the gap between them, with 95% confidence intervals marked, of a change in ranking of 3 spots for “meaningful career:”

So, women who value a career more do have lower intentions. But they also have lower ideals, such that the extent of the rationalization gap is teeny tiny compared to the effect on ideals, and it runs in the wrong direction. In other words, while it is true that women could be valuing the workforce more and this could be impacting fertility preferences, it appears to impact ideals even more than intentions. Indeed, as women value work more, the gap between their ideals and intentions shrinks, whereas across our society we’re seeing it widen. So “nowadays women just value work more so it competes for attention with fertility” is probably wrong as a theory.
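For concreteness, here's a rough sketch of the kind of regression that could produce estimates like these. The data below are simulated stand-ins (an invented data-generating process), not my actual survey microdata, and the variable names are mine.

```python
# Sketch: regress each outcome on the respondent's ranking of "meaningful
# career" and scale the coefficient by a 3-spot change in rank. The DGP
# here is invented; signs and magnitudes are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2600
career_rank = rng.integers(1, 11, n)      # 1 = top priority, 10 = lowest
ideals = 2.5 + 0.06 * career_rank + rng.normal(0, 1, n)
intentions = 1.9 + 0.05 * career_rank + rng.normal(0, 1, n)
gap = ideals - intentions

for name, y in [("ideals", ideals), ("intentions", intentions), ("gap", gap)]:
    fit = sm.OLS(y, sm.add_constant(career_rank)).fit()
    beta, (lo, hi) = fit.params[1], fit.conf_int()[1]
    print(f"{name:>10}: effect of +3 ranks = {3*beta:+.3f} "
          f"(95% CI {3*lo:+.3f} to {3*hi:+.3f})")
```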

Rather than go through each other life priority individually, we can zoom straight to the kill and see if “Time with family” priorities predict rationalization. If people are facing competing priorities and just rating family lower and thus downgrading intentions below their ideals, time with family should predict rationalization!

Alas, it does not. Women who value family more do have higher ideals and intentions, but they have no difference in the extent of rationalization of intentions. To the extent women are simply preferring to prioritize other stuff besides fertility and family, it is not driving intentions down disproportionately below ideals.

It turns out, intentions undershooting ideals has little to do with these competing priorities.

It does have to do with some things, though. The priority assigned to hobbies is kinda almost in the ballpark of significant, which would suggest that leisure preferences may compete with children. But elsewhere, I specifically asked women whether a "desire to maintain leisure time" was a factor impacting their fertility decision-making: there was no meaningful association between leisure-ism and rationalization of intentions. There was a statistically weak but qualitatively large association between rationalization and worries about household indebtedness, however, so financial factors probably matter. Women who report having caregiving responsibilities (such as to elderly family members) also reduce their intentions further below their ideals than other women, so it seems like caregiving duties push intentions below ideals.

But by far the largest effects are among women who report being far from extended family or women who report lacking a suitable partner: these two concerns have a very large effect on rationalization. Self-reported biological difficulty conceiving is also a component, though similar in size and significance to the effect of debt.

Rationalization may relate to many things, then, but it usually isn't strongly related to "competing preferences"; it does seem to be related to "life not working out the way you hoped." It's largely women saddled with caregiving duties for elders, women in debt or with precarious jobs, women far from helpful family members, women who have not been able to find a partner, or women unexpectedly encountering fertility difficulties (these last women also tend to be older).

So when we see the gap between ideals and intentions, it is a bad thing. It is not “women getting what they really want,” it is “women not getting what they want.” We should not treat expectations or intentions as the “true preference” and ideals or desires as castles in the air; they represent real desires and likely real baselines for identifying subjective wellbeing. Intentions are already a “downgrade” vs. what people want, and intentions undershooting ideals often reflects adverse outcomes.

Those outcomes aren't all financial, so child allowances may not address all of them. And this has gotten way too long already for me to do a whole bit on what's driving kinship proximity, elder caregiving, infertility, or delayed marriage. But suffice it to say, focusing on intentions or expectations at the expense of other measures is a mistake when trying to figure out what people want, i.e. where to look for lost subjective wellbeing.

Fertility Forecasts

I’m just gonna laugh at Scott’s “well but he didn’t provide sensitivity testing.” First of all, for my clients who pay for this stuff, you betcha there’s sensitivity testing, alternative scenarios, etc. But that’s beyond the budget of the typical reader. Again, I do my writing about political pronatalism as a side dish to my actual career as a demographic forecaster. I can’t just put all my work out there for free. But if Scott wants to work from the NLSY1997 or the CPS2018 completion estimates and provide some ASFRs for fertility from the latest data until completion, he’s welcome to have at it. I think he’ll find it takes quite optimistic assumptions to get there from here.

Secondly, sorry, did I miss the memo where people need to be doing sensitivity analyses to tweet graphs? Isolated demands for rigor are a means for stifling debate, not an elevation of it.

Conclusion

Scott compared two small and likely not-quite-representative longitudinal surveys which asked facially similar, but in reality quite different, questions about fertility expectations to randomly-selected-but-biased samples of women in 1979/82 and 2001. Observing that there was not a big change, he published an article, jumping from “no big change in these surveys” to “problems that do not exist.” Yet his actual estimates of incompletion are comparable to my estimates from many other surveys: estimates which do indeed show a worsening problem.

There were other problems with his approach: for example, he's comparing incomplete measures of fertility, and his measure of preferences isn't a valid measure of preferences at all. In so doing, he chooses to focus his attention on these two surveys at the expense of numerous other surveys and actual vital registration data, which point to a different trend. Indeed, what we are observing in America today is a decline in fertility which is much faster than the decline in fertility intentions, a decline in fertility intentions which is faster than the decline in fertility ideals, and a gradual decline in fertility ideals which itself may be associated with understandable sources of long-run pessimism: worries about overpopulation or climate change, worries about rising inequality, worries about stagnant or worsening life expectancy, etc.

There may be good reasons to oppose a child allowance (for example, a child allowance would almost certainly reduce labor force participation rates by a non-trivial amount), and Scott is right to challenge the narrative of economic pessimism that prevails in much discussion of the American economy. But ballooning home prices alongside explosive price-inflation in care and education all point to real cost-drivers in major cost-components of family life, even as real and measurable marriage penalties are empirically associated with lower marriage rates. Meanwhile, as noted, the troubling stagnation of US life expectancy vs. our peer countries over the last two or three decades points to a problem: U.S. income and productivity may be rising, but it is not yielding longer, healthier, or happier lives. Even as those of us interested in offering an apologetic for a dynamic, market-oriented mixed economy seek to defend the record of the last 30 or 40 years from naysayers who want to sabotage global trade or massively expand the reach of the state, we should also keep abreast of new problems like rapidly-falling fertility that speak to real cracks in the armor, and also note the existence of durable areas of non-improvement like life expectancy. These are areas where child allowances likely can help.

Note: Much of the data shown here, though not the DIFS data since that’s a proprietary survey for subscribers, is in a new file in the OSF link for responding to Scott.
