A Sad End

Lyman Stone
May 12 · 17 min read

So Scott Winship has responded to my response to his response to my response to his article. And he’s said it’ll be his last word on the subject. That’s disappointing because if you follow the actual lines of debate here, we’re really actually getting somewhere. It’d be a pity to let things drop when there’s actual progress being made!

I’ll take Scott’s piece by theme:

Tone and Discourse

Scott says:

“Lyman acknowledges he made several errors — some of which he found himself after looking at my data, some of which I subsequently found — but he doesn’t seem especially chastened or chagrined about it. (Sample response: “doh! that was a gaffe!”) He dismisses them because, well, he was in a hurry to respond.”

Scott says:

But instead of taking responsibility, Lyman, breathtakingly, suggests I should be grateful that he repeatedly falsely accused me of making errors, saying,

Scott says:

Lyman also thinks I criticized him “for engaging on twitter.” But my criticism was not that he uses Twitter. My criticism was that he often tries to make complicated empirical arguments with data, opaquely, on Twitter. I also criticized him for snark, which he’s doubling down on in his response (“Man, really gotta do your lit review!”).

Scott says:

It is never fun to have to defend empirical claims. In the best-case scenario, you’re right but you create an uncomfortable situation. In the worst-case scenario, you discover you’re wrong.

This I think all just speaks to a cultural difference. I see adversarial processes as basically good and helpful. Scott does not. Scott thinks I should be “chastened” or “chagrined” because I made an error. I simply don’t understand why I would feel that way; when one goes for a pickup game of baseball, one does not feel chastened because one sometimes has a strike. That is the game. There is no embarrassment in being wrong. Indeed, a low rate of errors implies that a person is taking far too few shots, and being too cautious. It feels like Scott thinks that there’s some kind of personal pride or reputation on the line here, when there is not; there is nearly nothing on the line, which we both should know since we’re posting on Medium and our stats are easy to check and these posts are not getting wide readership.

Scott says that in a worst-case scenario your empirical result is wrong. But this is obviously backwards. Being wrong is the best scenario, because if somebody proves you’re wrong then you nearly certainly acquired new knowledge. If you publish something you’ve been working on for a while, you already know what’s in it, and if nobody can show it’s wrong, then publication gained you little or no new knowledge. Nor do I agree that “defending an empirical result is never fun;” defending an empirical result is extremely fun, and I mean that in the very most literal sense that it is entertaining and enjoyable to argue about data. I’m not sure what’s supposed to be not fun about that. I chose it as my career precisely because I think it is fun. I don’t know why defending empirical results would create an uncomfortable situation. I don’t feel uncomfortable that I had errors in the spreadsheets I cobbled together to respond to Scott; as you can see (and as apparently Scott finds inappropriate), it is not bothersome or irritating to me to be proven wrong. Being proven wrong is practically the goal: we enter the intellectual debate demanding that the other side show us where we are wrong. That is the objective.

Scott says I should “take responsibility.” For what? For… being hurried and rushed? Sorry, I have a different job and a toddler. He says it’s my own fault I was in a rush; I could have taken my time to respond. Yes I could have, and in the process people would have read Scott’s article and seen that I had made no response to it; nonresponse is nonparticipation in the debate, and nonparticipation is abdication of one’s intellectual duty.

What would “taking responsibility” even look like? Perhaps publicly in an article saying I was hurried and rushed and made mistakes, and saying so several times at several different points? I already did that. I guess I was supposed to flagellate a bit more and feel terribly bad that I did the job of a public intellectual of putting a published idea through the wringer.

What is the aim here? Again, it seems like Scott just has a model of the intellectual process that is deeply foreign to me. When one publishes something, one hopes that people will throw rocks at it. That’s how you get better. When one enters the arena, one hopes to find somebody else on the other side. One does not ask the other team’s goalie to apologize for defending the net. Defeat, failure, error, they are only embarrassing if they come accompanied with malfeasance. Giving it a reasonably good go and losing a round isn’t embarrassing, and it’s not clear to me why Scott thinks it would be. I have no idea where Scott is getting these apparently quite personal stakes. Yes, Scott should be thankful that the only other researcher he cited and tagged took the time to engage with the article and mount a response. Peer reviewers are doing a worthwhile thing even if they do a poor job of it, because a poorly done review reveals good work as good! And look folks, there’s no debate at all that in terms of work product quality Scott’s files he shared have the spreadsheet I shared absolutely smashed. It’s good work!

But this also gets to the bit about “falsely accusing.” He makes it sound like I accused him of murdering someone. I accused him of making an invalid comparison. There are zero moral stakes here. Nobody’s reputation or standing is on the line. And as I’ll note below, there is not a substantive disagreement about the trend in question now that Scott has taken the time to review the data beyond the NLSY. So I really don’t get where Scott is getting this tone of personal attack from, especially since quite literally he is the one who chose to write about me and my work!

I will freely admit to being irritated. I think it was uncollegial of Scott to make zero effort to engage or even do a heads-up on the front-end. Scott says he really truly did mean “shots fired at everyone,” which is why I’m the only person he tagged or cited?

There’s also some weird confusion here. I really think this is a case where Scott just isn’t recollecting clearly what he wrote. He says:

“Lyman has written the most empirically-informed arguments of which I am aware that women are less able to meet their fertility goals, so I tagged him on Twitter and acknowledged his research (by linking to multiple pieces). I didn’t mention him in the article, because the piece would have grown too long had I expanded beyond my narrow point about the trend in the data I analyzed, and I wasn’t going to mention his work without critiquing it fairly.” (emphasis added)

Sorry, hold on. “I didn’t mention him in the article”?

This is just getting silly. Scott clearly isn’t even aware of what he himself actually wrote. He feels he was not provocative and intentionally targeting a colleague in an unfair way in part because he literally thinks he did not even mention me in the article. I do not know what to make of this.

Regarding snark, yes, I am a human alive in the 21st century; I am sometimes snarky. Price of the century. Maybe the 1990s were snarkless; I was in kindergarten. Regarding Twitter, I think Scott’s argument is “don’t make complicated arguments on Twitter.” That seems like a very bad argument for the extremely obvious reason that true arguments will be predisposed to be complex, since reality is complicated. Limiting a medium to very simple arguments will tend to bias it towards false arguments. Perhaps that explains a lot of Twitter! But while it may be the reality of Twitter, it is not good advice. Scott additionally thinks my arguments on twitter are “opaque.” Scott has often said my work is opaque, and on multiple occasions he’s made a point of telling me he finds my graphs hard to interpret. I am not sure this is quite the criticism Scott thinks. And when it comes to who is perhaps better able to communicate things on Twitter in ways people understand and find interesting, I would simply suggest that 21,160 is a larger number than 8,377.

Note: Scott correctly flags that I misidentified the affiliation of The Dispatch. There are so many different think tanks and publications out there that I can’t always keep track of whose affiliations are what. I assumed given the overlap in personnel that there was a formal affiliation between AEI and The Dispatch, but evidently there is not. My bad!


The big takeaway to me on the NLSY stuff is Scott has no defense, as in literally none. He doesn’t engage a single one of the critiques, and instead, I am not joking here, calls me an anti-vaxxer, which is really just a wild twist in the debate.

He says:

Lyman’s latest has more criticisms of the analyses I did using data from a survey known as the “NLSY.” He goes on about the problem of selective attrition (people who drop out of surveys being different than those who keep participating), but his view of whether inferences may be drawn from panel data (from surveys that follow people over time) is, to use a technical word, goofy. If researchers let attrition and nonresponse prevent them from analyzing survey data, there would be practically no survey data to analyze. Lyman has used the NLSY himself, as have hundreds of other researchers. This is a red herring.

Of course, I did not say people should not use the NLSY. I said we should not presume that the NLSY is representative. If you think a survey recruited in people’s teens is still a representative sample of 35-year-olds 20 years later, then I guess you’re free to believe that, but you’re clearly incorrect.

I did not say that inferences cannot be drawn from panel data. I specifically identified a kind of inference which can be drawn:

For all these reasons, it’s best to simply say, “We observe X among NLSY respondents,” rather than, “The NLSY suggests X happened in society generally.”

It’s literally right there. My argument is against generalization to the wider population based on the NLSY, not against analyzing the NLSY for what it is: a fascinating and interesting-in-its-own-right sample of two cohorts of people in America! So for example, if a researcher wanted to know, “What effect do state tax rates have on interstate migration?” they might use the NLSY, and they might conclude that among NLSY respondents, it is indeed the case that state tax rates influence interstate migration. But if we want to see what’s happening in the population on the whole, we would not use NLSY. So for example, for state tax rates, we would go to the NLSY to corroborate and demonstrate a mechanism for some relationship we separately observe at the general population level. We’d never want to use the NLSY to actually estimate the net migration rate in or out of a state; that would be insane and totally unreliable. Rather, we want to use the NLSY to see if a relationship we theorize exists in the general population can be validated in a longitudinal panel. This was especially important in the past when longitudinal administrative record linkage was very rare and difficult. As such record linkage is getting easier, it’ll be interesting to see if longitudinal surveys continue to draw research funding and interest. But more to the point, we should use NLSY as a discrete test of a specific mechanism or effect, not as a way to say what the general population traits are.

What Scott wants to do is say, “NLSY women are actually a good indicator of what is happening to all women.” I think that is a weak argument given the problems in the data. Attrition in the NLSY, and population change in the wider population outside of NLSY’s sampling frame, is such that the persisting sample becomes ever-more-unlike the population at large, and so increasingly unhelpful for making population-level generalizations. Rather, a better argument would be to match NLSY respondents by state to something like state-level generosity of child credits and exemptions, and show something like, “In the NLSY, we observe that increases in per-child state benefits in fact do not increase odds that women have a child.” Whether this would hold for the wider population would be ambiguous, but that’s where you get into lit review and meta-analysis. But using small-sample cohort studies is not a great way to estimate population parameters, and comparing across such studies is extremely precarious.

So I didn’t say, “Don’t use the NLSY!” I said, “Don’t try to estimate population parameters from the NLSY!” What Scott is trying to estimate is a population parameter.

Scott goes on to say:

His views on weighting (up- or down-weighting the influence of people in the data relative to each other to ensure the sample reflects the broader population of interest) are also…highly unusual. (I mean that not in the sense that Galileo had some highly unusual ideas about the solar system, but in the sense that Jenny McCarthy has some highly unusual views about vaccines.)

That’s it. That’s the whole response to my argument about attrition. I actually had to google who Jenny McCarthy is, and yeah Scott literally just compared the claim “Weighting on observables cannot correct for selection on variables of interest” to “Vaccines cause autism.” I can’t even wrap my head around that.

Folks, it’s an admission of defeat, and not an “I’m exhausted, let’s stop” admission but an “I have no rebuttal, let’s accuse you of thinking vaccines cause autism to distract attention” admission (for the record: vaccines do not cause autism). Rather than actually address the question, Scott dismisses it out of hand as “highly unusual.” Reader: it’s not unusual at all; I’m offering a pretty conventional account of what weighting can and can’t do, and weighting can’t fix sampling procedures that are selecting on the variables of interest!
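To make the weighting point concrete, here is a minimal sketch with invented numbers (nothing below is NLSY data, and the retention rates are pure assumptions). A toy population is split by education, the observable we weight on, and by whether a woman has a child, the variable of interest. Attrition depends on the outcome itself, and reweighting the surviving sample to the population’s education mix barely moves the estimate.

```python
# Hypothetical illustration: every number here is invented, not NLSY data.
# Keys are (education group, has_child); values are counts in the population.
population = {
    ("college", True): 200, ("college", False): 200,
    ("no_college", True): 360, ("no_college", False): 240,
}

# Assumed retention: staying in the panel depends on the outcome (has_child),
# not on the observable (education).
retention = {True: 0.9, False: 0.6}
sample = {cell: n * retention[cell[1]] for cell, n in population.items()}


def group_total(cells, group):
    """Total count in one education group."""
    return sum(n for cell, n in cells.items() if cell[0] == group)


def share_with_child(cells, weights=None):
    """Share of women with a child, optionally weighting by education group."""
    weights = weights or {}
    total = sum(n * weights.get(cell[0], 1.0) for cell, n in cells.items())
    with_child = sum(
        n * weights.get(cell[0], 1.0) for cell, n in cells.items() if cell[1]
    )
    return with_child / total


# Post-stratification: weight each education group back to its population size.
weights = {
    g: group_total(population, g) / group_total(sample, g)
    for g in ("college", "no_college")
}

true_share = share_with_child(population)           # 0.56
raw_share = share_with_child(sample)                # about 0.656
weighted_share = share_with_child(sample, weights)  # about 0.655
```

In this toy setup the weighted sample matches the population’s education mix exactly, yet the estimated share of mothers stays near 0.655 against a true 0.56, because the weights only see education and the dropout was driven by the outcome.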

Scott then says:

I’m not going to analyze his latest analyses because I don’t have time for this silliness. Lyman has swung and missed regarding the wrongness of my NLSY analyses too many times to count. He is, simply, to be dismissed here.

Lyman ends with a bunch of stuff I don’t view as that relevant to our dispute.

I mean I think I swung and missed, depending on how you count it, either 1 or 2 times. Which isn’t that high to count to.

But like I (and he himself) said, Scott isn’t bothering to engage the substance of the argument. He said we should take it off Twitter into a longer format, then when I do that, “He is, simply, to be dismissed here.” And there’s a bunch of stuff that isn’t relevant. Reader: it is relevant! Whether or not we should use intentions or ideals kinda matters! That Scott wants to argue that changes in “expectations” are dispositive of changes in preferences is kind of a big problem in his argument!

Fertility Completion and Preferences

My argument, from the beginning, has been that we shouldn’t directly compare NLSY1979 and NLSY1997 because they have meaningful differences. I then later argued, erroneously, that Scott had mismeasured the change over time. That argument of mine was wrong. But the initial argument, which had I been wise I would have stuck with, was correct. Directly comparing NLSY1979 and NLSY1997 is not valid. Scott did not respond to any of the critiques related to the validity of comparison.

But what’s irritating here is that while I don’t think the comparison is appropriate, as I showed previously, Scott’s estimates are entirely compatible with “things have gotten worse over the last two decades.”

The point changes in his mid-30s estimates and my late-40s estimates are totally compatible. Now, I did not expect Scott to respond so quickly, so I published my response at 2:30 AM, then woke up, ran some errands, got home and did some further edits earlier today, so it’s possible Scott didn’t see the latest version of the post on this particular point (note: the edits were all published before Scott’s response; I haven’t made edits since he published his response). I just didn’t think Scott would respond the next day, so I didn’t figure having a few tidying-up edits to my blog post would matter much.

But the point is, these estimates aren’t incompatible. One can believe:

  1. That NLSY estimates are not good indicators of population parameters in general
  2. That NLSY1979 and NLSY1997 should not be seen as directly comparable

and also

  3. That NLSY1979 and NLSY1997 appear broadly consistent with other estimates which may actually be better at identifying population parameters

And that is in fact my view.

But it gets even more irritating. Scott has spent all this time saying things haven’t gotten worse, but it turns out using his apparently preferred metric (a survey of high school girls compared to UN-estimated TFR), things actually have gotten worse!

That peak around 1997 happens to be almost exactly the NLSY1997 cohort, while that trough at the beginning is almost exactly the NLSY1979 cohort. So Scott has two cohorts: one surveyed near the historic low ebb, one at the historic peak, and says “nothing to worry about!”

But clearly, things are getting worse since 1997!

Scott says these are just projections. Yes, true, they are just projections. But when we are making policy decisions with decades-long effects, projections are what we have to use. I’m not sure how you can make policy for the future without relying on projections.

Here’s how actual historical TFR compares to actual fertility once women are done having kids. TFR here is coming from US vital statistics reports since 1935, a mix of US and state reports I have hand-tabulated 1915–1935, state reports I have hand-tabulated 1842–1915, and estimates from the IPUMS USA sample using the “yngchild” variable. For 1800–1850, I am using a TFR estimate similar to that reconstructed by Hacker (2003).

But I also adjust for child mortality. I account for child mortality using child mortality rates for the US from the CDC (1968–present) or Gapminder (1800–1967). Basically, I am deleting out the expected number of kids who would have died before age 15 at mortality rates in a given year, so that this series can be corrected for the huge change in child death rates that occurred over the sample. This helps us get a better sense of “final family size,” which is more likely to have a relationship to preferences than “crude number of live births.” I have aligned actual TFR in a given year and completed fertility to the year women were 25, as that’s generally around when peak fecundity occurs. My estimates are very similar in trend and level to those in Hacker (2003), which is one of the most important and heavily-cited historical reconstructions in the academic literature; I improve on it by fitting the mortality-adjusted own-child-based decennial estimates of TFRs and CFRs to annual-level data on births from the states that had systems of vital registration in operation at various points 1800–1935.
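A rough sketch of that mortality adjustment, assuming it amounts to scaling each year’s TFR by the share of children expected to survive to age 15 at that year’s death rates. The function name and the two input rows are hypothetical placeholders, not values from my actual series.

```python
def surviving_tfr(tfr, under15_mortality):
    """Children per woman expected to survive to age 15.

    tfr: total fertility rate (births per woman) in a given year.
    under15_mortality: probability a child dies before age 15 at that
        year's mortality rates (a number between 0 and 1).
    """
    return tfr * (1.0 - under15_mortality)


# Hypothetical inputs for illustration only: (TFR, under-15 mortality).
example_years = {
    1850: (5.4, 0.35),
    2019: (1.7, 0.008),
}

adjusted = {
    year: surviving_tfr(tfr, q) for year, (tfr, q) in example_years.items()
}

# Gap between the two years before and after the adjustment.
raw_gap = example_years[1850][0] - example_years[2019][0]
adjusted_gap = adjusted[1850] - adjusted[2019]
```

With these made-up inputs the raw gap of 3.7 births per woman shrinks to roughly 1.8 surviving children per woman, which is the sense in which the adjusted series tracks final family size and not crude births.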

TFR and CFR are not identical. CFR is more stable in many periods, but not all; for example, CFR declined by quite a bit more in the late 19th century than TFR did. But these values are extremely tightly correlated. Will CFR for women who are 25 in 2020 be 1.64, as TFR was? No. It’ll be higher. But not necessarily lots higher.
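For readers who want the distinction spelled out: TFR adds up age-specific rates across all ages within one calendar year, while CFR adds them up along one birth cohort’s life. Here is a minimal sketch using an invented rate table (a flat rate that drops once in 2010); the function names and the 15–49 age range are assumptions made for the example.

```python
AGES = range(15, 50)  # assumed childbearing ages 15-49


def period_tfr(asfr, year):
    """Sum of age-specific rates across ages within a single calendar year."""
    return sum(asfr[(year, age)] for age in AGES)


def cohort_cfr(asfr, birth_year):
    """Sum of age-specific rates along one birth cohort's diagonal."""
    return sum(asfr[(birth_year + age, age)] for age in AGES)


# Hypothetical rates: 0.06 births per woman at every age before 2010,
# 0.05 at every age from 2010 onward.
asfr = {
    (year, age): (0.06 if year < 2010 else 0.05)
    for year in range(1990, 2031)
    for age in AGES
}

tfr_2005 = period_tfr(asfr, 2005)   # 35 ages at 0.06
cfr_1980 = cohort_cfr(asfr, 1980)   # women who were 25 in 2005
```

In this toy table, TFR in 2005 is 2.1 while the cohort that was 25 in 2005 finishes at 1.9, since part of its childbearing happens after the rates drop. The two measures move together but are not the same number, and the gap can go in either direction depending on what rates do later.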

Keep in mind those CFRs include immigrant women, many of whom had their kids in other countries, so some of the CFR>TFR difference is just due to inconsistent frame of reference.

Now, look. I promised myself I would not be goaded into doing the whole ASFR forecasting exercise. But who am I kidding? I enjoy it too much!

So let’s take just one exemplary scenario for ASFRs going forward. Here are actual historic and just one possible future of crude birth rates at selected specific years of age:

You can see the COVID blip down in 2020 and 2021, and then a bump back up in 2022, then some return to trends after that. Seems fairly plausible!

We can then use data from the 2018 June CPS Fertility and Marriage Supplement (the latest for which I have the microdata) to look at the population distribution of women who have already had a given number of kids. Here’s what average parity-thus-far looks like by birth cohort of women in recent rounds:

We’ll focus on 2018 as the most recent baseline.

So essentially, we want to find women of a given year of age in 2018, and then add onto their fertility-thus-far the amount of fertility we still expect for them given our ASFR predictions. So for women who are 37 in 2018, we want to add to that the ASFR for 38 year olds in 2019, 39 year olds in 2020, 40 year olds in 2021, 41 year olds in 2022, and so forth until age 50. For simplicity, after 2035 I simply straight-line the ASFRs. Here’s what that yields:
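The cohort-completion step can be sketched like this. It is a simplified illustration, not my actual spreadsheet: the function name, the flat 0.01 forecast rate, and the starting parity of 1.9 are all hypothetical.

```python
def projected_completed_fertility(age_in_base, parity_so_far, asfr,
                                  base_year=2018, last_age=50):
    """Add forecast age-specific rates to a cohort's fertility so far.

    asfr maps (year, age) -> expected births per woman at that age in
    that year. A woman aged `age_in_base` in `base_year` picks up the
    rate for age+1 in the next year, age+2 the year after, and so on
    through `last_age`.
    """
    total = parity_so_far
    for age in range(age_in_base + 1, last_age + 1):
        year = base_year + (age - age_in_base)
        total += asfr[(year, age)]
    return total


# Hypothetical forecast: a flat 0.01 births per woman at every age 38-50
# in every year 2019-2031.
asfr_forecast = {
    (year, age): 0.01
    for year in range(2019, 2032)
    for age in range(38, 51)
}

# Women who are 37 in 2018 with an assumed average parity so far of 1.9.
cfr_37 = projected_completed_fertility(37, 1.9, asfr_forecast)
```

Here the 37-year-olds add thirteen more years of forecast rates (ages 38 through 50) and end at 2.03. Repeating this for every age in the 2018 baseline gives a projected CFR for each cohort.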

Plugging this all into the graph above, here’s what we’re looking at:

Is this what will actually happen?

No! Because CPS obviously has sampling error, and fertility rates will not be what I’ve forecast, and there will be immigration, emigration, and mortality, all creating inconsistencies in the sampling frame. But the reality of the situation probably won’t be too far from what I’m showing there, barring some really quite extraordinary unforeseen changes.

Ergo, it’s very, very likely that CFRs will decline.

And as an aside, we should not act as if delayed fertility is some zero-utility-loss proposition. Having kids later, when you are in poorer health as a parent, and having fewer healthy years with grandchildren, and ultimately fewer years of life in general shared with descendants, is plausibly a welfare loss even if ultimate desired parity is met. Now, there are also welfare gains associated with delay: possibly more economic stability, for example, or more relationship stability. My point isn’t that having kids is necessarily better, just that we should not act like it’s irrelevant when people have kids; for some people delay may be good, but for others the welfare costs may be quite steep.


I’m disappointed how all this ended up, but also confused. Scott says he wishes we could have found some way to collaborate. Me too. One way we could have collaborated would have been if Scott had perhaps emailed me at some point and said, “Hey, I’m doing this thing, what do you think of it?” instead of writing an article on the exact topic AEI commissioned a report from me on, identifying me by name, then tagging me in the tweet thread with additional critiques therein. Usually collaboration begins with that sort of thing.

But look, Scott ended his response by quite literally copying and pasting the conclusion to his prior response. So what happened here is we shared data, Scott wrote a response, I wrote a detailed and thoughtful response to that, and Scott responded with swearing, dismissal, refusal to even address the arguments, and finally his own cut of the data showing that his data happens to coincidentally sample women at the two most extreme points in the data, and that by almost any reasonable metric, things have gotten far worse since the NLSY1997 wave. Which is to say, if you want to know what things are problems in 2021, maybe perhaps don’t use a survey with “1997” in its title. Such surveys are great for figuring out what happened in the past or for demonstrating a theorized mechanism, but they just aren’t barometers of “how things are going now.” As a result, they have next to no relevance for “should we have a child allowance now.”

In a State of Migration

People Move. I Ask Why.
