COVID-19 in the USA: A state-by-state investigation

2020 in context of historical data. Do the official numbers tell the whole story?

George Stein, PhD
14 min read · May 9, 2020

In this post we look at the total number of deaths this year in relation to historical mortality trends in the 40 states with the most complete mortality data. To put coronavirus fatalities in context, we show how the number of deaths this year compares to the 2018 flu season in each state (the worst in a decade), as well as other years on record. We aim to let the data tell the story, without any political agenda. By comparing 2020 mortality from all causes to the official coronavirus death toll, we find in a few states clear evidence for an excess of deaths above those officially attributed to the virus, but importantly, we do not (yet) find this excess in all states. The Centers for Disease Control and Prevention (CDC) releases new data once a week, on Friday. This post covers data up to August 9th, and we will be updating the post every few weeks with new CDC information (last updated Sept 4th). This post is co-authored by Zarija Lukić.

The true death count of the coronavirus has been found to be significantly higher than the reported number in some countries. This under-reporting is not the result of some conspiracy or any carelessness of the reporters, or negligence of the hardworking medical professionals, but simply arises because, for an individual death, it can be uncertain (due to incomplete testing) or ambiguous (due to presence of other illnesses) if COVID-19 was the cause of death. While it is difficult to determine the exact cause of death, it is not so difficult to count the total number of fatalities that occur from all causes. The number of fatalities, or total mortality, tells us how many more deaths are occurring during the COVID-19 pandemic than during a typical year, and gives us a much better understanding of the situation than the official coronavirus numbers alone.

Stateside investigations have been widely reported, with evidence that a number of states have seen an increase in deaths above the historical average — in some cases above what is accounted for by official coronavirus statistics. We don’t want to speculate here whether these excess deaths have occurred directly as a result of the viral infection, or indirectly due to, perhaps, the mass disruptions of the healthcare system. But significant numbers are likely to be caused by the virus itself¹.

How does 2020 compare with the last 6 years?

We can estimate the impact of the pandemic state-by-state² by comparing the number of deaths that have occurred this year resulting from all causes, provided by the CDC, to those deaths officially attributed to COVID-19. Note that the available data is not yet complete and results are subject to change as data continues to roll in. Also note that the available data, which covers only up to August 9th, does not include the full extent of the pandemic, still ongoing as of this writing. Using the historical mortality data from 2014 through 2019 as a baseline we can ask two big questions:

  1. Are more people dying than usual and if so, do official coronavirus statistics account for this increase?
  2. How does the weekly death count compare to a bad flu?

To answer these we’re going to examine a number of illustrations, such as the following New York City example of Figure 1 that we walk through now:

To determine if more people are dying than usual we construct a baseline from the 2014 through 2019 weekly mortality, which gives an historical range of the weekly death rate (the blue region in Fig 1). Barring any unforeseen external forces, it would be reasonable to expect the death rate for a typical week to be somewhere in this range. Of course, 2020 had a massive unforeseeable force introduced — the coronavirus. We cannot say for certain whether the weekly deaths this year would have been near the low or high end of the historical range, but we can use the minimum, average, and maximum values of it as a notion of statistical uncertainty. It is only through this uncertainty that we can comprehend the significance of the large number of deaths reported. For instance, if we see 1,000 deaths above the monthly historical average, we need some historical context in order to understand whether this is an alarming number, or just a fluctuation consistent with previous years.

The difference of this historical baseline and the 2020 weekly mortality (the red line in Fig 1) then gives an estimate on the number of extra deaths that occur each week over the historical expectation. These are all of the deaths occurring in each state each week, and do not depend on knowing anything about the coronavirus. For example, we can see roughly 5000 more fatalities (6000 this year minus the 1000 as expected from previous years) in NYC in the week of March 29th to April 4th, and 6250 more fatalities in the week ending with April 11.
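The subtraction described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the figures: the function name `weekly_excess` and the six baseline values are hypothetical, chosen so that the historical average is 1000 deaths, as in the NYC example.

```python
from statistics import mean

def weekly_excess(deaths_2020, historical):
    """Excess deaths for one week, relative to the historical range.

    deaths_2020: deaths from all causes in that week of 2020
    historical:  deaths in the same week of each baseline year (2014-2019)
    Returns (low, mid, high): excess over the historical maximum,
    mean, and minimum respectively.
    """
    low = deaths_2020 - max(historical)    # most conservative estimate
    mid = deaths_2020 - mean(historical)   # excess over the average year
    high = deaths_2020 - min(historical)   # least conservative estimate
    return low, mid, high

# Hypothetical baseline values (not real CDC data), averaging 1000 deaths
historical_week = [950, 980, 1000, 1010, 1050, 1010]
low, mid, high = weekly_excess(6000, historical_week)
print(low, mid, high)
```

Reporting all three values, rather than only the excess over the mean, is what lets us treat the historical range as a rough statistical uncertainty.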

Then we bring in the available information about the coronavirus, namely its official death toll (including probable deaths¹), and subtract the official coronavirus fatalities from the total number of deaths. If all extra deaths are accounted for as coronavirus fatalities, then after this subtraction the 2020 mortality should look similar to other years. If it doesn’t, and the black line in Figure 1 is still above or below the historical expectation, we have a “mortality gap”.
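As a minimal sketch of this definition, the mortality gap over a set of weeks is the total mortality, minus the official COVID-19 deaths, minus the historical baseline. The function name `mortality_gap` and all numbers below are hypothetical and only illustrate the arithmetic.

```python
def mortality_gap(total_2020, covid_official, baseline):
    """Sum over weeks of (all-cause deaths - official COVID-19 deaths - baseline).

    All three arguments are equal-length lists of weekly counts.
    A positive result means deaths above the baseline that the
    official COVID-19 numbers do not account for.
    """
    return sum(t - c - b for t, c, b in zip(total_2020, covid_official, baseline))

# Hypothetical weekly counts (not real CDC data)
total = [3000, 6000, 7250]      # all-cause deaths in 2020
covid = [1500, 4200, 5000]      # officially attributed to COVID-19
hist_mean = [1000, 1000, 1000]  # historical average for the same weeks
hist_max = [1050, 1050, 1050]   # historical maximum for the same weeks

gap_vs_mean = mortality_gap(total, covid, hist_mean)
gap_vs_max = mortality_gap(total, covid, hist_max)
print(gap_vs_mean, gap_vs_max)
```

Comparing against the historical maximum gives the most conservative value: if the gap is still positive there, it is positive for any baseline within the historical range.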

Figure 1: New York City mortality from available CDC data. The number of deaths in 2020 dwarfs any historical fluctuations, even when compared to the 2018 flu period — the worst flu in the last decade. A significant excess number of deaths — thousands of them — remain after subtracting all official deaths attributed to the coronavirus. The CDC continues to receive death certificates from the March to August period, so the shaded red and grey regions give an estimate of the full mortality. When all death certificates have been received, the true 2020 mortality is likely to reside near the top of the shaded region, but could also lie somewhere above or inside — we can’t say with certainty at the moment.

We can’t say for sure that these excess deaths are directly caused by the coronavirus disease, but it is not unlikely. In fact, a similar situation exists every year with the influenza virus, and the CDC of course knows that simply adding up coroner’s reports leads to an underestimate of influenza-related deaths. Thus, the CDC builds detailed models and uses statistical inference to estimate the number of excess deaths connected to the flu. In the case of COVID-19, this excess can be so large that one can estimate it without elaborate epidemic models by simply tracking total mortality from all causes, as we did above.

States hardest hit by the coronavirus so-far

A glance at Figure 2 reveals that New York City, New Jersey, Massachusetts, Michigan, Colorado, and Maryland show a significant increase in weekly mortality above any values seen in the past 6 years. For example, adding up the weekly deaths between March 15th and May 30th, New York City has seen a 285% mortality increase (i.e. 3.85 times as many deaths) over this period when compared to the historical expectation. This far exceeds the roughly 15% weekly increase that was seen during the January 2018 flu period, the worst flu in the last decade (the little bump in the bottom left of the figure).

Importantly, the CDC uses death certificates to construct their mortality numbers which can be delayed by up to 8 weeks, so the current data for March and April in all states is preliminary and will continue to increase. However, as an updated dataset is released every week we can get a good sense of how long it takes for all of the death certificates to be included for each state. We compare the most recent dataset to previous ones to construct an estimate of the incoming increase (see footnote 5 for details), and we show this in the appropriate figures as the shaded red and black regions. The true 2020 mortality is likely to reside near the top of the shaded region, but could also lie somewhere above or inside — we can’t say for sure at the moment. The current data is shown as the solid red and black lines.

After subtracting the COVID-19 official deaths we still find an excess in 2020 mortality over the historical expectation in most of these states, as illustrated by regions where the black line exceeds the historical data, shown in blue. This mortality gap gives an indication that the number of deaths due to the virus has gone under-reported in these states.

Figure 2: States that show the most significant increase in this year’s total mortality compared to historical expectations. The blue region with lines shows past years from 2014 through 2019. We emphasize 2018 as it had the worst flu in the last decade. The 2020 total mortality numbers from all causes (that the CDC has received as of the data release date) are shown by the solid red line. As March and April death certificates continue to roll in, there is some uncertainty as to what the true 2020 mortality will end up being — the red shaded region attempts to show how large this uncertainty is. The true 2020 mortality is likely to reside near the top of the shaded region, but could also lie somewhere above or inside — we can’t say for sure at the moment. The black line shows total mortality minus the COVID-19 deaths. Where the black line is above the historical data, it suggests that these states have an increase in deaths beyond what has already been attributed to COVID-19.

Data for other states

These states have already undergone extensive news coverage, but what does the same data tell us about the rest? We investigate the same effects in 32 other states (40 total). We do not include states where reporting is significantly lagging and data is still largely missing, as extending beyond the 40 that we focus on would require a highly speculative analysis. In Figure 3 we show the most populous states that were not included in the figure above. A few show more deaths in March and April 2020 than in any year in the historical dataset, but none show as serious an excess of deaths as seen in the figure above. Some also hint at a mortality gap, but again, most are not as significant as above. Keep in mind that the 2020 total mortality data could be more incomplete than expected (this especially applies to the drop off on August 9th), and that the most likely 2020 mortality count is illustrated by the top of the shaded red and black regions. Take a look at Figure 8 at the end of this post to see all 40 states side-by-side.

Figure 3: Mortality trends in a sample of the most populous states. The blue region with lines shows past years from 2014 through 2019. We emphasize 2018 as it had the worst flu in the last decade. The 2020 total mortality numbers from all causes (that the CDC has received as of the data release date) are shown by the solid red line. As March and April death certificates continue to roll in, there is some uncertainty as to what the true 2020 mortality will end up being — the red shaded region attempts to show how large this uncertainty is. The true 2020 mortality is likely to reside near the top of the shaded region, but could also lie somewhere above or inside — we can’t say for sure at the moment. The black line shows total mortality minus the COVID-19 deaths.

How many more deaths have occurred this year than usual?

The increase in mortality over the March 8th through August 9th period in comparison to the historical range is clearly apparent by eye in a number of states. At the same time, some states don’t show as noticeable an increase. This could be due to the low virus spread in those states and the shelter-in-place order, but we don’t want to speculate on reasons. Now, we move to a more quantitative analysis³⁴ (Table 1 near the end of this post summarizes this data).

We analyze the period from March 15th through August 9th. We first calculate the total number of deaths included in the mortality data over this period, and compare it to the expected historical range to estimate the recent increase in mortality. We then account for the official COVID-19 deaths, and determine the extent of the mortality gap for each state. These can be much more intuitive than simply looking at the number of coronavirus deaths that we’ve all seen already (and can see again in Figure 4).

Figure 4: Official number of COVID-19 deaths from The COVID Tracking Project (on a logarithmic scale) for the 40 states presented in this post as of our cutoff of August 9th, 2020. For NYC this count includes “probable COVID-19 deaths¹”.

We note that a few news outlets have taken a similar look at the mortality gap, but compared 2020 mortality only to the historical average, assuming that all deaths above this average are due to (or evidence for) the coronavirus. They should really consider the range of historical data: we cannot say that, without COVID-19, 2020 mortality would have been exactly equal to the average of the last 6 years. For this reason we compare all statistics to the historical minimum, maximum, and mean, as it is likely that 2020 would have been somewhere in this range in the absence of the coronavirus (although even this is still not guaranteed!).

Before estimating the mortality gap, we look at the total increase in deaths over the expectation. This does not depend on COVID-19 tests or reporting, only on the total deaths recorded by the CDC. In Figure 5 we show this total increase (in percent), where 0% means no change in 2020 mortality when compared to the average of previous years, -50% would mean half the deaths, and 100% would mean double. The error bars show the spread if we had taken the historical maximum and minimum instead of the mean. That is our estimate of the statistical uncertainty.
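The percentage and its error bars can be written down explicitly. The sketch below is illustrative only: the function names and the input numbers are hypothetical, and it assumes the totals have already been summed over the period of interest.

```python
def percent_increase(deaths_2020, baseline):
    """Percent change of 2020 deaths relative to a baseline total."""
    return 100.0 * (deaths_2020 - baseline) / baseline

def increase_with_range(deaths_2020, hist_min, hist_mean, hist_max):
    """Central value against the historical mean, with the spread obtained
    by swapping in the historical maximum (lower bar) and minimum (upper bar)."""
    central = percent_increase(deaths_2020, hist_mean)
    lower = percent_increase(deaths_2020, hist_max)
    upper = percent_increase(deaths_2020, hist_min)
    return lower, central, upper

# Hypothetical totals over the period (not real CDC data)
lower, central, upper = increase_with_range(2000, hist_min=800, hist_mean=1000, hist_max=1250)
print(lower, central, upper)
```

With these made-up inputs, twice the average number of deaths gives a central value of 100%, and half the average would give -50%, matching the convention used in the figure.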

As expected, New York City displays the most significant increase in mortality, roughly a 273% increase over the normal, or 3.73 times as many deaths. New Jersey is next with roughly a 100% increase in the total number of deaths over this period, followed by Massachusetts and Michigan with a 53% and 32% increase, respectively. We’ve again included an estimate of what this increase may look like when all death certificates have been included by the CDC. At the moment this is a speculative estimate of incoming future data, so the exact numbers will most likely be off.

Figure 5: Total increase of 2020 mortality over the historical average from March 15th through August 9th. Black dots show the CDC mortality data as-is, while red dots attempt to account for the incoming increase from death certificates not yet received. The error bars show the deviations from the historical minimum and maximum.

How big is the mortality gap?

In Figure 6 we show the mortality gap, determined by subtracting the deaths attributed to COVID-19 from the total number that occurred over the same period as above. We find that New Jersey, New York City, Illinois, Massachusetts, Michigan, Maryland, Colorado, Wisconsin, Virginia, and Arizona have a lower limit of the mortality gap that is greater than zero.

Figure 6: The mortality gap for March 15th through August 9th. Black dots show the CDC mortality data as-is, while red dots attempt to account for the incoming increase from death certificates not yet received (note this is a speculative estimate of incoming future data, and is not guaranteed). The error bars show the deviations from the historical minimum and maximum. States with a negative mortality gap likely have incomplete data, as we see a much larger difference between the current data and after accounting for death certificate delays, or they may actually have fewer non-coronavirus deaths than normal.

The mortality gap increases for most states when we account for the incoming increase due to delayed death certificates, and in this case many states then show a clear positive mortality gap. This is a speculative estimation on incoming future data, and is not guaranteed.

A significant concern in these types of analyses is the fact that the non-coronavirus related mortality in mid March to late April 2020 could be vastly different from historical values. What if coronavirus deaths are being counteracted by fewer deaths from “normal” causes during stay-at-home orders, for example from workplace and road accidents? Let us estimate the outcome of this scenario.

As a matter of fact, “unintentional injuries” are the third leading cause of death in the US — ranging from 34 deaths per year per 100,000 inhabitants in California, to 67 deaths per year per 100,000 in Kentucky. So it is not unreasonable to speculate that there may be more COVID-19 deaths than shown by analyses like the one performed above, but they are hidden by the decrease in deaths from normal causes. We can approximate the effects of the stay-at-home order by removing the expected number of accidental deaths for each state from the historical data, assuming they are constant over the year. This is probably an extreme assumption, as it is unlikely that all accidental deaths have entirely disappeared during the stay-at-home order, but it gives an idea of how large this effect can possibly be.
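A rough sketch of this adjustment follows. It is an illustration under stated assumptions rather than our actual code: the function names are hypothetical, the year is approximated as 52 weeks, and the population and baseline values are made up.

```python
WEEKS_PER_YEAR = 52  # approximation; accidental deaths assumed constant over the year

def expected_weekly_accidents(rate_per_100k_per_year, population):
    """Expected accidental deaths per week for a state."""
    return rate_per_100k_per_year * population / 100_000 / WEEKS_PER_YEAR

def remove_accidents(historical_weekly, rate_per_100k_per_year, population):
    """Lower the historical baseline by the expected weekly accidental deaths,
    i.e. assume (as an extreme case) that no accidental deaths occur at all."""
    weekly = expected_weekly_accidents(rate_per_100k_per_year, population)
    return [deaths - weekly for deaths in historical_weekly]

# Hypothetical state: 1,000,000 inhabitants, 52 accidental deaths per 100,000 per year
weekly_accidents = expected_weekly_accidents(52, 1_000_000)
adjusted = remove_accidents([1000, 1100], 52, 1_000_000)
print(weekly_accidents, adjusted)
```

Because the baseline drops by the same amount every week, the mortality gap grows by the weekly accidental deaths times the number of weeks considered, which is why the effect scales with a state's population.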

This approximation has a noticeable effect on the mortality gap that we calculate. We show this in Figure 7, where the gap has increased for each state. This increase is strongest for the states with the highest population, as they would have had the most accidental deaths. But the adjustment does not dramatically alter the results reported above.

Figure 7: The mortality gap for March 15th through August 9th, assuming no accidental deaths during stay-at-home orders. Transparent markers show the result before the removal of accidents. Black dots show the CDC mortality data as-is, while red dots attempt to account for the incoming increase from death certificates not yet received. The error bars show the deviations from the historical minimum and maximum. States with a negative mortality gap likely have incomplete data, as we see a much larger difference between the current data and after accounting for death certificate delays, or they may actually have fewer non-coronavirus deaths than normal.

Wrapping up

In this post we focused on mortality data through August 9th, 2020, for states with reasonably reliable publicly available CDC data. We’ve seen that some states show a significant increase in deaths over the historical expectations, and even a large mortality gap, while some states are still within expected mortality levels at this stage of the data. There are a number of things to keep in mind when digesting the types of analyses presented here:

  1. Data can be incomplete. This data is incomplete. Completeness varies state-by-state, and only time will provide the true mortality impact. Importantly, the mortality from August 10th until today may not follow the same trends.
  2. This analysis does not provide any conclusions on the nature of the virus itself. It only looks at the compounded effects of the spread of the virus, age distributions of individual states, overloaded medical systems, stay-at-home orders, data incompleteness, etc.
  3. To know what 2020 truly would have looked like without the coronavirus pandemic is impossible, and comparing to historical data therefore only provides a sense for the magnitude. We need to consider the full range of historical possibilities — not just the historical average values — in order to make any realistic statements about “a 2020 without the coronavirus”. Additionally, 2020 mortality from non-coronavirus related causes could be much different than historical expectations.

We finish with Table 1 summarizing the analysis with the currently available data, Table 2 showing what this might look like after all death certificates have been received, as well as Figure 8 showing mortality for all 40 states.

The work presented here should be considered the opinion of the authors and not necessarily that of the US Dept. of Energy, the University of California or the Lawrence Berkeley National Laboratory.

Table 1: This year’s March 15th through August 9th period compared to the range of historical mortality values. We compare to the minimum number of deaths each week seen over this period, the average number, and the maximum number. This is a likely range for what 2020 would have looked like in the absence of the coronavirus. Here we show the possible range of excess deaths in percent and absolute number, and “Gap over” is the mortality gap (the difference between 2020 mortality with official COVID-19 deaths subtracted and the historical value).

Table 2: This year’s March 15th through August 9th period after accounting for delayed death certificates, compared to the range of historical mortality values. We compare to the minimum number of deaths each week seen over this period, the average number, and the maximum number. This is a likely range for what 2020 would have looked like in the absence of the coronavirus. Here we show the possible range of excess deaths in percent and absolute number, and “Gap over” is the mortality gap (the difference between 2020 mortality with official COVID-19 deaths subtracted and the historical value).
Figure 8: Mortality trends in all states for which the available data is believed to be reasonably complete, sorted by population. The blue region with lines shows past years from 2014 through 2019. We emphasize 2018 as it had the worst flu in the last decade. A few states did not implement a state-wide stay-at-home order, although individual counties may have. The 2020 total mortality numbers from all causes (that the CDC has received as of the data release date) are shown by the solid red line. As March and April death certificates continue to roll in, there is some uncertainty as to what the true 2020 mortality will end up being — the red shaded region attempts to show how large this uncertainty is. The true 2020 mortality is likely to reside near the top of the shaded region, but could also lie somewhere above or inside — we can’t say for sure at the moment. The black line shows total mortality minus the COVID-19 deaths. Where the black line is above the historical data, it suggests that these states have an increase in deaths beyond what has already been attributed to COVID-19.

Footnotes

  1. For example, on April 9th, New York City increased its COVID-19 death toll by more than 3,700 to account for these ‘probable’ fatalities: people who did not have a positive COVID-19 laboratory test, but whose death certificate lists “COVID-19” or an equivalent as the cause of death. Other states may soon follow.
  2. Given the significant focus of other investigations on NYC and NY state, we only show NYC here, and will refer to it as a state for ease. We also refer to the District of Columbia as a state. The other areas shown are actually states. For NYC we also use a separate dataset for COVID-19 deaths: the official NYC Health data, updated daily.
  3. We additionally scale the historical CDC weekly mortality to account for population changes over time. For each year we calculate the population growth using census data between the year in question and 2019, and multiply the weekly number of deaths by this factor. For example, if the population is doubled between 2014 and 2019, we multiply the number of deaths in each week of 2014 by a factor of 2. As well, each “year” of the CDC data does not start on the same day, so we shift the data accordingly, interpolating between neighbouring weeks when necessary.
  4. Whenever the dataset reports <100% complete data we scale up the number of deaths by the inverse of the completeness. For example, if the completeness is listed as 80% we multiply that week’s mortality by a factor of 1/0.8=1.25.
  5. We assume that the cumulative multiplicative update from the previous dataset to the current one applies into the future. That is, if in the most recent data the number of deaths in the leading week of the previous dataset increased by 20%, the second week by 10%, the 3rd week by 5%, etc., then we assume the number of deaths in the leading week of this dataset will go up 20% next week, then 10% the following week, and so on. This is a significant assumption, especially in the middle of a pandemic, and the true deaths may end up higher still. But it at least seems to hold over the past few data releases. Additionally, we assume that the reported COVID-19 deaths are complete up to August 9th.
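The corrections in footnotes 3 to 5 can be sketched as follows. This is one possible reading of the procedure, not our actual code: all function names and numbers are hypothetical, weekly series are listed newest week first, and interpolation between shifted weeks is left out.

```python
from math import prod

def scale_for_population(weekly_deaths, pop_that_year, pop_2019):
    """Footnote 3: scale a past year's weekly deaths to the 2019 population."""
    factor = pop_2019 / pop_that_year
    return [d * factor for d in weekly_deaths]

def correct_for_completeness(deaths, completeness):
    """Footnote 4: scale up by the inverse of the reported completeness (0 to 1)."""
    return deaths / completeness

def update_factors(previous_newest_first, current_newest_first):
    """Footnote 5: growth of each week between two consecutive data releases.

    Week i of the previous release sits at index i + 1 of the current one,
    because the current release has one additional leading week.
    """
    return [cur / prev
            for prev, cur in zip(previous_newest_first, current_newest_first[1:])]

def project_final(current_newest_first, factors):
    """Footnote 5: apply all future updates a week is still assumed to receive.

    A week at index `age` is assumed to grow by factors[age], then
    factors[age + 1], and so on; weeks older than len(factors) are unchanged.
    """
    return [deaths * prod(factors[age:])
            for age, deaths in enumerate(current_newest_first)]

# Hypothetical releases (not real CDC data), newest week first
previous = [1000, 2000, 3000]
current = [900, 1200, 2200, 3150]
factors = update_factors(previous, current)   # roughly 1.20, 1.10, 1.05
projected = project_final(current, factors)
print(factors, projected)
```

In this reading the leading week receives the full cumulative update (about 1.2 x 1.1 x 1.05), while older weeks receive progressively less, which is why the shaded regions in the figures are widest for the most recent weeks.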

Acknowledgements

Thanks to Zarija Lukić for many discussions and additions on this analysis, and Dana Simard for lots of much-needed editing help!

George Stein, PhD

Postdoc at the Berkeley Center for Cosmological Physics (BCCP). Working on cosmology, data science, and machine learning methods.