Time Series Analysis of Italian Mortality Data

How deadly is COVID-19?

Chirag Modi
BCCP — UC Berkeley
13 min readApr 21, 2020

--

This post is part of a series describing work by us at Berkeley Center for Cosmological Physics (BCCP, UC Berkeley) on determining the impact on COVID-19 with data science methods. In this post, we outline our methodology for the time series analysis of Italian mortality data. Our analysis suggests that the total number of deaths has been underestimated by up to a factor of 2. An accompanying post goes in detail on our results and conclusions. The submitted medRxiv version of our paper (under peer-review) is here. Our data, analysis-code and updated version of the paper is on our github. Github post will be continuously be updated as and when new data is available.

Over the last few weeks, there is increasing evidence that in addition to the infection rate of COVID-19, the reported COVID-19 fatalities are severely under-estimated as well. When compared with historical mortality data, many countries and regions around the world (Italy, US, UK, Switzerland, Netherlands, and others) show a significant excess in the number of deaths that are expected around this time of the year. This excess seems to coincide with the pandemic, but the number of deaths officially attributed to COVID-19 do not completely account for this excess. Figure 1 shows this for Lombardia in Italy, which is one of the worst hit regions in the world.

Figure 1: Mortality due to all causes in Lombardia for the first 12 weeks of the year. Compared to 2015–2019, there an obvious excess after February 22. This is when Italy reported its first death due to COVID-19. The observed excess is not completely accounted for by the reported COVID-19 deaths. An important thing to note is that the historical mean over-estimates the pre-pandemic mortality in 2020.

To understand the true impact of COVID-19, we need to accurately quantify this excess. However this is challenging since it requires estimating the expected number of deaths that would have occurred in 2020 in the absence of COVID-19 (counterfactual). In the simplest case, we might be tempted to compare it with a simple mean of deaths from the past years. However as we note in Figure 1 and as we discuss below, a simple mean prediction for expected deaths likely gives biased results. Instead, we propose a propose time-series analysis to estimate this counterfactual and predict the excess deaths.

We do this analysis for regions in Italy and find an excess mortality of 52,000 ± 2000 by April 18 over the expected deaths based on the historical data. This is a factor of 2 larger than the officially reported COVID-19 deaths, which are ~23,000. We also use this estimated mortality to understand the lethality of COVID-19 itself by estimating age-dependent infection fatality ratios (IFR) and the infection ratios in region of Italy. In this post, we focus on the methodology of this analysis and detailed results are in an accompanying post in this publication.

Data

We perform the analysis for all the 20 regions in Italy, one of the hardest hit places in the world. We use the total mortality data due to all causes provided by the National Statistical Institute of Italy (Istat) for the period of January 1st to March 28th, from 2015–2020. It is a complementary dataset and independent of the official COVID-19 reported deaths, thus less affected by unknowns such as the completeness of tests. However it is not 100% complete and is not a random sample of towns. We make some assumptions detailed in footnotes¹ to account for this.

Estimating the expected Mortality in Italy?

Italy reported its first death due to COVID-19 on February 22nd. We assume that the impact of pandemic on the deaths in Italy was insignificant prior to this. Then our strategy to estimate the expected mortality is simple — we use the historical data to match the mortality in every region before the COVID-19 pandemic (Feb. 22) and then use it to predict the expected deaths in the absence of the pandemic. We estimate excess deaths over these predictions and compare them with COVID-19 reported deaths.

Beyond expected mean estimate

Many articles that have pointed out the excess deaths over the historical trend have simply assumed that the expected number of deaths is well modeled by the historical mean. However, looking at this year’s mortality before the onset of the pandemic and comparing it with the historic mean, we notice that it lies consistently below the mean. A similar trend is observed in other data sets, such as the one published by Swiss authorities. If we simply take the mean, we ignore precious additional information that is contained in this year’s mortality rates before COVID-19 started. In general we expect the mortality between different weeks not to be completely independent: A severe flu season, for example, leads to an increase in mortality over the course of several weeks with a characteristic shape (i.e. it correlates subsequent weeks). This year’s data suggests that the flu season has been mild (or had already passed by January), which means that the expected death rate for this year (without the pandemic) would have likely been lower than the mean. A simple historical mean thus likely over-estimates the expected deaths and hence under-estimates the impact of COVID-19.²

How can we consistently account for this additional information?

The novelty of our strategy is to build a model for this year’s mortality from the historic data and this year’s pre-pandemic mortality. This model is built to match the 2020 mortality before the pandemic as accurately as possible. We then use our model’s prediction to estimate the expected deaths post February 22nd — i.e. to estimate a 2020 without COVID-19.

We use two different methods for constructing such a model:

Conditional Gaussian Process (CGP)

For this we assume that the data follows a multivariate Gaussian distribution. We estimate this distribution from the historic data. Specifically, we take the historic mean from 2015–2019 to be the mean of this distribution. To estimate the covariance matrix, we use the first two principal components since they account for 90% of the variance in the observed data while minimizing the error due to small sample size.

For our prediction, we treat the 2020 mortality data as a sample vector drawn from this multivariate Gaussian distribution. Then, using the fact that we have observed pre-pandemic data for 2020, we can condition on these components to update our expectation post pandemic. From this we get the conditional expected number of deaths after the beginning of the pandemic. At the same time, we also get a conditional covariance matrix which gives an uncertainty on this prediction.

Synthetic Controls method (SCM)

This is an ad-hoc data driven method, with minimal assumptions regarding the underlying data distribution. This method estimates the counterfactual as a weighted combination of the historical data (control). To obtain this these weighing for various past years, we again try to minimize the difference between the counterfactual and the observed data for the pre-pandemic period.

We use the L2 norm of this difference vector between the observed and predicted mortality. This is akin to assuming a Gaussian, constant variance distribution for the pre-pandemic data. As a regularization, we put a positivity and unit L1 norm constraint on the weights. If we knew the error or noise for different observations, or the exact distribution of the counterfactual, this cost function can be modified accordingly to take them into account.

Once the weights are estimated, the features (weeks) in historical data corresponding to the post-pandemic period for 2020 are weighted the same way to predict the expected observations.

Advantages

These methods have important advantages over a simple historical mean

  1. They automatically account for hidden variables that might otherwise affect mortalities in a year.
  2. This takes into account seasonal variations. We see that this is important since if not for COVID-19, all data points to 2020 being a milder flu year.
  3. It provide an uncertainty estimate on our predictions. Based on this, even though we see a difference in means for ages below 50, we find that it is not statistically significant. These errors are also propagated to our IR and IFR estimates.

We show the prediction of these methods in Figure 2. For comparison, we show the historical mean. As we had expected, our predictions are a much better fit to the observed 2020 mortality data before the pandemic.

Figure 2: Validating predictions for the pre-pandemic data. We show the same weekly mortality for Lombardia as Figure 1, but with the CGP and SCM predictions in the absence of COVID-19 based on the historical data. The 1-sigma confidence interval for CGP estimate is also shown. Our predictions trace the observed pre-pandemic data better than the historical mean and hence are more accurate. The vertical line is Feb 22, where the pandemic hit.

Italian COVID-19 Mortality Count

We repeat this exercise for different regions in Italy to estimate the total death toll for Italy and compare it with the reported COVID-19 fatalities. Some of the most affected regions are shown in Figure 3. We have conservatively extrapolated our predictions up to April 18th which leads us to believe that our numbers underestimate the total COVID-19 mortality³. Despite these conservative extrapolations, our analysis suggests that the total COVID-19 fatality count is 52,000 ± 2000, compared to the official number of 23,227. In the Lombardia region, over 20,000 people have died as compared to reported, and in Bergamo over 6000 compared to official statistics of 2500.

Figure 2: Excess mortality for different regions in Italy compared to reported COVID-19 deaths. We show cumulative excess deaths, over the predicted counterfactual deaths from two different methods — SCM and CGP. We compare this with the reported COVID-19 deaths for the period since February 23rd (available COVID-19 data). We find that our estimated deaths exceeds COVID-19 reported deaths by multiple factors for every period and every region, even those which officially do not report huge death toll. We extrapolate the our predicted excess beyond March 28th with dashed-lines under conservative assumptions³.

Using mortality to estimate IFR for COVID-19 and IR for regions in Italy?

Having estimated the number of fatalities due to COVID-19 more accurately, we turn to important questions about the disease itself. We would like to estimate infection ratios and fatality ratios in Italy. To do this, we make primarily use of the defining relation

This relates the Infection Fatality Rate (IFR), with the Infection Rate (IR). The infection fatality rate is the probability of dying if one is infected. The infection rate is the fraction of the population that is infected. The IR is critical in determining social and policy issues since it reflects where we are in the trajectory of a pandemic — whether we are in the stage of exponential growth or herd immunity. Epidemiological models suggest, and we assume here, that the IR is constant across different age groups in a region.

With this relation we can go one of the two ways -

1. We can assume an IR for the population to estimate IFR.

For this, we consider the Test Positive Ratio (TPR).This is the fraction of tests that return positive to the total tests conducted. We assume that this is currently an upper bound on the IF. This is expected since given the current strategy of testing people only with severe symptoms, the total population is likely less infected than the tested population. Hence using this ratio as an upper bound for IR gives a lower limit on the IFR.

We use the testing data for Italy to estimate IFR for different age groups, and find consistent results across different regions. We estimate a lower bound on the total population IFR of 0.6% for Lombardia that varies from less than 0.5% under 50, to over 1.6%, 4.3%, & 9%, for age groups of 70–79, 80–89, and above 90, respectively.

2a. If we have a reliable, independent estimate of IFR, we can use it to estimate IR.

The Diamond Princess (DP) cruise ship had 100% testing and hence provides a mostly unbiased (with respect to the positive fraction due to different testing strategy) estimate of IFR. It tested 330 people to be positive for COVID-19 and so far, we have 11 deaths from this sample. We assume Poisson error distribution on these deaths. For a small mean (11), these errors are large and dominate our error budget. As a result, we use this large bin of 70–89 and do not consider 70–79 and 80–89 separately. Instead, we weigh the fatalities in Italian regions to match the age-distribution on DP. With 11 deaths in 330, we estimate an IFR of 3.3% for the above 70 age-group on DP. Combining this with the age-adjusted fatalities from aforementioned methods, we estimate the IR in these different regions. This estimate relies on the assumption of an age-independent IR.

Our estimated regional IR peaks in Lombardia at 23% (12%-41%, 95% c.l.), and for the province of Bergamo we even find 67% (33%-100%), corresponding to 2 Million and 0.5 Million infected people, respectively. This strongly suggests that Bergamo has reached herd immunity. Other regions with widespread infection are Emilia-Romagna 14% (7%-25%), Piemonte 10% (5%-18%) and Marche 11% (6%-20%).

2b. Estimates of IFR for other age groups, we can use it to estimate IR.

Under the same assumption of age-independent IFR, for the IR estimated above, we can then estimate the IFR for other age-groups by matching the IFR for 70–90 age groups to DP. These estimates for Lombardia are — 0.65% (0.32%-1.16%, 95%c.l.) for 60–70, 2.60% (1.30%-4.66%) for 70–80, 6.76% (3.37%-12.1%) for 80–90 and 14.7% (7.35%-26.3%) for age above 90 and consistent with our lower-limits.

Conclusion

Over the last couple weeks, we have realized that a lot of deaths due to COVID-19 are not included in the official statistics due to insufficient testing. Thus looking at the total mortality count from all causes as an alternative and doing a proper statistical analysis with it can be useful. The idea here is to compare the observed 2020 mortality with the expected 2020 mortality in the absence of the pandemic. However using the simple historical mean for the latter can be misleading. It also does not make use of a critical piece of information, that the mortality before the onset of the pandemic can be used to estimate the expected trend in mortality this year.

To make use of this, we propose two different methods -

  1. a Conditional Gaussian Process (CGP) method that makes an assumption on the underlying data distribution being multivariate Gaussian and returns a mean as well as an error estimate on the prediction.
  2. a Synthetic Controls Method (SCM) that makes minimal assumptions on the data distribution and matches the pre-pandemic observations with a weighted average of historical mortality.

We find good agreement between both the methods and this makes our analysis more robust. We do this analysis for Italy under conservative assumptions and estimate the total mortality count of COVID-19 to be higher than official numbers by a factor of 2.

We also combine this estimated mortality with the available COVID-19 data for Italy as well as data from Diamond Princess cruise to estimate the IFR for COVID-19 for different age groups, as well as the spread of infection in different regions of Italy. These estimates for IFR have highest significance from Lombardia and depend heavily on age groups, ranging from — 0.65% (0.32%-1.16%, 95%c.l.) for 60–70, 2.60% (1.30%-4.66%) for 70–80, 6.76% (3.37%-12.1%) for 80–90 to 14.7% (7.35%-26.3%) for age above 90. Our estimated regional IR peaks in Lombardia at 23% (12%-41%, 95% c.l.), and for the province of Bergamo we even find 67% (33%-100%).

Our detailed results are in this accompanying post. In the subsequent posts, we will use these estimates to look at the impact of COVID-19 in different regions like New York City where preliminary analysis suggests infection rate of 25% already.

Footnotes

  1. We have data from 1680 towns. Since complete data is only available for a subset of towns, we need to evaluate the completeness per region and re-scale our estimates to obtain regional mortality. We estimate this factor for every region to be the ratio of the sum-total population of the towns in our dataset with the total regional population, as per the 2019 population from Istat website. We validate this scaling by comparing the proportion of mortality in these towns with the region as well and find it to be consistent. We remove from the analysis all the regions with less than 10\% completeness. As we have no way to quantify the error associated with this, we will use the most complete region (Lombardia, 72% complete) and province (Bergamo, 74% complete) for much of the quantitative analysis in this article, but we show that other regions give consistent results. Furthermore, given that we find good agreement between reported COVID-19 deaths and our estimated excess fatalities for ages below 60 years, we expect this scaling to not bias our results significantly.
  2. Some arguments have been raised that due to the milder flu season and less flu fatalities, a large older population survived to be infected with COVID-19, hence resulting in more fatalities than otherwise expected. Without weighing the ethical and social merits of this argument, we only quantify that this number. As compared to the historical mean, Lombardia has 1532 less deaths by February 22 (compared to yearly variance of 680 for this period) while our estimated excess for COVID-19 fatalities are over 20000 (with standard error of 247) from February 22 to April 18.
  3. Though we have total mortality data only up to March 28th, we extrapolate to April 11th under the conservative assumption that the official reported COVID-19 mortality will catch up with the true excess mortality. After that, we assume reported COVID-19 fatalities accurately account for excess deaths. Given increased testing, the difference between these two numbers should decrease with time. However reaching exact matching by April 11th is probably still an optimistic assumption. Furthermore, given that the pandemic will last longer than April 18th (and given the time lag between infection and death), we believe that the numbers quoted here will still be an underestimate of the total COVID-19 mortality.
  4. In using IFR from DP, we have assumed that IFR in a given age group does not depend on other factors such as differences in co-morbidities or health care systems between the locations. This can be verified with tests performed at random or for an entire population, which are currently not available except for DP. It requires a very large number of tests to accumulate large enough fatality statistics.
  5. We assume the same IR for all the age groups in a particular region, inspired by epidemiological models. This could be to some extent verified by the total positive rate in tests as a function of age. Data for this exists but is unfortunately not published. However our age dependent population fatalities from the province of Bergamo, which serve as a lower limit to the IFR do not depend on this assumption since we assume IR=1. Bergamo province has very likely reached the herd immunity where IR is less likely to be age dependent, a situation very different from the low IR environments where one may expect more age dependence

--

--