COVID-19: Estimates of true infections, case fatality and growth rates in Germany

Clemens Schmid
Stephan Schiffels
Published Apr 2, 2020

by Clemens Schmid and Stephan Schiffels (both Max Planck Institute for the Science of Human History Jena)

Acknowledgements: We got some valuable input and corrections from Martin Lange and Johannes Boog (both Helmholtz Centre for Environmental Research Leipzig)

Disclaimer: We have no epidemiological training and share these results without warranty of any kind. They should not be used as a basis for decision making and we refer to the respected authorities (e.g. for Germany the Robert Koch Institute) for reliable information and models. This post is only an interesting exercise in data analysis.

Note: Analyses in this post are from April 2nd, 2020, and naturally include only data from before that date.

The COVID-19 pandemic has taken its toll all around the world and caused (so far) hundreds of deaths in Germany. In this post we present current data and model estimations for multiple relevant parameters (e.g. current number of real infections and number of future deaths) for Germany.

In the context of the #WirvsVirus hackathon we started to work on the R package covid19germany, which makes it possible to download and visualize the current numbers of confirmed cases and deaths by administrative unit. We use this package to access the data for this post. The code for this post can be found here. Furthermore, the package comes with a web app that lets users explore some of the following data and analyses in further detail: not just for the whole of Germany, but also for smaller administrative units as well as gender and age classes.

Quick overview of COVID-19 in Germany (2020-04-01)

The number of confirmed COVID-19 cases in Germany is rising daily, but it is unclear to what degree this reflects new infections rather than testing simply catching up with past infection events. Germany may be one of the countries where testing covers a higher proportion of infected people, as its testing capacities are comparatively good. Still, because testing will always lag behind the actual number of infected, it remains an unreliable estimator of the true dimensions of this pandemic. The number of deaths caused by COVID-19 is a more trustworthy indicator, though one with a significant temporal delay. More about this later.

Evolution of new daily and cumulative cases in Germany by federated state (Bundesland)

The numbers of infected and deaths follow the accelerating trend expected from exponential disease spread with a growing number of spreaders. Dips on the weekends, especially in the number of positive tests, might be an effect of reduced working hours and reduced information transmission in and by health care authorities. At first glance, it is not entirely clear from these data whether the social distancing rules imposed by the federal and local governments during the last two weeks have had a significant effect on the spread of COVID-19, but the recent decline in the number of daily deaths raises hope.

Maps of cumulative and relative deaths and confirmed cases in Germany by county (Landkreis)

Western and Southern Germany have so far been more affected than Eastern Germany, with some individual counties (Landkreise) on the borders with France, Czechia and Austria especially hard hit. North Rhine-Westphalia, Bavaria and Baden-Württemberg, the federal states (Bundesländer) with the most inhabitants, have the most test-confirmed cases as well as deaths. A dashboard provided by the RKI, the GeoHealth Center at Bonn University and ESRI gives a good overview of the official numbers, which are published on a daily basis. The RKI also releases a daily report with relevant information.

Simple estimation based on systematic death lag

It generally is a difficult task to estimate the true number of infected people during an epidemic outbreak. However, we learned about two methods to do so in this excellent post by Tomas Pueyo.

One way is to focus on the current number of deaths. If we know the mean time from infection to death (for those cases that end in death) and the lethality (the overall probability of dying from COVID-19 once infected), then we can estimate the number of people who were infected in the past. We have some information about these two parameters from early scientific studies about COVID-19. We will use a fixed value of 17 days for the time to death and two different values for the lethality: 1% and 5%.
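The death-lag estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the post: the function name `estimate_past_infections` and the example death counts are made up, and it assumes cumulative death counts keyed by date.

```python
from datetime import date, timedelta


def estimate_past_infections(cumulative_deaths, lethality, lag_days=17):
    """Shift a cumulative death series back by lag_days and scale it by 1 / lethality.

    cumulative_deaths maps a date to the cumulative number of deaths on that date.
    The result maps (date - lag_days) to the estimated cumulative infections.
    """
    if not 0 < lethality <= 1:
        raise ValueError("lethality must be in (0, 1]")
    return {
        day - timedelta(days=lag_days): deaths / lethality
        for day, deaths in cumulative_deaths.items()
    }


# Made-up illustrative numbers, not real data
deaths = {date(2020, 3, 30): 500, date(2020, 3, 31): 600}
for lethality in (0.01, 0.05):
    print(lethality, estimate_past_infections(deaths, lethality))
```

Because the lethality enters as a divisor, the 1% scenario always yields five times as many estimated infections as the 5% scenario.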

In the figure below, the estimate of the true number of infections for Germany is plotted with one line for each of the two lethality scenarios. It can only be calculated up to the date that lies one mean time to death (17 days) before the latest data, which is indicated in the plot by a black vertical line.

Estimated true number of infected based on the registered number of deaths (for constant death probabilities of 1% and 5% and a mean time from infection to death of 17 days). The red line indicates the officially registered number of infected; the blue vertical line indicates the last day for which we currently have data (yesterday); the black vertical line marks the latest date for which the true number of infected can be estimated (yesterday minus 17 days). Data between the black and blue vertical lines are predictions based on exponential growth.

The lower the lethality of COVID-19, the higher the number of actually infected people in the past must have been, given the number of deaths that occurred later. We highlight that this estimated statistic is at least one order of magnitude higher than the measured observation of confirmed cases shown with the red line in the plot. Very interesting is the sudden uptick of the latter at the end of February, which is well reflected in the estimated statistic. Keep in mind: The estimation is based on deaths, not on test results! This correlation is therefore a good indicator that the estimate reflects some truth and that the number assumed for the mean time from infection to death (17 days) is not totally off.

Nevertheless, this estimator by definition only provides information about the more distant past (before the black vertical line). To extrapolate this statistic up to yesterday (between the black and the blue vertical lines) we need another set of assumptions. In the simplest possible growth model the disease spreads exponentially, and the number of infected doubles within a fixed time window: the doubling time. We can take the last value I₀ of our first statistic and extend it with a time series of exponential growth with

Iₜ = I₀ × 2^(t/d)

where Iₜ is the true number of infected individuals after the time t. t is counted in days from yesterday minus the mean number of days from infection to death. d is the aforementioned doubling time in days.
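As a small sketch of this growth formula, the following Python function evaluates Iₜ for each day of the extrapolation window. The function name and the starting value of 1,000 infections are illustrative assumptions; only the doubling times of 3, 7 and 12 days come from the scenarios in this post.

```python
def extrapolate_infections(i0, doubling_time, days):
    """Return [I_0, I_1, ..., I_days] with I_t = i0 * 2 ** (t / doubling_time)."""
    if doubling_time <= 0:
        raise ValueError("doubling_time must be positive")
    return [i0 * 2 ** (t / doubling_time) for t in range(days + 1)]


# Hypothetical starting value; 17 days bridge the gap up to yesterday
for d in (3, 7, 12):
    series = extrapolate_infections(i0=1000, doubling_time=d, days=17)
    print(f"doubling time {d:>2} days -> {series[-1]:,.0f} infected after 17 days")
```

After exactly one doubling time the value is twice the starting value, which is a quick sanity check for any implementation of the formula.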

The plot above shows three doubling time scenarios (3, 7 or 12 days) for each death probability scenario between the black and the blue vertical line (six scenarios in total). Some of them can already be ruled out by the real-life testing data, because they fall below the red curve. Others remain plausible. An increase of the doubling time is the desirable outcome in all cases, and the death counts of the following weeks will reveal whether the social distancing measures are effective in achieving it. Nevertheless, it is very likely that far more people are infected right now than testing is able to confirm.

In a last step we can use the estimated infection counts to extrapolate the number of expected deaths in the near future (yesterday plus the mean number of days from infection to death) for the different doubling time scenarios. The lethality is not relevant for this particular approximation: we divided the deaths by it to obtain the infections, and we multiply the infections by it again to obtain the future deaths, so it cancels out of the equation.
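The projection step can be sketched as follows. This is a simplified illustration under the stated assumptions (a fixed 17-day lag, cumulative counts, and one constant doubling time over the whole window); the function name and the input of 600 deaths are hypothetical.

```python
def project_future_deaths(deaths_now, doubling_time, lag_days=17, lethality=0.01):
    """Project cumulative deaths lag_days into the future.

    Step 1: infections lag_days ago      = deaths_now / lethality
    Step 2: infections now (exponential) = step 1 * 2 ** (lag_days / doubling_time)
    Step 3: deaths in lag_days           = step 2 * lethality
    """
    infections_past = deaths_now / lethality
    infections_now = infections_past * 2 ** (lag_days / doubling_time)
    return infections_now * lethality


# Hypothetical current death count
for d in (3, 7, 12):
    print(f"doubling time {d:>2} days -> {project_future_deaths(600, d):,.0f} deaths")
```

Written out, the three steps collapse to deaths_now × 2^(lag_days / doubling_time), so the result is the same for the 1% and the 5% lethality scenario.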

Current number of deaths (red line) and predicted number of future deaths (black lines) based on an exponential growth model for the number of past infected

If the number of cases that require intensive care rises above a certain threshold, the capacities of hospitals will inevitably run out and the lethality will increase beyond these projections. This dire possibility became a grim reality in Northern Italy.

Estimation via Bayesian growth models

To complement the analyses above and to make a more educated guess about the parameters visualized so far, we set up a Bayesian model to estimate the true number of infected people through time from both the reported deaths and the reported cases. This model was based on a slightly more complex notion of exponential growth with a built-in slow-down and includes the following assumptions:

  • A death rate of exactly 1% (we discuss deviations from this below)
  • A lag of 17 days between infection and death
  • A lag of 7 days between infection and confirmatory test
  • Exponential growth with a linear decrease of the growth rate due to the imposed social distancing measures

Given these assumptions, we can estimate the true number of infections, as well as the reported number of test cases and deaths. A complete definition and analysis of this model can be found here.
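To make the listed assumptions more tangible, here is a deterministic sketch of how such a model could map true infections to expected observations. It is not the Bayesian model of this post, whose full definition is linked above. The functional form (a growth rate r(t) = r0 - slope × t), the function names, and all parameter values, including the testing probability `p_test`, are assumptions made for illustration only.

```python
import math


def true_infections(t, i0, r0, slope):
    """Cumulative infections at day t for a growth rate r(t) = r0 - slope * t.

    Solving dI/dt = r(t) * I gives I(t) = i0 * exp(r0 * t - slope * t**2 / 2).
    Only meaningful while r(t) >= 0, i.e. for t <= r0 / slope.
    """
    return i0 * math.exp(r0 * t - 0.5 * slope * t ** 2)


def expected_observations(t, i0, r0, slope,
                          death_rate=0.01, p_test=0.2,
                          death_lag=17, test_lag=7):
    """Expected cumulative confirmed cases and deaths at day t.

    Confirmed cases mirror the infections from test_lag days earlier, scaled by
    the (assumed) probability of getting tested; deaths mirror the infections
    from death_lag days earlier, scaled by the death rate.
    """
    confirmed = p_test * true_infections(t - test_lag, i0, r0, slope)
    deaths = death_rate * true_infections(t - death_lag, i0, r0, slope)
    return confirmed, deaths


# Hypothetical parameter values, chosen only for illustration
confirmed, deaths = expected_observations(t=30, i0=100, r0=0.28, slope=0.005)
print(f"confirmed: {confirmed:.0f}, deaths: {deaths:.0f}")
```

A Bayesian treatment would put priors on parameters such as `i0`, `r0`, `slope` and `p_test` and fit them to the reported series, which is where the uncertainty bands come from.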

Model results for true (green) and confirmed cases (blue), as well as deaths (red). All three curves come from the same underlying Bayesian model and are estimated from the data (points)

The model predictions (the colored “ribbons”) are shown together with the actually reported cases and deaths (points). Because this is Bayesian inference, all model predictions are given with quantified uncertainty. Note that we have incorporated only data points between February 23 and April 1 in this analysis. Before that time, Germany had not yet experienced exponential growth.

As already shown above, the true number of infections (dark green) based on a death rate of 1% far exceeds the number of confirmed cases. We highlight that this is due to two effects. First, the reported cases and deaths lag behind the true infections: today's reported cases correspond to people who were infected seven days ago, so under exponential growth we expect today's true infections to be much higher. Second, not all people with an infection get tested, for example because they don’t show symptoms.

One of the nice features of our model is that we get an explicit estimate of this miss-rate, but it depends linearly on the death-rate. In this case, we have assumed a death rate of 1%, and this yields — shockingly — a probability of getting tested between 12% and 24% only. That would mean that 76–88% of true infected cases are not tested. With a death rate of 3%, for example, the miss-rate would “only” be about 40–60%. So this is hard to estimate, but it’s clear we’re missing a lot!

A significant complication in this regard is introduced by the age structure of the population, because we know that elderly people die from COVID-19 with a much higher probability than young people. An important next step for this kind of modelling would be to incorporate more realistic death rates, possibly age-stratified.

The specific growth model with linear slow-down seems to work OK for the data we have, although not perfectly. In particular, the slow-down in recent days seems to be stronger than modeled. This is somewhat expected, since the measures against the spread of the virus haven’t been “linear” in any way. Nevertheless, a linear slow-down is a reasonable first approximation to this process. Based on this, we can again, and this time in a more sophisticated way, try to predict how many cases we will have in the coming weeks. This is of course highly speculative and depends on the assumptions in the model. In fact, the uncertainty increases the further you predict into the future, which is visible in the widening of the model bands in the figure. For example, the number of reported cases on April 15 is predicted to be anywhere between 60,000 and 150,000 (though not with uniform probability) according to this model and its uncertainty today. The reported number of deaths by that time is predicted to be between 2,700 and 6,000 in Germany. These wide intervals simply reflect the limited power of the data to accurately estimate the parameters of the growth model.

A popular choice to illustrate the speed of an exponential growth model is the doubling time in days, which we already employed as a static parameter in the simple model above. Our Bayesian inference now allows us to estimate this parameter as a dynamic property of the underlying growth model. Here it is over the course of the last few weeks, with a short outlook into the next week:

Estimate of the doubling time in days. The visible slow-down (seen as an increase in the doubling time) is estimated from the data

So there definitely is some indication of a slow-down: the doubling time was about 2.5 days around the end of February, is about 5 days now (the black line indicates the time of this writing), and is predicted to be between 7 and 16 days a week from now. This is interesting in light of comments from officials that a doubling time of 10 days or more should be reached in order not to overwhelm the healthcare system.
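For readers who want to relate doubling times to growth rates, here is a minimal sketch. It assumes growth is described by a continuous daily rate r, so that the infected count scales as exp(r × t) and the doubling time is ln(2) / r. The function name and the slope value are hypothetical; only the 2.5-day starting point is taken from the text.

```python
import math


def doubling_time(growth_rate):
    """Doubling time in days for a continuous daily growth rate r: ln(2) / r."""
    if growth_rate <= 0:
        return math.inf  # no growth means the count never doubles
    return math.log(2) / growth_rate


r0 = math.log(2) / 2.5  # rate that corresponds to a 2.5-day doubling time
slope = 0.004           # hypothetical daily decrease of the growth rate
for day in (0, 10, 20, 30):
    rate = r0 - slope * day
    print(f"day {day:>2}: rate {rate:.3f} per day, doubling time {doubling_time(rate):.1f} days")
```

Because the doubling time is inversely proportional to the rate, a linear decrease of the rate makes the doubling time grow faster and faster, and the relative uncertainty on it blows up as the rate approaches zero.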

Conclusion and Outlook

We highlight three main conclusions from our modelling:

  1. The miss-rate, i.e. the probability that an infected person does not get tested, is currently one of the big unknowns in all countries. We can only estimate this number if we make strong assumptions about the death rate. Conversely, if the miss-rate were better known, this would allow a more accurate estimate of the death rate. One possibility to estimate the true prevalence would be representative random sampling from the population, which in fact is planned.
  2. “Predicting” the epidemiological dynamics into the future remains highly speculative. With Bayesian analyses, the degree of the resulting uncertainty is at least partly built into the model. In our case, we showed that even with an arguably under-complex growth model with linear slow-down, the uncertainty about the number of infections in the future is very large, with predicted numbers varying by a factor of 10 or more.
  3. One key, and perhaps simplifying, assumption in both our modelling attempts was the fixed “lag” between infection and test and between infection and death. One way to make these models more correct is to incorporate more realistic data on the course of individual infections. In reality, there is arguably a wide distribution of lag times until symptoms, until test results and until death, while currently we assume these lag times to be fixed time periods.

We hope that our work may trigger some feedback and motivation for others. It is very easy to get started on working with the data, for example by using our ready-to-use R package. A lot more analyses are possible when other data are taken into account, some of which are provided in this package, including county-based information about population numbers, the number of hospital beds, and age structure.
