Published in Analytics Vidhya

Data Loves Covid: Analysis of Excess Deaths

It’s been a confusing year, with conflicting messages about mask usage, infection rates and death rates.

Will more people die this year than during previous years? Several media articles compare to 2019 or 2018 but is that enough? How much variation is ‘normal’? I attempted to estimate the real impact of Covid in France with a statistical analysis based on previous years.

  • As of Dec 10, 2020, the French government reported 56,648 deaths due to Covid — almost 10% of the annual average.
  • Is this number accurate? Are some deaths unduly attributed to Covid when people test positive but die from other illnesses?
  • The same analysis is possible for most other countries, using only publicly available data found via Google and Google Sheets functions.

Raw data

Fortunately, the French national institute for statistics (INSEE) publishes a wealth of data. I found both the population and the number of deaths for 1982 to 2019:

The number of deaths varies more than the total population

There are some immediate comments:

  • The population has increased by about 20% in roughly 40 years.
  • Deaths grew too (from 550,000 in 1982 to 612,000 in 2019), but only by about 10%.
  • The number of deaths increased notably since 2015.

Estimating annual variations

Unfortunately we don’t have the data for the full year in 2020 (only Jan-Oct) so we’ll look first into the historical data.

We can see the number of deaths varies quite a bit each year (due to flu, heat waves, etc.). To estimate whether Covid truly had an impact, we have to check if it’s beyond the usual variations. If the number of deaths routinely varies a lot, then maybe Covid alone can’t explain a change.

For that, there is a useful measure called the standard deviation (SD or sigma), which helps estimate the range of variation in the series.

  • The SD is the square root of the variance.
  • The variance is the average of the squared deviations from the mean.
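To make the two definitions concrete, here is a minimal sketch in Python. The numbers are made-up toy values, not INSEE data, and the function names are my own. It uses the "population" version of the formula (dividing by n); the "sample" version divides by n - 1 instead, and the article does not say which one was used.

```python
import math

def variance(values):
    """Population variance: the average of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """The SD is the square root of the variance."""
    return math.sqrt(variance(values))

# Made-up toy numbers for illustration only (not INSEE data).
toy = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(toy))            # 4.0
print(standard_deviation(toy))  # 2.0
```

In Google Sheets, STDEVP computes the population version and STDEV the sample version.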

For instance, the chart below says the average U.S. female height is 5'8". If heights follow a bell curve (a normal distribution), approximately 68% of U.S. females fall within 1 SD of the mean (on either side), and about 95% fall within 2 SDs.
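The 68% and 95% figures are properties of the normal distribution, and they can be checked in a few lines. This is a small sketch assuming the data is normally distributed; `share_within` is a name I made up for the helper.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```

Real-world series such as yearly deaths are not guaranteed to be normal, so these percentages are a rule of thumb rather than a guarantee.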

The first question here is then: what mean are we talking about?

  • And what time period?
    Should we use all the data from 1982? Only recent years? And why?

Generally, it is better to use more data points. To understand why, imagine you want to know the ‘average value of a die’ and you roll it only 3 times, versus a thousand times. So I tried at first to take all data from 1982.
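The die intuition can be simulated. The sketch below repeats the "roll and average" experiment many times and measures how much the resulting averages spread out. The number of repetitions (200) and the fixed seed are arbitrary choices of mine, made only so the run is reproducible.

```python
import random
import statistics

def average_of_rolls(n_rolls, rng):
    """Average value of a fair six-sided die over n_rolls throws."""
    return statistics.mean([rng.randint(1, 6) for _ in range(n_rolls)])

rng = random.Random(0)  # fixed seed so the experiment is reproducible

# Repeat each experiment 200 times and look at how much the averages vary.
averages_3 = [average_of_rolls(3, rng) for _ in range(200)]
averages_1000 = [average_of_rolls(1000, rng) for _ in range(200)]

small_spread = statistics.pstdev(averages_3)
large_spread = statistics.pstdev(averages_1000)

print(f"spread of the average with 3 rolls:    {small_spread:.3f}")
print(f"spread of the average with 1000 rolls: {large_spread:.3f}")
```

The averages from 1,000 rolls cluster tightly around 3.5, while the averages from 3 rolls are all over the place. The same logic argues for using more years of data, as long as those years are still comparable.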

  • Which mathematical model?
    Should we use a fixed value?
    A line (y = ax + b)?
    A more complex polynomial (y = a + bx + cx² + dx³ + ex⁴ + …)?
    An exponential, logarithm, power series, other?


  • I calculated the average since 1982, which is about 552,000 deaths. However, with the population increase it seems to be a very imperfect model, especially for recent years. Imagine we had 1,000 years of data: the average would vastly underestimate the number of deaths in recent years.
  • Let’s see what a linear regression (a straight line) gives us.

If we zoom in on the Y axis to see the variations better, they look pretty wild over the years, with large swings around the line.

This doesn’t look like a great fit

When we calculate the SD we find about 19,000.

It means that yearly variations (plus or minus) of about 19,000 deaths are fairly ‘normal’.

However, the values starting in 2015 are all above 1 SD (real data > value of the model + 19,000). So it doesn’t seem to be a very good model over this timeframe.
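For readers who prefer code to spreadsheet functions, here is a rough sketch of the same procedure: fit a straight line by least squares, measure the SD of the gaps between the data and the line, and flag the years that sit more than 1 SD above it. The yearly figures below are invented placeholders, not the INSEE series, and the function names are mine.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residual_sd(xs, ys, slope, intercept):
    """SD of the gaps between the data and the line.

    With a least-squares line the gaps average to zero, so their SD is
    simply the square root of their mean square.
    """
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Invented placeholder values for illustration only (NOT the INSEE data).
years = [2000, 2001, 2002, 2003, 2004, 2005]
deaths = [530_000, 536_000, 529_000, 541_000, 538_000, 546_000]

slope, intercept = fit_line(years, deaths)
sd = residual_sd(years, deaths, slope, intercept)

# Years whose value exceeds the model by more than 1 SD.
above = [y for y, d in zip(years, deaths) if d > slope * y + intercept + sd]
print(f"slope = {slope:.0f} deaths/year, SD = {sd:.0f}, above 1 SD: {above}")
```

If several consecutive years land in the "above 1 SD" list, as happens from 2015 in the real data, that is a sign the straight line is missing something over that period.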

  • Let’s look at trendlines using higher-order polynomials.
Those three models fit pretty well… but what about this dip?

They are fitting better and better!

In terms of population model, however, it’s hard to explain why there is a dip in the middle. Maybe it has to do with demographics? Baby boomers were still young in the 80’s, but are starting to be quite old in the 2010’s and beyond.

  • Using data from INSEE, the 1982 age pyramid shows (1) the dips left by the First and Second World Wars and (2) the bulge of baby boomers (born between 1945 and the 1960s, depending on the definition), all under 40 years old in 1982.
  • In the 2020 chart, boomers are now between 50 and 75yo.
Boomers have grown up
  • Even with an improved life expectancy due to better living standard and healthcare, the age pyramid alone could explain an increase of the number of deaths in recent years.
  • As a reference, the share of the population over 65 went from 13.5% to 20.7% between 1982 and 2020. That’s a 53% increase! If life expectancy had remained the same, we should in fact have observed an equivalent increase in the number of deaths. But instead of going from 550,000 to about 840,000, it increased by ‘only’ about 10%, to 612,000. We’re doing great!
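The back-of-the-envelope numbers in the last bullet can be reproduced as follows. This is only the crude scaling described above (deaths assumed proportional to the share of over-65s), using the figures quoted in the text; the variable names are mine.

```python
share_over_65_1982 = 0.135
share_over_65_2020 = 0.207
deaths_1982 = 550_000
deaths_2019 = 612_000

# Relative growth of the over-65 share.
relative_increase = share_over_65_2020 / share_over_65_1982 - 1

# Deaths we would expect if they had grown in the same proportion.
scaled_deaths = deaths_1982 * share_over_65_2020 / share_over_65_1982

# What actually happened.
actual_increase = deaths_2019 / deaths_1982 - 1

print(f"over-65 share grew by {relative_increase:.0%}")        # 53%
print(f"proportional scaling gives {scaled_deaths:,.0f} deaths")  # 843,333
print(f"actual increase in deaths: {actual_increase:.0%}")     # 11%
```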

Back to our trendlines, they have another problem: by trying to fit too well, they end with a very steep slope which doesn’t seem to make sense.

In fact, if we try to fit the data even better with higher order trendlines, it becomes downright strange:

Examples of overfitting models

The degree 6 and 8 polynomials swing so much that they look like vertical lines. What’s happening is called ‘overfitting’: the model sticks too closely to the data provided and loses its modeling benefits. The extreme version of overfitting is simply to find a formula that connects all your data points perfectly: perfect fit!
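The ‘perfect fit’ extreme is easy to demonstrate. With n points you can always build a polynomial of degree n - 1 that passes through every one of them (Lagrange interpolation). The sketch below does this on eight made-up values, not the deaths data, and then asks the polynomial for the next point.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1
    that passes through every (xs[i], ys[i]) point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up, gently rising and noisy values (illustration only).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [10, 12, 9, 13, 11, 14, 12, 15]

# The polynomial reproduces every known point: a "perfect fit".
for x, y in zip(xs, ys):
    assert abs(lagrange_interpolate(xs, ys, x) - y) < 1e-6

# But its forecast for the next point is far outside the 9-15 range of the data.
next_value = lagrange_interpolate(xs, ys, 8)
print(f"forecast for x = 8: {next_value:.0f}")
```

A model that matches the past perfectly and predicts nonsense for next year is exactly what we want to avoid here.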

So which model to use?

A friend more versed in statistics than I am told me: “when you don’t have a lot of data, use the simplest model”. So we’ll stick to the straight line, which seems to model the population growth quite well.

Second problem: how many years?

There is an influence of the boomer population, which wasn’t quite a linear change. So maybe we should restrict the number of years we’re using for modeling? But how many?

  • We saw on the first graph that there was a spike of deaths in 2003. This was due to a heat wave.
  • The following year has a dip. This is due to the ‘harvest effect’: many who would have died in 2004 died instead a year earlier.

It seems we could pick the data from 2004 onward, so about 15 years. Here is the resulting graph with a linear model:

A simple model fitting quite well!

This is looking pretty good!

The SD in this case is 8,568. We can see that the line fits quite well (remember the slope looks steep because the Y axis does not start at zero).

It is tempting to take even fewer data points: the numbers seem to stabilize from 2015, but then we end up with a mere 5 data points. How meaningful would a model with so few points be? Not very. Still, here is the graph:

Overfitting due to sample bias

The SD there is a mere 2,505. A bit too perfect maybe?

Estimating deaths in 2020 with the model

Using our linear model over 15 years, we can calculate the expected deaths in 2020: the result is 610,385.

Based on that, the second question is: what is the standard deviation compared to the model?

Adding one standard deviation, the 2020 number is 610,385 + 8,568 = 618,953. Two SDs (which starts to be quite remarkable) = 627,521. Beyond that would be a truly outstanding number.
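The thresholds are simple arithmetic on the two numbers from the 15-year model. Here is a small sketch, with a helper name of my own, that makes it easy to try other multiples of the SD:

```python
EXPECTED_2020 = 610_385  # value of the 15-year linear model for 2020
SD_15_YEARS = 8_568      # SD around that model

def threshold(k):
    """Number of deaths sitting k standard deviations above the model."""
    return EXPECTED_2020 + k * SD_15_YEARS

for k in (1, 2, 3, 4):
    print(f"{k} SD above the model: {threshold(k):,}")
# 1 SD above the model: 618,953
# 2 SD above the model: 627,521
# 3 SD above the model: 636,089
# 4 SD above the model: 644,657
```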

Estimating deaths in 2020 based on current data

The third question is: while we don’t have the full year data, what is a reasonable estimate for 2020?

  • We could take the monthly average of the first 10 months of 2020 and multiply by 12. The result is: 12 × 574,966 / 10 = 689,959. That looks very high!
  • As more people die in the winter, we might either underestimate Nov/Dec or overestimate the ‘harvest effect’ on people who would otherwise have died in the winter. Already we see the limits of this projection…
  • Another model would be to take the minimal values of the previous years in Nov/Dec to try to find a low estimate of 2020.
  • It’s far from perfect, but I tried this as an experiment and found 641,188. If we do reach this number by year end, that would be 3.6 standard deviations above the prediction of our 15-year linear model. A truly high number.
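The two projections can be checked with a few lines of arithmetic. One assumption of mine is flagged in the code: I read the low estimate as "Jan-Oct 2020 plus a Nov/Dec figure taken from previous years", which lets us back out the Nov/Dec figure that was implicitly added.

```python
JAN_OCT_2020 = 574_966   # deaths recorded for Jan-Oct 2020
LOW_ESTIMATE = 641_188   # low full-year estimate quoted above
EXPECTED_2020 = 610_385  # 15-year linear model
SD_15_YEARS = 8_568

# Naive projection: monthly average of the first 10 months, times 12.
naive_projection = 12 * JAN_OCT_2020 / 10

# ASSUMPTION: low estimate = Jan-Oct 2020 + a Nov/Dec figure from earlier years.
implied_nov_dec = LOW_ESTIMATE - JAN_OCT_2020

# How many SDs above the model the low estimate sits.
sds_above_model = (LOW_ESTIMATE - EXPECTED_2020) / SD_15_YEARS

print(f"naive projection: {naive_projection:,.0f}")      # 689,959
print(f"implied Nov/Dec deaths: {implied_nov_dec:,}")    # 66,222
print(f"low estimate is {sds_above_model:.1f} SD above the model")  # 3.6
```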


  • Only real data will tell whether this year has been much worse than ‘normal’ variations.
  • The government reported 56,648 Covid deaths in France. With the above model, if those were all ‘excess deaths’, the year would end about 6.6 standard deviations above the expected number. That is way out of range, especially considering the first 10 months of 2020.
  • It is likely we’ll end the year between 2 and 4 standard deviations above the model (between 627,521 and 644,657), which would still be a ‘bad’ year (like a bad heat wave or flu).
  • How much of the excess deaths should we attribute to Covid? Maybe anything above 1 SD would be reasonable?
  • It is also very likely that 2021 will see a dip, likely below even the 2019 level if a ‘harvest effect’ is at play.
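As a quick sanity check on the ‘all excess’ scenario, here is the same kind of arithmetic applied to the reported Covid deaths. It uses only numbers quoted in this article, and the helper name is mine.

```python
EXPECTED_2020 = 610_385       # 15-year linear model
SD_15_YEARS = 8_568
REPORTED_COVID_DEATHS = 56_648

def sds_of_excess(excess_deaths):
    """Express a number of excess deaths in units of the model's SD."""
    return excess_deaths / SD_15_YEARS

# Scenario: every reported Covid death comes on top of the expected number.
total_if_all_excess = EXPECTED_2020 + REPORTED_COVID_DEATHS

print(f"total if all excess: {total_if_all_excess:,}")                  # 667,033
print(f"that is {sds_of_excess(REPORTED_COVID_DEATHS):.1f} SD above the model")  # 6.6
```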

Time will tell!



Benjamin Joffe



Partner @ SOSV — Deep Tech VC w/ $1B AUM | Digital Naturalist | Keynote Speaker | Angel Investor | Mediocre chess player, worse at Jiu-jitsu