Data Loves Covid: Analysis of Excess Deaths

Benjamin Joffe
Dec 9, 2020 · 7 min read

It’s been a confusing year, with conflicting messages about mask usage, infection rates and death rates.

Will more people die this year than in previous years? Several media articles compare 2020 to 2019 or 2018, but is that enough? How much variation is ‘normal’? I attempted to estimate the real impact of Covid in France with a statistical analysis based on previous years.

  • As of Dec 10, 2020, the French government reported 56,648 deaths due to Covid — almost 10% of the annual average.
  • Is this number accurate? Are some deaths unduly attributed to Covid when people test positive but die from other illnesses?
  • The same analysis can be done for most other countries, using only data found on Google and Google Sheets functions.

Raw data

Fortunately, the French national institute for statistics (INSEE) publishes a wealth of data. I found both the population and the number of deaths from 1982 to 2019:

Image for post

A few observations stand out:

  • The population has increased by about 20% in roughly 40 years.
  • Deaths grew too (from 550,000 in 1982 to 612,000 in 2019), but only by about 10%.
  • The number of deaths has increased notably since 2015.

Estimating annual variations

Unfortunately, we don’t have the data for the full year in 2020 (only Jan-Oct), so we’ll first look at the historical data.

We can see the number of deaths varies quite a bit each year (due to flu, heat waves, etc.). To estimate whether Covid truly had an impact, we have to check if it’s beyond the usual variations. If the number of deaths routinely varies a lot, then maybe Covid alone can’t explain a change.

For that, there is a useful measure called the standard deviation (SD or sigma), which helps estimate the range of variation in the series.

  • The SD is the square root of the variance.
  • The variance is the average of the squared deviations from the mean.
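
The two definitions above can be checked by hand on a tiny series. This is a minimal sketch with made-up values (not the INSEE figures); it divides by n (the ‘population’ form), whereas dividing by n − 1 would give the ‘sample’ form that some spreadsheet functions use.

```python
import math

# Toy yearly counts in thousands (illustrative values, NOT INSEE data)
values = [540, 552, 548, 561, 549]

mean = sum(values) / len(values)

# Variance: average of the squared deviations from the mean
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Standard deviation: square root of the variance
sd = math.sqrt(variance)

print(f"mean={mean:.1f} variance={variance:.1f} sd={sd:.2f}")
```

The SD comes out in the same unit as the data (here, thousands of deaths), which is what makes it convenient for talking about a ‘normal’ range.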

For instance, the chart below says the average U.S. female height is 5'8". For a bell-shaped (normal) distribution like this one, approximately 68% of U.S. females fall within 1 SD of the mean (on either side), and about 95% within 2 SDs.

Image for post

The first question here is then: what mean are we talking about?

  • And what time period?
    Should we use all the data from 1982? Only recent years? And why?

Generally, it is better to use more data points. To understand why, imagine you want to know the ‘average value of a die’ and you roll it only 3 times, versus a thousand times. So at first I tried using all the data from 1982.
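
The dice intuition is easy to simulate. The sketch below is only an illustration (the helper name `average_roll` and the seed are arbitrary choices): it repeats the experiment five times with 3 rolls and five times with 1,000 rolls, so the spread of the two sets of averages can be compared.

```python
import random

def average_roll(n_rolls, rng):
    """Average of n_rolls throws of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

rng = random.Random(42)  # fixed seed so the run is repeatable

# Repeat each experiment five times to see how much the average moves
few = [average_roll(3, rng) for _ in range(5)]
many = [average_roll(1000, rng) for _ in range(5)]

print("3 rolls:   ", [round(x, 2) for x in few])
print("1000 rolls:", [round(x, 2) for x in many])
```

The averages over 1,000 rolls stay close to the theoretical 3.5, while averages over 3 rolls can land almost anywhere between 1 and 6.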

  • Which mathematical model?
    Should we use a fixed value?
    A line (y = ax + b)?
    A more complex polynomial (y = a + bx + cx² + dx³ + ex⁴ + …)?
    An exponential, logarithm, power series, other?


  • I calculated the average since 1982, which is about 552,000 deaths. However, with the population increase it seems to be a very imperfect model, especially for recent years. Imagine we had 1,000 years of data: the average would vastly underestimate the number of deaths in recent years.
  • Let’s see what a linear regression (a straight line) gives us.

If we zoom in on the Y axis to see the variations better, they look pretty wild over the years, with large swings around the line.

Image for post

When we calculate the SD we find about 19,000.

It means that yearly variations (plus or minus) of about 19,000 deaths are fairly ‘normal’.

However, the values starting in 2015 are all above 1 SD (real data > value of the model + 19,000). So it doesn’t seem to be a very good model over this timeframe.
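
For readers who prefer code to spreadsheet trendlines, here is a rough sketch of the same idea: fit a straight line by least squares, then measure the SD of the gaps between the data and the line. The numbers are made up for illustration (they are not the INSEE series), the function names are mine, and I am assuming the SD is taken on the deviations from the model.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def residual_sd(xs, ys, a, b):
    """SD of the gaps between the data and the line (population form)."""
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Made-up yearly totals in thousands (NOT the INSEE series)
years = [2010, 2011, 2012, 2013, 2014, 2015]
deaths = [551, 545, 570, 569, 559, 594]

a, b = fit_line(years, deaths)
sd = residual_sd(years, deaths, a, b)

# Flag the years that sit more than 1 SD away from the line
for year, observed in zip(years, deaths):
    gap = observed - (a * year + b)
    flag = "outside 1 SD" if abs(gap) > sd else ""
    print(year, round(gap, 1), flag)
```

If several consecutive years are all flagged on the same side of the line, as with 2015 onward in the real data, that is a hint the line is not capturing the trend.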

  • Let’s look at trendlines using higher order polynomial degrees.
Image for post

They are fitting better and better!

In terms of population model, however, it’s hard to explain why there is a dip in the middle. Maybe it has to do with demographics? Baby boomers were still young in the 80’s, but are starting to be quite old in the 2010’s and beyond.

  • Using data from INSEE, we see in the 1982 chart (1) the dips left by the First and Second World Wars and (2) the bulge of baby boomers (born between 1945 and the 60’s, depending on the definition), all below 40yo in 1982.
  • In the 2020 chart, boomers are now between 50 and 75yo.
Image for post
  • Even with an improved life expectancy due to better living standard and healthcare, the age pyramid alone could explain an increase of the number of deaths in recent years.
  • As a reference, the >65yo share of the population went from 13.5% to 20.7% between 1982 and 2020. That’s a 53% increase! If life expectancy had remained the same, we should in fact have observed an equivalent increase in the number of deaths. But instead of going from 550,000 to about 840,000, it increased by ‘only’ about 10%, to 612,000. We’re doing great!
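
The back-of-the-envelope arithmetic in the last bullet can be reproduced in a few lines. This is only a sketch of that rough reasoning, using the figures quoted in the text; the variable names are mine.

```python
share_1982 = 13.5      # % of the population over 65 in 1982 (from the text)
share_2020 = 20.7      # % of the population over 65 in 2020 (from the text)
deaths_1982 = 550_000  # deaths in 1982 (from the text)

# Relative increase of the over-65 share
growth = share_2020 / share_1982 - 1

# Deaths we would see if they had grown in the same proportion
scaled_deaths = deaths_1982 * (1 + growth)

print(f"{growth:.0%}")
print(round(scaled_deaths))
```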

Back to our trendlines, they have another problem: by trying to fit too well, they end with a very steep slope which doesn’t seem to make sense.

In fact, if we try to fit the data even better with higher order trendlines, it becomes downright strange:

Image for post

The degree 6 and 8 polynomials swing so much that they look like vertical lines. What’s happening is called ‘overfitting’: the model tries to stick too closely to the data provided and loses its modeling benefits. The extreme version of overfitting is simply to find a formula that connects all your data points perfectly: perfect fit!
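
To make the ‘perfect fit’ extreme concrete, here is a small sketch using Lagrange interpolation, a standard way to build the one polynomial that passes through every data point. It is not what the spreadsheet trendlines do, and the data is made up (not the INSEE series), but it shows the same failure mode: zero error on the known years, and a meaningless value one year later.

```python
def lagrange_eval(xs, ys, x):
    """Value at x of the unique polynomial passing through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up yearly totals in thousands (NOT the INSEE series)
years = list(range(2010, 2020))
deaths = [551, 545, 570, 569, 559, 594, 594, 606, 610, 613]

# The degree-9 polynomial reproduces every known point...
for year, observed in zip(years, deaths):
    assert abs(lagrange_eval(years, deaths, year) - observed) < 1e-6

# ...but one step outside the data it lands far from anything plausible.
prediction_2020 = lagrange_eval(years, deaths, 2020)
print(round(prediction_2020))
```

A model that explains the past perfectly and the next year absurdly is of no use for estimating what 2020 ‘should’ have looked like.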

So which model to use?

A friend more versed in statistics than I am told me: “when you don’t have a lot of data, use the simplest model”. So we’ll stick to the straight line, which seems to model the population growth quite well.

Second problem: how many years?

There is an influence of the boomer population, which wasn’t quite a linear change. So maybe we should restrict the number of years we’re using for modeling? But how many?

  • We saw on the first graph that there was a spike of deaths in 2003. This was due to a heat wave.
  • The following year has a dip. This is due to the ‘harvest effect’ (also known as mortality displacement): many who would have died in 2004 died a year earlier instead.

It seems we could pick the data from 2004 onward, so about 15 years. Here is the resulting graph with a linear model:

Image for post

This is looking pretty good!

The SD in this case is 8,568. We can see that the line fits quite well (remember the slope looks steep because the Y axis does not start at zero).

It is tempting to take even fewer data points: the numbers seem to stabilize from 2015, but then we end up with a mere 5 data points. How meaningful would a model with so few points be? Not very. Still, here is the graph:

Image for post

The SD there is a mere 2,505. A bit too perfect maybe?

Estimating deaths in 2020 with the model

Using our linear model over 15 years, we can calculate the expected deaths in 2020: the result is 610,385.

Based on that, the second question is: what is the standard deviation compared to the model?

Adding one standard deviation, the 2020 number is 610,385 + 8,568 = 618,953. Two SDs (which starts to be quite remarkable) = 627,521. Beyond that would be a truly outstanding number.
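
The thresholds above are simple to tabulate. A minimal sketch, using the two figures quoted in the text (the helper name `threshold` is mine):

```python
expected_2020 = 610_385  # linear model over 2004-2019 (from the text)
sd = 8_568               # SD of that model (from the text)

def threshold(n_sd):
    """Death count sitting n_sd standard deviations above the model."""
    return expected_2020 + n_sd * sd

for n in (1, 2, 3, 4):
    print(f"{n} SD: {threshold(n):,}")
```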

Estimating deaths in 2020 based on current data

The next question is: since we don’t have the full-year data, what is a reasonable estimate for 2020?

  • We could take the monthly average of the first 10 months of 2020 and multiply by 12. The result is: 12 × 574,966 / 10 = 689,959. That looks very high!
  • As more people die in the winter, we might either underestimate Nov/Dec or overestimate the ‘harvest effect’ of people who should have died in the winter. Already we see the limits of this projection…
  • Another model would be to take the minimal values of the previous years in Nov/Dec to try to find a low estimate for 2020.
  • It’s far from perfect, but I tried this as an experiment and found 641,188. If we do reach this number by year end, that would be 3.6 standard deviations above the value expected by our 15-year model (SD of 8,568). A truly high number.
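
Both projections can be converted into ‘number of SDs above the model’ with one small function. This sketch only reuses figures quoted in the text; the function name `sd_above_model` is mine, and the low estimate is taken as given rather than recomputed.

```python
expected_2020 = 610_385   # linear model over 2004-2019 (from the text)
sd = 8_568                # SD of that model (from the text)
deaths_jan_oct = 574_966  # deaths recorded for Jan-Oct 2020 (from the text)

def sd_above_model(total):
    """How many standard deviations a yearly total sits above the model."""
    return (total - expected_2020) / sd

# Naive projection: scale the first 10 months to 12
naive_total = 12 * deaths_jan_oct / 10

# Low estimate quoted in the text (minimal Nov/Dec values of previous years)
low_estimate = 641_188

print(round(naive_total), round(sd_above_model(naive_total), 1))
print(low_estimate, round(sd_above_model(low_estimate), 1))
```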


  • Only real data will tell whether this year has been much worse than ‘normal’ variations.
  • The government reported 56,648 Covid deaths in France. With the above model, if those were all ‘excess deaths’ it would put the year about 6.6 standard deviations (56,648 / 8,568) above the expected number. That is way out of range, especially considering the first 10 months of 2020.
  • It is likely we’ll end the year between 2 and 4 standard deviations above the model (between 627,521 and 644,657), which would still be a ‘bad’ year (like a bad heat wave or flu).
  • How much of the excess deaths should we attribute to Covid? Maybe anything above 1 SD would be reasonable?
  • It is also very likely that 2021 will see a dip, likely below even the 2019 level if a ‘harvest effect’ is at play.

Time will tell!

Benjamin Joffe

Written by

Partner @ SOSV — $700m VC fund for Deep Tech (biology, robotics, etc.) | Digital Naturalist | Keynote Speaker | Angel Investor

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem
