There are more coronavirus cases than you think.

Danil Kozyatnikov
2 min read · Mar 24, 2020

This article describes the model I use to estimate the actual number of COVID-19 cases in different regions, as displayed on https://coronaviruspredictor.com/

If you haven’t seen the results, here is a screenshot of California’s estimate as of March 23rd.

Background

Officially reported coronavirus case numbers are unreliable. Reported deaths are more reliable, so I estimate the actual number of cases from mortality data only.

Why is reported case data not reliable?

There are many reasons for this, including:

  • There are simply not enough tests for everyone.
  • Many people do not exhibit any symptoms.
  • Some people simply wait it out at home and do not get tested.
  • Other people think that they have the flu and do not need a test.
  • Some data gets lost or is not properly reported in a pandemic situation.

However, all of these people can infect others while they are sick, and the statistics overlook them.

What’s up with mortality data?

When people die from COVID-19, it is much harder to overlook or underreport. Almost all of these people have been admitted to a hospital by that point, and death statistics are kept much more strictly.

The method

  1. Take the number of deaths in any given region.
  2. We know that on average it takes 5 days after infection for people to exhibit symptoms and another 14 days from the first symptoms to death.
  3. So we can assume that, on average, every reported patient who died was infected 19 days earlier.
  4. Then we fit a logistic model to the offset data and extrapolate it to the present day.
  5. Assuming an average 3.4% mortality rate, we divide the modeled deaths by that rate (equivalently, multiply them by about 29) to get an estimate of the actual number of existing cases at the moment.
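
As a rough illustration, the steps above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code from the repository: the function names, the input format (a list of cumulative deaths per day, ending today), and the crude grid-scan logistic fit are all hypothetical choices made for this example. Only the 5-day, 14-day and 3.4% figures come from the article. Deaths are scaled up to cases by dividing by the mortality rate.

```python
import math

INCUBATION_DAYS = 5           # infection -> first symptoms (article's figure)
SYMPTOMS_TO_DEATH_DAYS = 14   # first symptoms -> death (article's figure)
LAG_DAYS = INCUBATION_DAYS + SYMPTOMS_TO_DEATH_DAYS  # 19 days
MORTALITY_RATE = 0.034        # article's assumed 3.4%


def logistic(t, k, r, t0):
    """Logistic curve with plateau k, growth rate r and midpoint t0."""
    x = -r * (t - t0)
    if x > 700:               # avoid overflow in math.exp far before the midpoint
        return 0.0
    return k / (1.0 + math.exp(x))


def fit_logistic(days, values):
    """Crude fit (an assumption, not the repository's fitting routine).

    Scans candidate plateaus k; for each one, ln(k / y - 1) is linear in t,
    so r and t0 come from ordinary least squares. The k with the smallest
    squared error against the raw values wins.
    """
    points = [(t, y) for t, y in zip(days, values) if y > 0]
    if len(points) < 3:
        raise ValueError("need at least three days with non-zero deaths")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    y_max = max(y for _, y in points)
    best = None
    for step in range(1, 400):
        k = y_max * 1.02 ** step          # candidate plateau above the data
        z = [math.log(k / y - 1.0) for _, y in points]
        mean_z = sum(z) / n
        slope = sum((t - mean_t) * (zi - mean_z)
                    for (t, _), zi in zip(points, z)) / var_t
        r = -slope
        if r <= 0:
            continue                      # not a growing curve, skip this plateau
        t0 = (mean_z - slope * mean_t) / r
        error = sum((logistic(t, k, r, t0) - y) ** 2 for t, y in points)
        if best is None or error < best[0]:
            best = (error, k, r, t0)
    if best is None:
        raise ValueError("deaths are not growing; no logistic fit found")
    return best[1:]


def estimate_actual_cases(cumulative_deaths):
    """cumulative_deaths[i] = reported deaths up to day i; the last entry is today."""
    today = len(cumulative_deaths) - 1
    # Step 3: date every death count 19 days earlier, when the infection happened.
    infection_days = [day - LAG_DAYS for day in range(len(cumulative_deaths))]
    # Step 4: fit the shifted curve and extend it to the present day.
    k, r, t0 = fit_logistic(infection_days, cumulative_deaths)
    deaths_implied_today = logistic(today, k, r, t0)
    # Step 5: convert deaths into cases using the assumed mortality rate.
    return deaths_implied_today / MORTALITY_RATE
```

Calling `estimate_actual_cases(cumulative_deaths)` with a region's daily cumulative death counts returns one number: the estimated cases as of the last day in the list. A proper curve-fitting library would likely give a better fit than the grid scan used here; the scan just keeps the sketch dependency-free.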

The flaws

  • The model cannot see the effect of any preventive measures taken in the last 19 days, because deaths lag infections by that long.
  • While the logistic model is often a good fit, it is not perfect.
    I have excluded all regions with a bad model fit from the reporting.
  • The assumed 19 days and 3.4% rate vary widely by region. However, if we take a lower mortality rate, we will end up with even bigger numbers. Shortening the 19-day delay would make things look better, but there seems to be little evidence to support that.
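
To see why a lower mortality rate produces bigger numbers, here is a small sketch of the arithmetic. The helper name `cases_from_deaths` and the death count of 100 are made up for illustration; only the 3.4% rate comes from the article, and the other rates are arbitrary comparison points.

```python
def cases_from_deaths(modeled_deaths, mortality_rate):
    """Number of infections implied by a death count at a given mortality rate."""
    if not 0 < mortality_rate <= 1:
        raise ValueError("mortality_rate must be in (0, 1]")
    return modeled_deaths / mortality_rate


# 100 is an illustrative death count, not real data.
for rate in (0.034, 0.02, 0.01):
    print(f"{rate:.1%} mortality -> {cases_from_deaths(100, rate):,.0f} cases")
```

At 3.4%, every modeled death stands for roughly 29 cases; at 1%, it stands for 100. The estimate is inversely proportional to the assumed rate, so any region whose true rate is below 3.4% has more cases than the model reports.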

Where did the data come from?

All of the averages are actually medians; I use the two terms interchangeably here.

Can I see the code?

Everything is hosted on GitHub. You are welcome to play with it.
https://github.com/Danilka/covid-estimator
