Estimating the actual numbers of COVID-19 cases in highly infected countries (a quick guess of a lower bound)

Moustafa Othman
Apr 12, 2020

It has been suggested that the actual number of coronavirus cases in many countries is higher than the reported number, for several reasons. One reason is simply that the cases reported at any given moment are people who got infected a few days to two weeks earlier, a delay that corresponds to the incubation period of COVID-19. This is shown in the graph below.

Source: Tomas Pueyo's analysis of a chart from the Journal of the American Medical Association, based on raw case data from the Chinese Center for Disease Control and Prevention

Another reason is that some people, especially the young and the healthy, show no symptoms, or their symptoms are mild enough that they do not seek medical help or try to get tested, and they probably recover on their own.

There has been an ongoing idea in some countries to conduct mass testing: gather samples from the population, start by testing people chosen at random and then, for every positive result, follow up with family members, neighbors, colleagues and so on. This builds a network of contacts that could save a lot of the time and resources that testing everyone at once would require. Mass testing is also the motivation for a suggestion that the governments of these countries are considering: letting those who have been infected and recovered, and hence gained immunity (at least temporarily), go outside. These individuals would hold certificates or bands showing that.

Whether that is a good idea or not, an interesting question is: how many people have already been infected by the virus? Is there a way to guess this number before mass-testing everyone?

One idea that I’ve been thinking about is to assume that COVID-19 infects individuals from different age groups with equal chances, whether they develop symptoms or not. In other words, the age distribution of the actual infected cases in a country should more or less match that of the country's population. (Assumption №1)

Let’s also assume that the oldest age groups are the ones that develop the worst symptoms and thus seek medical help and get tested for the virus. This means that the reported number of cases in the oldest age group is closer to the actual number of cases in that group than is true for any other age group. (Assumption №2)

We can then use these assumptions to guess the actual number of cases in the other age groups and thus the total actual number of cases in the country. However, this method probably just sets a lower bound on the actual cases. In assumption one, the share of younger individuals among the actual cases is most probably higher than their share of the population, since they tend to be more social and less afraid of getting infected than the older groups. And in assumption two, there are certainly still some infected cases in the oldest age group that went unreported.

For our calculation, we simply need two pieces of information.

The first is the age distribution of the reported cases in the country (not so easy to find for many countries).
The second is the age distribution of the country's population in a recent year (easy to find on the internet).

I don’t want to treat the topic too rigorously, as my goal is simply to guess an approximation out of curiosity and not for any further serious research. However, I assume that the higher the number of cases in a country compared to its population, the more precise this method is. Also, let’s assume that the country of interest has already been in lockdown for a few weeks, so that the numbers have settled down. That way I can simply ignore the first reason mentioned in the opening paragraph, the delay between infection and reporting.

Let’s start with Italy as our first example. As of today (12.04.2020), Italy has conducted around 963,473 tests, which resulted in around 152,271 reported cases in the country. The number of tests, although high, is still only around 6 times the number of reported cases, indicating that they have probably missed a lot of cases that were asymptomatic or simply recovered on their own. I am guessing that the remaining five-sixths of the tests, the negative ones, were partly those of people who came into direct contact with positive cases yet had not contracted the virus, and partly those of people who showed similar symptoms caused by other illnesses. Back to our method: the age distribution of the reported cases in Italy and the demographic age distribution in 2019 are in the two graphs below.

Italy: Age distribution of the reported cases till (11.04.2020). Source: https://www.statista.com/statistics/1103023/coronavirus-cases-distribution-by-age-group-italy/
Italy population age distribution. Source: https://www.populationpyramid.net/italy/2019/

When you look at the demographic age distribution, you find that those aged 70+ make up around 17.1% of the total population. That is less than half of the share of that age group among the reported cases. If we accept the first assumption, that individuals from different age groups have equal chances of getting infected, this indicates that a lot of cases in the younger groups went unreported. An easy way to estimate how many is to use the second assumption and say that the reported cases aged 70+, being the closest to the actual number of cases in that age group, make up only 17.1% (the population percentage of the group) of the total actual cases. The total number of reported cases in Italy, as I mentioned earlier, was 152,271 (by 12.04.2020). Of those, around 55,480 are aged 70+ (corresponding to around 36.5%). Assuming that these 55,480 cases correspond to only 17.1% of the total actual cases, one can approximate the total to be around 324,444 cases. That is around 2.13 times the current reported number, and it is probably only a lower limit on the actual number out there.

To write this as an equation, it would be something like:

Actual No. of cases = (Reported No. of cases in the oldest age group(s) / percentage of the country's population in that age group) × 100
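
As a quick sanity check, the equation above fits in a few lines of Python. This is only a sketch: the function name `estimate_actual_cases` and its parameter names are made up for illustration, and the inputs are the Italian figures quoted earlier (55,480 reported cases aged 70+, a 17.1% population share, 152,271 reported cases in total).

```python
def estimate_actual_cases(reported_in_group: float, population_share_pct: float) -> float:
    """Lower-bound guess of total actual cases from one reference age group.

    reported_in_group: reported cases in the oldest age group(s)
    population_share_pct: that group's share of the population, in percent
    """
    if not 0 < population_share_pct <= 100:
        raise ValueError("population_share_pct must be in (0, 100]")
    return reported_in_group / population_share_pct * 100


# Italy, reference group 70+ (figures as quoted in the text)
italy_reported_total = 152_271
italy_estimate = estimate_actual_cases(55_480, 17.1)
italy_multiplier = italy_estimate / italy_reported_total

print(round(italy_estimate))       # 324444
print(round(italy_multiplier, 2))  # 2.13
```

The result is only as good as the two assumptions behind it, so it is best read as a rough lower bound rather than a measurement.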

Another country of interest is Germany. The country has conducted a relatively higher number of tests: around 1,317,887 by 08.04.2020 and probably a million and a half by today (12.04.2020), which is around 10.4 times the number of reported cases (around 127,574 by 12.04.2020). One should then expect the share of missing unreported cases to be smaller than in Italy, and the age distribution of the reported cases to be closer to the age distribution of the German population, as one can see in the two graphs below. The first graph shows the size of each age group in the population, in millions. The second graph shows the number of reported cases in each age group, rather than the percentage. One can simply calculate the percentage corresponding to each age group.

Age distribution of the German Population. Source: https://www.populationpyramid.net/germany/2019/
Germany: Age distribution of the reported cases till (12.04.2020). Source: https://www.statista.com/statistics/1103023/coronavirus-cases-distribution-by-age-group-italy/

As I did for Italy, I wanted to calculate the actual number of infected cases using the 70+ age group, but for the reported coronavirus cases there was only data for the 80+ and 60+ age groups. Obviously, the older the age group we take as our reference, the higher the number of actual cases we will get. But you don’t want to push yourself to such an old group that the number of cases becomes very small, which could lead to a higher statistical error. We do, however, have enough cases in the 80+ group, and thus (maybe) not so large a statistical error, to justify taking it as our reference group. The reported cases in that age group make up around 9.3% of the total reported cases, that is 11,165 cases. The population percentage of the same age group is around 6.9%. Thus, the total number of actual cases in Germany could be around 161,811. That is only around 1.27 times the reported cases, which could be one reason for the low fatality rate of COVID-19 in Germany compared to some other countries.

For fairness, I will do this calculation again with the 60+ age group. The reported cases in that age group make up around 28.7% of the total reported cases, that is 34,505 cases. The population percentage of the same age group is around 28.1%, which is very close, indicating that the number of actual cases is almost that of the reported cases. But then again, an age group of 60+ is quite young, and using it would miss the whole point of this method, so I prefer to trust my guess based on the 80+ age group data.
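
To see how much the choice of reference group matters, here is the same arithmetic for Germany with both groups side by side. Again this is just a sketch: the helper `estimate_actual_cases` and the dictionary layout are illustrative choices, and the numbers are the ones quoted in the text.

```python
def estimate_actual_cases(reported_in_group: float, population_share_pct: float) -> float:
    """Reported cases in the reference group scaled up by its population share (in percent)."""
    return reported_in_group / population_share_pct * 100


# Germany: reference group -> (reported cases in group, population share in %)
germany_groups = {
    "80+": (11_165, 6.9),
    "60+": (34_505, 28.1),
}

# One estimate of the total actual cases per reference group
estimates = {
    group: estimate_actual_cases(cases, share)
    for group, (cases, share) in germany_groups.items()
}

for group, estimate in estimates.items():
    print(f"Reference group {group}: about {estimate:,.0f} actual cases")
```

With these inputs the 80+ reference gives roughly 161,800 cases and the 60+ reference roughly 122,800. The second figure sits slightly below the headline total of 127,574, which suggests that the age breakdown covers somewhat fewer cases than the headline count (11,165 cases being 9.3% implies about 120,000 cases with a known age). So it is best read as "about the same as reported" rather than as fewer infections than reports.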

To verify the assumptions behind this approximation, one can look for a model country with far more tests than reported cases, for example South Korea (50:1) or Iceland (20:1). In these countries, the age distribution of the reported cases should be very close to the age distribution of the country's population. It might not work, however, since the numbers of infected are quite small and could be concentrated in a specific group in society, as in South Korea, where a lot of the infected are young people from the Shincheonji Church. I will need to take a look at that, maybe in Part II.

Try this method on a country that interests you and tell us what you find. If you have another way to guess the actual number of cases, let us know :)

Thank you!