# Estimating Coronavirus Prevalence by Cross-Checking Countries

*[If you just want the raw numbers, click **here**, or see the **git repo**.]*

As we scale up efforts to combat Covid-19, a big unknown is the infection prevalence in different countries. Since testing is bottlenecked in most countries, the number of confirmed cases underestimates the true number of cases, and this underreporting rate likely varies across countries. Given this, how can we determine whether a low case count in a country corresponds to few infections or to insufficient testing?

Here I present a method to try to estimate the infection prevalence in different countries. The basic idea is to make use of the data from Singapore and Taiwan, which both perform thorough testing (especially at the border) and also have publicly released data that attributes imported cases to country of origin. By looking, for instance, at the fraction of German visitors to Singapore who test positive for Covid-19, and normalizing this by the overall number of Germans visiting Singapore, we can get some idea of the prevalence of coronavirus in Germany.

One problem with this method is that German travelers are much more likely to have Covid-19 than a random resident of Germany. However, if we assume that the ratio between prevalence among travelers and prevalence among residents is roughly constant across countries, we can at least estimate *relative* infection prevalence between different countries, and benchmark this against reported infections in countries where testing is widespread enough to get reliable estimates of case counts.
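As a minimal sketch of the ratio idea described above, the snippet below divides imported cases by visitor counts and expresses each country relative to a reference country. The function name, the country labels, and every number are hypothetical placeholders, not values from the Singapore or Taiwan data.

```python
def relative_prevalence(cases, visitors, reference):
    """Cases per visitor for each country, divided by the reference country's rate."""
    rates = {
        country: cases.get(country, 0) / n_visitors
        for country, n_visitors in visitors.items()
        if n_visitors > 0
    }
    if rates.get(reference, 0) == 0:
        raise ValueError("reference country needs at least one case and one visitor")
    return {country: rate / rates[reference] for country, rate in rates.items()}


# Hypothetical inputs, for illustration only.
cases = {"Country A": 6, "Country B": 2}
visitors = {"Country A": 30_000, "Country B": 40_000}
relative = relative_prevalence(cases, visitors, reference="Country B")
```

Only the ratios between countries carry meaning here; the absolute cases-per-visitor rate still includes the unknown traveler-versus-resident factor.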

Below I describe how to implement this method using the Singapore and Taiwan data, together with some conclusions, including identifying some countries where the true prevalence of disease may be substantially larger than testing indicates. Since the data are currently both limited and noisy, current estimates based on this method can at most be suggestive. However, as more data becomes available this type of cross-checking across countries could give us a better picture of infection prevalence, especially in regions that lack sufficient testing.

# Data Sources

To apply this method, I collected the following datasets:

- Outbound tourism statistics for Singapore and Taiwan, as reported by the United Nations World Tourism Organization (UNWTO). This counts the number of residents of Singapore and Taiwan that traveled to other countries (indexed by country) in each year from 1995 to 2018. I used the 2018 data, and in cases where there were multiple reported numbers I took the largest one.
- Inbound tourism for Singapore and Taiwan in January 2020, as reported on each of their government websites.
- Coronavirus case data from the AgainstCovid websites for Singapore and Taiwan.
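The "take the largest reported number" step for the tourism statistics might look like the sketch below. It assumes the data has already been reduced to `(country, count)` rows for 2018, which is a hypothetical layout; the actual file format may differ, and the numbers are made up.

```python
def largest_reported(rows):
    """Collapse (country, count) rows to the largest count per country."""
    largest = {}
    for country, count in rows:
        if count is None:  # a series with no figure for this year
            continue
        largest[country] = max(count, largest.get(country, count))
    return largest


# Hypothetical rows: the same country can appear under several reporting series.
rows_2018 = [
    ("Country A", 120_000),
    ("Country A", 150_000),
    ("Country B", None),
    ("Country B", 80_000),
]
travel_2018 = largest_reported(rows_2018)
```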

The AgainstCovid data attributes each imported case to its country of origin. I took this as given, but the data are actually quite noisy because many travelers visited multiple countries but the data only report a single country of origin. It’s unclear to me how the country is chosen when there is more than one.

Another issue is that I would like to differentiate between Singapore residents returning from a trip, and foreign visitors coming to Singapore, since I expect infection prevalence to be very different between these two groups. This isn’t directly specified in the data, but I approximated it by checking whether the nationality was “Singapore” or if “Singapore” is used in the case description (to catch people who are work pass holders but not permanent residents). However, inspecting the data carefully reveals many cases where this yields errors.
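A rough sketch of this residency heuristic is shown below. The field names `nationality` and `description` and the example case texts are assumptions for illustration; the real case records may be structured differently.

```python
def is_returning_resident(nationality, description, home="Singapore"):
    """Heuristic: treat a case as a returning resident if the nationality is the
    home country or the home country's name appears in the case description."""
    home_lower = home.lower()
    return (
        nationality.strip().lower() == home_lower
        or home_lower in description.lower()
    )


# Hypothetical case records, for illustration only.
work_pass_holder = is_returning_resident("India", "Singapore Work Pass holder, travelled abroad")
tourist = is_returning_resident("UK", "Tourist who arrived in Singapore from London")
```

The second example is flagged as a resident even though it describes a visitor, which is the kind of error this substring check can produce.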

Combining this data yields 4 data sources: Singapore inbound (visitors), Singapore outbound (returning residents), Taiwan inbound, and Taiwan outbound. You can find the raw data files and a processing script in this git repo. Each dataset contains, for each country, an estimate of the amount of travel and the number of confirmed cases.

# Modeling

Now, how can we model this data? For a country of origin i, and data source j, we’ll make the following definitions:

- **A**ᵢⱼ: overall level of travel (monthly or annual) for country i in source j
- **N**ᵢⱼ: observed number of cases from country i in source j

We’ll make the modeling assumption that **N**ᵢⱼ follows a Poisson distribution with rate parameter **A**ᵢⱼ × **λ**ᵢ × **α**ⱼ. In other words, the expected number of cases equals the total amount of travel, times a source-dependent multiplier **α**ⱼ (accounting for the difference between annual and monthly travel, and for inbound versus outbound risk factors), times a country-dependent multiplier **λ**ᵢ (the infection prevalence in country i).
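One way to fit a model of this form is maximum likelihood with alternating closed-form updates, sketched below. This is not necessarily the fitting procedure used in the repo. The function name and the synthetic inputs are hypothetical, and the sketch assumes every country and every source has positive travel and at least one case. Because multiplying every λ by a constant and dividing every α by the same constant leaves the rates unchanged, the sketch pins the first source's α to 1.

```python
def fit_poisson_factors(travel, cases, n_iter=500):
    """Fit lambda_i (per country) and alpha_j (per source) by maximum likelihood.

    travel[i][j] is A_ij and cases[i][j] is N_ij. A travel entry of None marks
    a missing (country, source) pair, which is skipped.
    """
    n_countries, n_sources = len(travel), len(travel[0])
    seen = [[travel[i][j] is not None for j in range(n_sources)]
            for i in range(n_countries)]
    lam = [1.0] * n_countries
    alpha = [1.0] * n_sources
    for _ in range(n_iter):
        # Zero gradient in lambda_i: lambda_i = sum_j N_ij / sum_j A_ij * alpha_j
        for i in range(n_countries):
            num = sum(cases[i][j] for j in range(n_sources) if seen[i][j])
            den = sum(travel[i][j] * alpha[j] for j in range(n_sources) if seen[i][j])
            lam[i] = num / den
        # Zero gradient in alpha_j: alpha_j = sum_i N_ij / sum_i A_ij * lambda_i
        for j in range(n_sources):
            num = sum(cases[i][j] for i in range(n_countries) if seen[i][j])
            den = sum(travel[i][j] * lam[i] for i in range(n_countries) if seen[i][j])
            alpha[j] = num / den
    # Resolve the scale ambiguity by fixing alpha_0 = 1.
    scale = alpha[0]
    return [value * scale for value in lam], [value / scale for value in alpha]


# Synthetic check: build noise-free "counts" from known factors and refit them.
true_lam = [0.001, 0.004, 0.002]
true_alpha = [1.0, 3.0]
travel = [[100_000, 50_000], [200_000, None], [30_000, 80_000]]
cases = [
    [0.0 if a is None else a * true_lam[i] * true_alpha[j] for j, a in enumerate(row)]
    for i, row in enumerate(travel)
]
fitted_lam, fitted_alpha = fit_poisson_factors(travel, cases)
```

With real, noisy counts the fitted values would not match any "true" factors exactly, and only the ratios between the fitted λ values are identified by the data.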

If I fit the parameters **α** and **λ** to this data, I get the following results (rows where I don’t trust the numbers are highlighted with a corresponding note). Medium doesn’t handle tables well, so see here for the underlying spreadsheet. For each source I give the estimated prevalence based on that source, the observed case count, and the predicted count (plus or minus one standard deviation) under the model. I also give an overall estimate based on all sources. *[**Important note**: the relative ratios of these numbers are more reliable than the overall prevalence. The overall prevalence is based on my estimate of the global prevalence, but that estimate is extremely noisy and continues to change substantially.]*
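For reference, the per-source quantities described above could be computed along the following lines. This sketch assumes that "one standard deviation" means the Poisson standard deviation (the square root of the mean), which is my reading rather than something confirmed by the spreadsheet; the function names and the numbers are hypothetical.

```python
import math


def predicted_count(travel, prevalence, source_multiplier):
    """Return (mean, standard deviation) of the modeled case count."""
    mean = travel * prevalence * source_multiplier
    return mean, math.sqrt(mean)  # a Poisson variable's variance equals its mean


def per_source_prevalence(observed_cases, travel, source_multiplier):
    """Prevalence implied by one source alone: N_ij / (A_ij * alpha_j)."""
    return observed_cases / (travel * source_multiplier)


# Hypothetical numbers, for illustration only.
mean, std = predicted_count(travel=50_000, prevalence=0.002, source_multiplier=0.1)
single_source = per_source_prevalence(observed_cases=7, travel=50_000, source_multiplier=0.1)
```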

A couple of caveats are in order. First, just as reported case numbers carry a lot of unknown error, so do these estimates. They are based on modeling assumptions that likely do not hold, and the underlying data itself is very noisy. I think the best way to approach this data is as a source of **hypothesis generation** and **sanity-checking**, rather than as an authoritative estimate of results.

This is clear just from looking at some of the numbers; the estimated infection prevalence of 0.05% in Italy is clearly too low, while the Egypt prevalence is too high — a prevalence of 1.94% would be substantially higher than the point at which Italy’s medical system was overrun, and we haven’t seen that in Egypt. For Egypt, we can explain this because 5 of the 7 SG Outbound cases were from the same trip (Turkey has a similar explanation). Italy’s low apparent prevalence is likely due to travel restrictions suppressing the number of cases observed in other countries. This highlights a more general issue, which is that tourism in all countries has likely decreased and the level of decrease is probably country-dependent, reflecting either overall concern or the presence of travel restrictions. This is one way in which the “constant scaling factor” assumption is likely wrong.

Perhaps the most important way in which the constant scaling factor assumption is wrong is that the overall degree of disease penetration in a country **depends on geography**, where major travel hubs will get infected first, then other cities, then more remote areas. How far this has progressed depends on the connectedness of the country and date of first infection, and depending on these factors these estimates may be primarily indicating the rate of infection in major travel hubs and other well-connected cities. My guess is that in many European countries disease penetration is fairly thorough, while it is possible that in the US rates are significantly lower in poorly-connected cities than in hubs such as SF and NYC.

Finally, the overall infection prevalence is scaled by a completely unknown global constant (i.e. the data is equally consistent with every infection prevalence being exactly twice as high or twice as low as stated here). I chose a scaling factor that feels consistent with what I personally believe based on looking at numbers of cases and deaths in different countries (more on this below).

# Key Take-Aways

Despite the above caveats, the estimates of prevalence are reasonably consistent across sources: the estimated UK infection prevalence from different sources is 0.65%, 0.33%, and 0.60%. The US prevalence is 0.19%, 0.14%, 0.20%, or 0.08% depending on source. China is 0.03%, 0.08%, 0.01%, or 0.13%. Countries with less data show less consistency, but this is compatible with statistical fluctuations due to the very small N that we’re dealing with here.

If we take these numbers at face value, then the **Philippines** has a high rate of infection relative to confirmed cases and deaths, as does **Egypt**, even after accounting for 5 of the 7 Egypt cases having a single source. Moving further down, the 0.04% prevalence in **India**, while small, is much higher than suggested by the current case count. The Philippines and India have been scaling up efforts to fight Covid-19, but it is harder to tell for Egypt, as there has been suppression of news around Covid-19 in Egypt. Testing and equipment targeted at these countries may be particularly valuable.

Some of these conclusions would seem to contradict other data sources. For instance, a 0.04% prevalence in India would lead to enough pneumonia deaths that it should be clear in the data, which we don’t currently see. One possibility is that 0.04% is a better estimate of the prevalence in major cities. Another possibility is that some cases are accidentally attributed to India due to the high prevalence of Indian travelers and foreign workers.

**Improving the data.** This data and the conclusions were based on a quick and dirty analysis, using crude data cleaning steps and the first set of tourism statistics I could get my hands on. I wouldn’t be surprised if there are bugs in the code or in the raw data. I’ve released the code and data (despite them being embarrassingly poorly-documented) so that others can scrutinize it and hopefully build on the analysis. As Taiwan and Singapore data continue to come in, and ideally as other countries release their data, we can begin to have more confidence in some of the conclusions as well.

# Estimating the Overall Infection Prevalence

As mentioned above, all of the numbers in the table are only known up to an unknown multiplier (which we can think of as representing the overall infection prevalence globally). To estimate this, I looked at reported cases and deaths in several countries as of March 20th. As before, see here for the original spreadsheet.

Case prevalence is just the number of cases divided by the population (expressed as a percentage, so taking it literally suggests a prevalence of 0.004% in the UK). Death prevalence is the number of deaths, times 500, divided by the population. I multiplied by 500 because the mortality rate of Covid-19 is on the order of 1%. This would initially suggest multiplying by 100, but since it takes a while to die after getting infected, we need to multiply by a larger number to account for the additional doublings of the number of cases that happen between infection and death. My very rough estimate is that the doubling time is 4.5 days, so if we think it takes roughly 10 days from infection to death we should multiply by an additional factor of 2^(10/4.5) ≈ 4.7, or roughly 5, giving us 500.
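The arithmetic behind the multiplier can be checked with a short sketch. The default parameter values are the rough estimates from the paragraph above; the function names and the example country are hypothetical.

```python
def growth_between_infection_and_death(doubling_time_days=4.5, days_to_death=10.0):
    """Factor by which infections grow between one person's infection and death."""
    return 2 ** (days_to_death / doubling_time_days)


def death_multiplier(fatality_rate=0.01, doubling_time_days=4.5, days_to_death=10.0):
    """Estimated current infections per observed death."""
    growth = growth_between_infection_and_death(doubling_time_days, days_to_death)
    return growth / fatality_rate


def death_prevalence_percent(deaths, population, multiplier=500):
    """Death-based infection prevalence, as a percentage of the population."""
    return 100.0 * deaths * multiplier / population


growth = growth_between_infection_and_death()
multiplier = death_multiplier()
# Hypothetical country with 10 deaths among 1,000,000 residents.
example_percent = death_prevalence_percent(deaths=10, population=1_000_000)
```

The unrounded multiplier comes out slightly below 500, and it is sensitive to both the doubling time and the infection-to-death delay, so it should be read as an order-of-magnitude figure.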

I think of the death numbers as being more reliable because you’re more likely to test people for Covid-19 if they end up in an ICU, whereas it’s likely that testing misses many mild cases. Indeed, we need to multiply the case prevalence by about 20 to match the estimates obtained from either deaths or our method, although the actual factor varies across countries.

Looking at the numbers, you’ll notice that France and the Netherlands have 0.35% and 0.31% infection prevalence based on the number of deaths, which is reasonably consistent with the 0.27% and 0.23% obtained via our method. Meanwhile Spain has 1.17% compared to the 0.75% in the table above. These countries have more thorough testing than the UK and the USA, where the death-based prevalence estimates are much smaller than those computed with the Taiwan/Singapore method. I chose the global normalization constant to roughly match the estimates of the countries that seem to have the most thorough testing.
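The constant here was chosen by eye, but one systematic alternative would be the geometric mean of the ratio between death-based and travel-based estimates over the well-tested countries. The sketch below is that alternative, not the procedure actually used, and the function name is hypothetical. The inputs are the percentages quoted above; since those travel-based figures are already normalized, the result is the leftover factor between the two sets of estimates rather than the original constant.

```python
import math


def normalization_constant(travel_based, death_based):
    """Geometric mean of death_based / travel_based over the shared countries."""
    shared = [country for country in travel_based if country in death_based]
    log_ratios = [math.log(death_based[c] / travel_based[c]) for c in shared]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Percentages quoted in the text for the three well-tested countries.
travel_based = {"France": 0.27, "Netherlands": 0.23, "Spain": 0.75}
death_based = {"France": 0.35, "Netherlands": 0.31, "Spain": 1.17}
residual_factor = normalization_constant(travel_based, death_based)
```

A geometric mean is used because the quantities being matched are ratios, so over- and under-estimates by the same factor cancel out.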