CORONA: Why the data challenge is real

Eliezer (Ilja) Schwarzberg
Published in Inside WEEL
Apr 27, 2020

Data is fun!

Data is in!

Data is complicated…

You can view the text below as an example of how data scientists look at data problems. And since Corona is on everybody’s mind, it makes for a great showcase of a highly complex real-life problem.

Hence the text below is meant for people interested in either

  • why COVID-19 is such a complex case for (data) science or
  • what questions have to be addressed in data research

Corona is happening, as everybody has noticed. News agencies keep us up to date with every possible estimate of what happens if, when and how.

The issue is just: how reliable are the estimates?

Models have predicted every possible number of fatalities in the USA alone: from the current 67 thousand up to the 2.2 million projected a month ago.

Chart from Imperial College model 2020–03–16
https://covid19.healthdata.org/united-states-of-america on 2020–04–25

Once you get estimates that far apart, questions should be asked. Among others: how can both numbers be reported with a straight face, expecting the public to respect the “science” behind them? The difference is 33-fold in some 5 weeks. It is almost like telling a college graduate that she is going to make between $30,000 and $1,000,000 a year. Yeah, that is very likely to be true even if you do not know what she studied and how well she did.

In data science, most problems are called either regressions or classifications. Regression is about a quantitative prediction; in the example before, it would be the yearly salary of a graduate. Classification is often just about predicting a binary outcome. In the COVID context, deciding whether a person has the virus is a classification problem, while predicting the number of fatalities is a regression problem. No one disagrees that COVID-19 is an issue we have to deal with, so the binary challenge we mastered pretty well. However, quantifying the impact is a very different story.

You probably also noticed that the predictions change constantly and, worst of all, public policy is based on them. For a normal citizen it must feel at least confusing to be locked down at home with these jumping numbers as justification. Many folks out there feel rather frustrated with a government, at any level, that took away their job, business and freedom without a strong enough reason.

So far I have told you little that is new.

Paraphrasing Mark Twain:

If you do not use data for COVID predictions you are clueless. If you do, you are misguided.

The insight I want to share is based on my everyday job. I am a data person in a (still) small but highly data-oriented company. In this role I often have to accompany the data from the “point of birth” to the ready-to-go product: be it fraud notification, management dashboard or crucial machine-learning prediction. In any case, the single biggest take-away from this process is the well-known: Garbage in, garbage out.

Dilbert: Garbage in, Garbage out

All the fancy methods of calculating and presenting will not help if your input is not reliable.

The COVID-19 situation is a perfect example of why data quality matters. Below I list a couple of points, each a big enough challenge to make a reliable prediction based on the data practically impossible.

1. Definitions

In every data problem you need to spend some effort defining the terms. For example, if presenting earnings data, are you talking about EBITDA, net income or some stage in between?

What is a COVID-19 fatality?

One of the most important questions is who actually died of COVID. Depending on the answer, the virus is to be considered more or less severe. However, as you will see below, there are at least two radically opposite ways of defining a COVID-19 fatality. To make things worse, plenty of shades in between the two are possible and plausible.

a) A person with a positive COVID-19 test passes away before recovering

→ This relies on testing and assumes (we will get to assumptions later) that a person with COVID died because of COVID. Sometimes this will be true, but just as often it will be false. The vast majority of fatalities are elderly people with multiple pre-existing conditions: how can we know for sure they died of COVID and not of one of the other issues they had?

Overestimating the fatalities

b) A person had COVID-19 symptoms before passing away — even without a test

→ Cough and fever are the most common symptoms. Not everybody who died recently with fever and/or cough was tested. Does that mean they did not have COVID? Some of them surely did.

Underestimating the fatalities

Who is sick with COVID-19?

a) A person with a positive test with and without symptoms

b) A person with a positive test and symptoms

c) A person with symptoms even without a test

As you can see, the definition of who is sick is highly correlated with how reliable and available the tests are. Since many places were hit by COVID at the same time, there was no single standard test; instead, multiple countries developed their own. This takes time and leads straight to the next definition problem:

What is the mortality rate of COVID?

In theory the mortality rate is as simple as

mortality rate = number of deaths / number of confirmed cases

However, things are complicated, since we know exactly neither the number of deaths caused by the virus nor the number of people sick with it.

Also keep in mind that deaths trail the cases, so if, at the beginning of the pandemic, you calculate the mortality rate using all cases and all deaths, you will underestimate it by design.

This is an idea you can also find in credit risk: once you start giving out loans, your default rate will be very low, because it usually takes a while to declare a loan defaulted. So what you do in this case is either look only at closed cases (recovery or death) or define a lag. Waiting for cases to be closed can take a very long time. If you check the closed-cases statistics, the numbers actually look scary. The prime reason is that recovery is a comparatively long process, and people often have to stay in the hospital until they are confirmed not contagious before being discharged.

Closed cases USA as of 2020–04–27 [source]

Alternatively, you can assume an “average” lag; in the case of COVID it could be 2–4 weeks, so as to give a sick person a “fair” chance to die, hoping that whoever survived that long will eventually recover. Practically, it means you divide today’s number of total fatalities by the number of cases 2–4 weeks ago:

lagged mortality rate (today) = total fatalities (today) / total cases (2–4 weeks ago)

This approach makes a lot of sense in terms of data, but

  • it is much more complicated to communicate and
  • it lags a few weeks behind the reality.
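The naive and the lagged calculation can be sketched in a few lines of Python. The weekly case and death series below are entirely made-up illustrative numbers, and the 2-week lag is just one plausible assumption:

```python
# Naive vs. lagged mortality rate, with made-up cumulative weekly numbers.
cases  = [100, 300, 900, 2000, 3500, 5000, 6000]   # cumulative confirmed cases per week
deaths = [  1,   4,  15,   60,  140,  250,  360]   # cumulative deaths per week

LAG_WEEKS = 2  # assumed average time from confirmation to death

naive_cfr  = deaths[-1] / cases[-1]              # today's deaths / today's cases
lagged_cfr = deaths[-1] / cases[-1 - LAG_WEEKS]  # today's deaths / cases 2 weeks ago

print(f"naive mortality rate:  {naive_cfr:.1%}")   # 6.0%
print(f"lagged mortality rate: {lagged_cfr:.1%}")  # 10.3%
```

Even with identical inputs, the lagged estimate comes out far higher, which is exactly the underestimation-by-design described above.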

Moreover, if you want to see a current number, you will encounter an additional challenge: testing capacity. The more testing happens, the higher the number of positives, and since that is the denominator, it will dilute your mortality rate drastically. Only a few weeks later will the death toll start to catch up, but in the meantime you will operate with an underestimated mortality rate.

As I said before, I do not know what the solution to all these questions is. But if you want to conduct meaningful research, you need to engage with them and (important!) document everything.

2. Assumptions

Dilbert: Bad assumptions give the answer away

Every data science prediction has assumptions. Some are obvious and some are hidden, but they are always there. For example, when you calculate the lifetime value of a customer you have to assume an inflation rate. For simplicity it is often set to zero, but zero is an assumption. It might work well for stable economies like the USA or Europe, but it will be a severe error for a developing market like Argentina.
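As a sketch of how much this hidden assumption matters, here is a toy lifetime-value calculation with made-up numbers ($100 of yearly revenue for 5 years), discounted at three assumed inflation rates:

```python
# Toy lifetime-value calculation: same customer, three inflation assumptions.
def lifetime_value(yearly_revenue, years, inflation):
    """Present value of future yearly revenue, discounted by annual inflation."""
    return sum(yearly_revenue / (1 + inflation) ** t for t in range(1, years + 1))

print(round(lifetime_value(100, 5, 0.00), 2))  # 500.0  -> the "zero inflation" assumption
print(round(lifetime_value(100, 5, 0.02), 2))  # 471.35 -> a stable economy
print(round(lifetime_value(100, 5, 0.50), 2))  # 173.66 -> a high-inflation market
```

The "simple" zero assumption overstates the value roughly threefold in the high-inflation case.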

Many assumptions are made for COVID predictions, among others

  • Country A can use country B’s statistics
  • Positive test = person has COVID
  • If nothing works, let’s compare the hospitalization figures

Assumptions about other country’s data

After dealing with your own definition issues, you probably come to the conclusion that there is not enough data, and you would like to find more. Since the historic data is very limited, you might take another country’s data on cases and fatalities. You also want to see how your own country, with its policies, compares to other countries and theirs. After all, maybe you can learn from them.

Comparing countries is a classic double-edged sword. And this is true not just because different countries count their dead differently.

i. Countries are different

ii. Data from other places is not 100% reliable

Recently, multiple stories have come out claiming that China misrepresented its COVID statistics. Similar claims are heard about Iran. Early models in particular relied a lot on these two countries, as they were hit hard and early by the virus. As we data people like to say:

Do not rely on any statistic you did not fake on your own

So, of course, you might say: data is not perfect, but it is at least something. And something is better than nothing…

In the world of data prediction, something can often be worse than nothing. For example, you create a model for stock-market prediction. You feed it the historic data, including the current crash of 30% within a month, without telling the model about COVID. Your model will do significantly worse on March/April 2020 data than a model without that input. For many data problems, a partial truth is worse than no data at all.

Assumptions about the test quality

What makes a good test? — It is a simple question with a simple answer:

  • people who have the virus get positive result
  • people who do not have it get negative result
  • the test returns the same answer if we did it again (assuming the person was not exposed between the tests)

So far so good. But how do we measure test quality? In the media you hear about tests with 95% accuracy. That sounds impressive, but what exactly does it mean?

95% accuracy = of 100 people who took the test, 95 got a result matching their situation, i.e. the infected tested positive and the not infected tested negative, while 5 people got a wrong assessment.

Accuracy is a reasonable metric for balanced problems, meaning there is a roughly equal number of positive and negative instances in your sample, for example “will the Pirates win a given game once Corona is over?”

Imbalanced problems, on the other hand, are very different. Imbalanced in this context means a high discrepancy between the number of positive and negative instances: for example, predicting which person in an airport is a terrorist. Typically there will not be more than 1 terrorist per 1,000 people in the airport.

Hence, by always predicting there is no terrorist, your “algorithm” will be 99.9% accurate. You will be wrong in just 1 out of 1,000 cases…
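A minimal sketch of that trivial “never a terrorist” classifier, with the hypothetical 1-in-1,000 base rate from above:

```python
# A trivial classifier that always predicts "negative" on imbalanced data.
labels = [1] + [0] * 999      # 1 terrorist among 1,000 travellers (hypothetical)
predictions = [0] * 1000      # the "nobody is a terrorist" model

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"{accuracy:.1%}")  # 99.9%
```

An impressive-looking accuracy, from a model that never catches a single positive case.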

Therefore, for imbalanced classification problems data scientists use precision and recall. Roughly speaking: precision is how many of the cases you predicted positive were actually sick; recall is how many of the sick people you managed to identify.

Example:

From 100 people that were tested:

  • 8 had the virus
  • 7 got a positive test result
  • 5 had the virus and got a positive test result
  • 2 got a positive test but no Corona (False positive)
  • 3 had the virus but were not identified (False negative)
  • overall for 95 people the test gave the correct assessment

However,

  • precision = 5/7 = 71.4% (only 5 of 7 people with positive tests actually have Corona)
  • recall = 5/8 = 62.5% (only 5 of 8 people with corona were actually identified)

The numbers from the example above suggest that

  • if 1,000 people were diagnosed with COVID,
  • only about 714 actually had it,
  • while an additional ~429 infected people were missed

→ As you see now 95% accuracy is not at all impressive
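The worked example can be reproduced in a few lines of Python. The confusion-matrix counts come straight from the bullet list above, and the scaling to 1,000 diagnoses is simple proportion:

```python
# Confusion-matrix counts from the worked example: 100 tested, 8 infected.
tp = 5   # infected and tested positive
fp = 2   # tested positive but not infected (false positives)
fn = 3   # infected but tested negative (false negatives)
tn = 90  # correctly tested negative (100 - 5 - 2 - 3)

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f"accuracy:  {accuracy:.1%}")   # 95.0%
print(f"precision: {precision:.1%}")  # 71.4%
print(f"recall:    {recall:.1%}")     # 62.5%

# Scale to 1,000 positive diagnoses, keeping the same error proportions:
diagnosed = 1000
true_cases_found = round(diagnosed * precision)  # ~714 real cases among the diagnosed
missed = round(diagnosed * fn / (tp + fp))       # ~429 infected people not identified
print(true_cases_found, missed)  # 714 429
```

The same 95% accuracy thus hides a test that misses more than a third of the infected.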

To make things more complicated, bear in mind that initially, tests were sparsely available and given almost exclusively to people who likely had COVID, i.e. with severe symptoms and close proximity to another confirmed case. Tests designed to identify highly likely cases will probably underperform for the general population. And they did.

After casting doubt on tests, fatalities and seemingly everything, maybe we can rely on the number of people admitted to hospital because of COVID-19?

Is hospitalization rate or number of hospitalizations a good proxy for severity of COVID?

Someone might suggest that, because all the other statistics are inaccurate, we should just look at the number of people admitted to hospitals, either in total or with COVID symptoms.

This is a dicey idea.

As you know, many government policies work toward “flattening the curve”: slowing down the spread of the virus while ramping up the availability of ICU beds, ventilators etc. In other words, the more time passes, the more capacity there is to deal with patients. And this means more people will be admitted to hospitals, since there are fewer resource restrictions. Crucially, people who under other circumstances would not be hospitalized are now admitted. So we end up with more people in the hospital.

This is to say that the number of people hospitalized is a bad proxy for the severity of the situation, since it is both an input and an output of the process.

You will object that more people go to hospitals because more people are infected and feel sick. This is most likely true, but the impact of the virus and the loosened admission conditions point in the same direction: a higher hospitalization rate and more people hospitalized. Separating these two effects is a hard task in itself. (Reminder: you are in the section about assumptions.)

COVID-19 associated weekly hospitalizations in USA per 100k population [source]

As you see in the chart above, the number of weekly hospitalizations rose from under 1 per 100k population in week 11 to over 7 in week 14. This steep increase can be attributed to both:

  • more resources deployed to deal with the virus
  • virus spreading and affecting more people

To make things worse, there is another very important effect that should be considered in the hospitalization change: People change their behavior.

And this one can go either way:

  • Some will be very concerned and run to the hospital as soon as they have even mild symptoms or have been in contact with a person with COVID
  • Others however will avoid the hospital no matter what because they are concerned with catching COVID in the hospital

As said earlier, I do not have solutions for all the challenges mentioned above, but I wanted to point them out. We live in a complex world, and although our computing power and creativity are amazing, we still have to approach real-life challenges with some humility. There is so much we do not know!

In this regard, I clearly do not envy any of the government representatives who have to make decisions and justify them to the public.

In the real world, if you do not use data you are clueless. And if you do, you might end up misguided.
