Navigating safely through the sea of COVID-19 Data

Konstantina Slaveykova
8 min read · Apr 13, 2020


Looking at data is not enough: we must think critically about its limitations, assumptions & context

Access to information is a blessing for those of us who are congenitally infected with curiosity about the world, but it can also be a curse. Overexposure to data can easily conjure up the illusion that we are informed enough to truly understand complex issues beyond the scope of our training and knowledge.

Data is ubiquitous. We are under a constant barrage of information, summed up in think pieces or propped up in graphs and infographics. The more an issue impacts everyday life, the more tempted we are to confuse mere access to information with an in-depth understanding of it. COVID-19 is a major case in point.

We are in the centre of a media storm, with information coming in from all kinds of sources: from traditional media coverage to user-generated content (UGC) by scientists and non-experts alike. Sometimes it is easy to spot a myth or conspiracy theory, but not always. A widely shared social media post which “quoted” scientists from Johns Hopkins University turned out to be completely fabricated, and many media outlets share data analyses without checking their reliability.

Source: The Johns Hopkins University Center for Systems Science and Engineering (CSSE) COVID-19 Dashboard

The data is still coming in & there is a lot we do not know yet

A growing number of data analyses of dubious quality (nicely dubbed “slipshod data-crunching” by some science journalists) can easily slip in among legitimate ones. This is because reliable data is still coming in and is still being evaluated.

SARS-CoV-2, the virus that causes COVID-19, belongs to a larger family of coronaviruses which has been known to the scientific community since the 1960s. However, it is a novel strain, with a different disease spectrum and transmission efficiency from the two other zoonotic (transmitted from animals to humans) coronaviruses in this family: SARS-CoV and MERS-CoV.

In plain words, this means that even experienced domain experts who know a lot about other coronaviruses still need additional research to address questions about the novel strain: Is this strain of the virus airborne? Is the herd immunity approach viable and how can it be applied without endangering vulnerable populations? Can people get re-infected? Are young people really less vulnerable? What are the chronic impacts on those who recover?

Perhaps the trickiest aspect of collecting and interpreting data about the virus is the long incubation period (2–14 days, with outliers reaching 27 days, compared to 1–3 days for flu), during which infected carriers are asymptomatic. This severely limits our ability to collect reliable data on the number of infected people in a population and makes any estimates of transmission and mortality rates “utterly unreliable”.

The real rate of transmission is opaque because the rate of detected cases depends on how many people develop symptoms and seek medical help, and on the testing regime in each country. Decisions about whom to test and on what scale can dramatically change the demographic profile and the number of reported cases. This makes comparisons between countries highly unreliable: CSSE data show a massive lack of consistency in how testing data is collected and reported (including, at the time of writing, a lack of standardised reporting units).
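
To make this concrete, here is a minimal sketch with invented numbers (not real country data, and the detection rates are pure assumptions) of how testing intensity alone can distort both confirmed-case counts and the apparent case fatality rate:

```python
# Toy illustration with made-up numbers: two countries with the same true
# number of infections and deaths report very different confirmed-case counts
# and apparent fatality rates, purely because of their testing regimes.

true_infections = 100_000   # same underlying epidemic in both countries
deaths = 500                # same true number of deaths

testing_regimes = {
    "Country A (tests mild cases too)":    0.40,  # detects 40% of infections
    "Country B (tests severe cases only)": 0.05,  # detects 5% of infections
}

for country, detection_rate in testing_regimes.items():
    confirmed = int(true_infections * detection_rate)
    apparent_cfr = deaths / confirmed        # deaths per *confirmed* case
    true_ifr = deaths / true_infections      # deaths per *actual* infection
    print(f"{country}: {confirmed:,} confirmed cases, "
          f"apparent CFR {apparent_cfr:.1%} vs true IFR {true_ifr:.1%}")
```

The figures are arbitrary; the point is that the same epidemic can look several times deadlier in one country than in another simply because fewer mild cases are counted.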

Professor Sylvia Richardson (Director of the MRC Biostatistics Unit) and David Spiegelhalter (Professor of the Public Understanding of Risk), both at the University of Cambridge, have compiled an excellent roadmap on which COVID-19 stats we can trust and which we should ignore. They point out that even reliable death-rate data is affected by delays in reporting, and that the accumulated number of deaths is uninformative for spotting trends. We need daily counts, not accumulated totals, but such timely information depends on how consistently the data is updated. There is also the issue of mortality displacement (excess mortality) when trying to register the proper fatality rate for COVID-19 versus other causes of death coinciding in time with the pandemic.
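
As a quick illustration (the figures below are invented, not real reporting data), the daily counts are simply the day-to-day differences of the cumulative series, and it is those differences that reveal a turning point the accumulated total hides:

```python
# Minimal sketch with invented figures: a cumulative death count can only go
# up, so it hides the turning point that the daily counts make obvious.

cumulative_deaths = [10, 25, 55, 110, 190, 280, 350, 400, 430, 445]

# Daily counts = day-to-day differences of the cumulative series.
daily_deaths = [cumulative_deaths[0]] + [
    today - yesterday
    for yesterday, today in zip(cumulative_deaths, cumulative_deaths[1:])
]

print("cumulative:", cumulative_deaths)  # keeps rising throughout
print("daily:     ", daily_deaths)       # rises, peaks, then clearly declines
```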

Source: Our World in Data is a collaboration between researchers at the University of Oxford and the Global Change Data Lab. The website provides detailed information about data sources, assumptions, limitations and caveats. If you pay attention to the information on the right, you can quickly see they have noted the inconsistency in reported testing units in official COVID-19 data across countries: samples, tests performed, units unclear, etc.

The biggest problem with unreliable data on transmission rates is that epidemiological models rely heavily on input parameters. In plain English: a model can accurately represent the probability of infections over time (the spread of the virus) only if you can “feed” into it reliable information on the factors which impact this spread. If you lack consistent data on key variables like transmission and mortality rates, the flawed input will produce faulty output. This is especially true for the early stages of collecting information about a disease: “models fitted by early data probably produce results divorced from reality”.
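
To illustrate the point, here is a hand-rolled, heavily simplified SIR-style simulation (not the code behind any published model, and the parameter values are arbitrary) showing how even a modest change in one assumed input, the transmission rate, swings the projected peak dramatically:

```python
# Simplified discrete-time SIR sketch (illustrative only, arbitrary parameters):
# small changes in the assumed transmission rate produce very different peaks.

def sir_peak_infected(beta, gamma=0.1, population=1_000_000, days=300):
    """Run a basic SIR simulation and return the peak number of people
    infected at the same time."""
    s, i, r = population - 1, 1, 0
    peak = i
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak

# Modest differences in the assumed beta (i.e. in R0 = beta / gamma)
# translate into very different projected peaks.
for beta in (0.20, 0.25, 0.30):
    print(f"beta={beta:.2f} (R0={beta / 0.1:.1f}) -> "
          f"peak infected ≈ {sir_peak_infected(beta):,.0f}")
```

With these toy numbers, the projected peak roughly doubles between the lowest and highest assumed transmission rate, even though the model itself is unchanged: garbage in, garbage out.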

“Models based on assumptions in the absence of data can be over-speculative and ‘open to gross over-interpretation’.” — Ian Sample, Science Editor at The Guardian

“All models are wrong, but some models are useful.”

This famous quote by British statistician George Box captures a scientific attitude that is not entirely evident in media coverage. Overreliance on modelling has many pitfalls, especially when models are used to inform public health policies before there is enough data on their real-life applicability.

Models are useful, but we must be mindful that they rely on a lot of assumptions. Even when they are mathematically sound and supported by prestigious institutions, there is no guarantee they will stand up to scrutiny when applied to the real world.

The UK is an excellent example of how two models from elite institutions, Imperial College and the University of Oxford, can inform starkly different policy conclusions for the same population. The initial “herd immunity” public policy was informed by the Imperial model, which overlooked early-stage measures and used 13-year-old code created for a long-feared influenza pandemic, making the fatal assumption that ICU (intensive care unit) demand for COVID-19 could reliably be treated the same way. The University of Oxford researchers, who tested different outbreak scenarios, were more upfront about relying on assumptions rather than hard data, stressing their model’s usefulness for hypothesis testing.

Broadly speaking, there are two types of models: a general model of the epidemic mechanism (highly uncertain at the onset and refined as more precise data comes in), and a purely empirical model that fits curves to the already observed data (allowing assumptions about how the curve is likely to look in the future). The latter is extremely sensitive to the data points used for fitting, i.e. it is only as reliable as the accuracy of the data fed into it. Scientists are well aware of this, but media coverage often fails to convey it to the general public, creating confusion over how reliable the predictions are.
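
Here is a toy example of the second, purely empirical kind (the case counts are invented): fitting a simple exponential curve to early data and projecting it forward shows how strongly the forecast depends on which data points are included in the fit:

```python
import numpy as np

# Purely empirical curve fitting on invented counts: model cases as
# exp(slope * day + intercept) and project three weeks ahead.
days = np.arange(10)
cases = np.array([3, 5, 9, 14, 24, 40, 65, 100, 150, 210])

def projected_cases(days_used, cases_used, target_day=20):
    # Fitting a straight line to log(cases) estimates the daily growth rate.
    slope, intercept = np.polyfit(days_used, np.log(cases_used), 1)
    return np.exp(slope * target_day + intercept)

# Dropping just the last two observations changes the projection substantially.
print(f"projection using all 10 points : {projected_cases(days, cases):,.0f}")
print(f"projection using first 8 points: {projected_cases(days[:8], cases[:8]):,.0f}")
```

With these made-up numbers, leaving out the two most recent observations (where growth has already started to slow) inflates the three-week projection by roughly half, which is exactly why the choice and accuracy of the fitted data points matter so much.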

So what can we trust? Thorough coverage acknowledges both what we know and what we don’t know about the virus

Do not jump to the wrong conclusions: all these caveats do NOT mean that science is powerless or that data is useless. On the contrary: thoroughly collected and complete data is essential for a proper understanding of the situation. Science is powerful precisely because it works through these issues methodically and thoroughly: it just takes some time to test what truly replicates, and what does not.

Discussing caveats and thinking critically about what we do and don’t know is a crucial part of the scientific process. The current crisis has provided an unprecedented boost to open science, international collaboration and a new culture of doing research, making it more nimble and collaborative than ever.

In an era of fake news and absurd conspiracy theories, we must stick to fact-checked information more diligently than ever. After all, good science is not about preaching dogmatic truths: it is about thinking critically, relentlessly testing hypotheses and continually self-correcting older beliefs that are no longer supported by the available evidence.

Source: Dillan Shook, Unsplash

Beware the rise of Sofa Statistics & Armchair Epidemiology

Amid the current flood of information, datasets and graphs on the topic, we must be especially diligent about critical thinking and about evaluating the quality of our sources.

It is commendable that non-experts are taking an interest in data and trying to make sense of the pandemic. The crisis has incentivised people to think about data, to look for access to official statistics and to ponder the available datasets.

The problem is that many people who have no background in epidemiology, virology, medicine, science or statistics use the available data not as a learning opportunity but as a teaching one. Suddenly talk shows, opinion pieces and social media posts are overflowing with what Yale epidemiologist Gregg Gonsalves dubbed armchair epidemiologists.

Media pundits and ordinary people are now sharing improvised analyses and passing judgement on pandemic trends and public health policy, despite having heard the terms herd immunity, R0 and epidemic curve for the first time just a week ago.

If I may build on the term: armchair epidemiologists often double as sofa statisticians. People whose experience is limited to simple data reporting (descriptive stats, bar graphs, trend lines, pie charts) become overconfident in interpreting complex datasets and statistical models for which they have no training.

As NYU science journalist Tim Requarth eloquently pointed out:

“Epidemiological amateurs make faulty assumptions, get basic principles wrong, or just pull numbers out of thin air.

“Opining with numbers just because you use numbers in your day job… is overestimating your abilities while lacking self-awareness of your own incompetence.” — Tim Requarth

It is a fallacy to believe that being exposed to some data at your workplace makes you well equipped to understand (super complex) data in any other area, especially if you then add freestyle interpretations of what the data means.

For example, comparing “marketing virality” to actual epidemiological models is just gibberish hidden behind jargon and marketing lingo. Unfortunately, the obscurity of the phrase creates a false sense of complexity, and lay audiences can easily be tricked into thinking it is a reliable comparison.

Rule of thumb: when people talk about “actionable epidemic bubbles” and “optimising virality”, turn on your MCBR (Meaningless Corporate Buzzwords Radar) ASAP.

Be respectful when people challenge you, have humility and be honest about the things you don’t know.

Tips on How to Approach Data Uncertainty

  • Think critically: are there buzzwords and bold claims (signs of riding high on the Dunning-Kruger curve), or is the text balanced and realistic in admitting blind spots and limitations (experts are transparent and confident in acknowledging what they do not know)?
  • ASK those who are better trained! Instead of reading an article indiscriminately and preaching its conclusions to others, send a link to people with statistical or scientific training and ask them to help you understand and evaluate it (I sent a draft of this article to my statistics professor before publishing).
  • Tap into the online “hivemind”! (No, don’t ask random people on Facebook!) Scientists are very active on Twitter: tweet a link to what you are reading to domain experts or scientists skilled in complex data analysis and ask them if they spot methodological mistakes & data inconsistencies.
  • Use it as a learning opportunity! Instead of glossing over what we don’t understand, we can use it as an opportunity to read more on a topic or even sign up for a course on asking better statistical questions.


Konstantina Slaveykova

Perpetually curious, always learning | Analyst & certified Software Carpentry instructor | Based in Wellington, NZ