What you should know about COVID-19 data

Understanding the COVID-19 data and its deficiencies, along with the reasons behind them, is essential to protect ourselves.

Published in COVID-19 DATA on Apr 5, 2020 (18 min read)

Since the end of February, I have been looking at graphs to reason about COVID-19 (pronounced co-vid-19), the disease caused by the novel coronavirus named SARS-CoV-2 (pronounced sars-co-v-2). I bet you are doing the same thing. The rapid spread of SARS-CoV-2 is making data collection a very demanding task, resulting in sparse and incomplete data. If we are not careful, the conclusions we derive from this data can be very wrong. Unfortunately, a considerable number of news articles and posts include incorrect information because they overlook the details of how the data they are using is collected and what it means. This adds to the pool of unknowns we have to deal with during this fast-evolving pandemic.

To protect ourselves from incorrect information, we should understand the deficiencies in the available data. This way, we can focus on real issues instead of false ones. Moreover, we should understand why and how a claim can be backed by numbers and still be incorrect.

EXISTING DATA

SARS-CoV-2 started spreading among people in China in December 2019. China reported COVID-19, the disease caused by SARS-CoV-2, to the World Health Organization as pneumonia of unknown cause on December 31, 2019. As a result, the first publicly available data entry belongs to China: 548 cases and 17 deaths, reported on January 22, 2020.

Johns Hopkins University COVID-19 Tracking Dashboard on April 4, 2020. Source: JHU CSSE

At this point, there are hundreds, maybe thousands, of graphs and dashboards available online about the spread of COVID-19. Most of these, like the one above, use one of the publicly available data sets. A few institutions, some of which are listed below, maintain these data sets using announcements made by state and government officials:

Worldometers (World Data)
Johns Hopkins University (World Data)
Our World in Data (World Testing Data)
New York Times (U.S. Data)

Even though newspaper articles, blog posts, and academic papers draw on the same underlying data sets, they sometimes reach very different conclusions. To understand why, it is important to know how data on COVID-19 is collected.

DATA COLLECTION

To understand how COVID-19 is spreading, officials are collecting and announcing the following: number of tests, number of cases, number of deaths, number of patients in critical condition, and number of recoveries. To be able to make sense of these figures, we need to look at how they are collected.

The number of tests: The number of tests for a given day refers to the number of tests that finished processing on that day. For a given country, this number is determined by how rapidly tests are produced and processed. As a result, when test production and processing speeds increase, the number of tests per day rises as well. While some countries ramped up testing very quickly, others required longer to achieve better testing efficiency. Moreover, testing data is not included in most charts that are presented to the general public, making it difficult for people to put the following numbers into context.

The number of cases: The number of cases for a given day is determined by the number of positive test results collected on that day. An individual infected with SARS-CoV-2 is included in the case count only if they get tested and receive a positive result. Because there is only a limited number of tests available and processing times can be long, testing is not available for every individual. In the U.S., at the time of writing, only high-risk individuals with symptoms are getting tested. As a result, official case numbers leave out many individuals who contracted COVID-19 (mainly the ones who have milder or no symptoms). How large these uncounted cases are relative to the official case count is not yet known.

The number of deaths: The number of deaths for a given day refers to the number of patients who passed away on that day and who had previously received a positive test result. If an individual contracted COVID-19 and passed away before getting tested, they may not be counted in the number of deaths. In some cases, if an individual was not tested or their test was pending, the test can be completed post-mortem. If a hospital needs to use its resources to help other patients, it might prioritize testing living patients over completing post-mortem tests. Even though this would exclude some deaths from the total death count, it would be a reasonable choice to make when lives are at stake. It is also essential to think about the opposite case of an individual who tested positive but passed away due to other reasons. We do not know whether announced death numbers include these rare cases.

The number of patients in critical condition: This number represents how many patients with a positive test result show critical symptoms on a given day. If a person is in critical condition, but they never went to the hospital or did go to the hospital but did not get tested, they are probably not included in this count.

The number of recoveries: The number of recoveries refers to the patients whose COVID-19 test came back negative after having received a positive diagnosis earlier. The COVID-19 tests detect either SARS-CoV-2 genes or the antibodies that our immune system produces to fight the virus. Existing COVID-19 tests are not 100% accurate, and they can give false results. To mitigate this, a lot of hospitals discharge patients only after they receive a couple of consecutive negative test results. If hospitals are low on resources, they might discharge patients with fewer negative test results. As a side note, experts point to these false-negative test results as the reason for reported reinfection cases.

Additionally, all of the reported numbers are aggregates of numbers reported by all hospitals in a country. Depending on how timely the hospitals are reporting the above data, events can be attributed to incorrect days.

COVID-19 map of the USA highlighting case and death counts for the state of Missouri on April 4, 2020. Source: LiveScience

So, when you look at a map like this and see the counts, you should read it in the following way:

“According to the reports received from hospitals in Missouri, which might be delayed, out of all COVID-19 tests that have finished processing in the state of Missouri since testing has been started, 2,113 of the tests returned positive results. So far, 31 patients that have previously tested positive for COVID-19 passed away.”

DATA TIMELINE

Did you notice something odd when we were reading about the cases in Missouri? When we are talking about cases and deaths for a given day, we are referring to events that occurred in the past:

The positive test results that were announced on April 4 most probably belong to people who started having symptoms a few days earlier. Considering how long it takes on average for people to show COVID-19 symptoms (the median is estimated to be 5.1 days, and 97.5% of those who develop symptoms do so within 11.5 days), these people probably contracted the virus within roughly the previous two weeks.
The patients that passed away on April 4 are among the individuals that had previously tested positive. Because the number of days these patients were fighting COVID-19 is not announced, we do not know when they contracted COVID-19 and when they were counted as an official case.

In a nutshell, numbers that are announced as belonging to the same day do not refer to events (such as contracting the virus) that happened on the same day. This might especially result in an incorrect death rate: When the death rate is calculated, all data that is available until that day might be used. That includes cases that resulted in deaths or recoveries as well as ongoing cases. This timeline discrepancy, combined with the disparity in testing, makes the death rate a very hard metric to calculate accurately.
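To see how sensitive the death rate is to what goes into the denominator, here is a minimal Python sketch. It uses the Missouri counts quoted above (2,113 cases and 31 deaths). The recovery count is a made-up value for illustration only, and the function names are hypothetical. Neither number it prints should be read as the real death rate.

```python
def naive_death_rate(deaths, cases):
    """Deaths divided by all reported cases, including ongoing ones."""
    return deaths / cases


def closed_case_death_rate(deaths, recoveries):
    """Deaths divided by cases that already have a known outcome."""
    return deaths / (deaths + recoveries)


# Missouri counts from the map above (April 4, 2020).
cases, deaths = 2113, 31
# Hypothetical recovery count, assumed only for this illustration.
recoveries = 100

naive = naive_death_rate(deaths, cases)
closed = closed_case_death_rate(deaths, recoveries)

print(f"Naive rate (all cases):      {naive:.1%}")
print(f"Closed-case rate (outcomes): {closed:.1%}")
```

The same deaths produce very different rates depending on whether ongoing cases are counted. The first calculation can understate the rate because ongoing cases have no outcome yet. The second can overstate it when recoveries are confirmed more slowly than deaths.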

DIFFERENT COUNTRIES

Since the beginning of the COVID-19 pandemic, every country on earth has feared the first case being detected within its borders. At the time of writing, most countries on earth have announced COVID-19 cases. However, every country has been reporting different rates of infection and death.

This variance in infection and death rates might stem from geographic, socio-economic, and political differences between countries. As an example, SARS-CoV-2 spreads more quickly in densely populated areas, where people live close to each other. Therefore, when a country quickly makes difficult decisions to enforce social distancing and travel restrictions, the spread of the virus can be slowed down rapidly as well.

To be able to reason about the effectiveness and necessity of these measures for a country, it is crucial to consider the following when analyzing data from different countries:

The number of tests: As we covered earlier, the number of tests that are done directly affects the number of cases that are found. Also, as more testing becomes available, more of the mild cases that do not end in deaths are added to the case count. As a result, when a country tests more individuals and finds more cases, the death rate for that country can go down. This is supported by the death rates that are coming out of countries like Germany and South Korea, where testing is more widespread than in other countries.

Population: The numbers that are announced are given as case and death numbers for a given country. However, countries have drastically different populations. To underline this: Wuhan is a city in China with a population of 11 million people. It is in the province of Hubei, which has a population of 60 million. For comparison, that is larger than England (55 million) and a little smaller than France (67 million). On the other hand, Switzerland has a population of 8 million people, just like New York City. Given these differences in population, comparing the total number of cases across countries does not represent reality well. Instead, looking at the number of cases per 1 million people paints a more accurate picture. At the time of writing, the number of cases per 1 million people is 937 in the United States, 2,061 in Italy, and 2,699 in Spain. While this number is useful to understand the past, if we want to reason about the future, the rate of growth for cases and deaths is a better metric. It is important to keep in mind that the growth rate is also affected by a lot of factors, such as social distancing and population density.
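The per-million normalization described above can be sketched in a few lines of Python. The two regions here are hypothetical: their populations echo the sizes mentioned in the text (60 million and 8 million), but the case counts are invented for illustration, and `cases_per_million` is a name chosen for this example.

```python
def cases_per_million(cases, population):
    """Scale a raw case count to cases per 1 million residents."""
    return cases / population * 1_000_000


# Hypothetical regions: the case counts are invented for illustration.
regions = {
    "Region A": {"cases": 50_000, "population": 60_000_000},
    "Region B": {"cases": 20_000, "population": 8_000_000},
}

per_million = {
    name: cases_per_million(r["cases"], r["population"])
    for name, r in regions.items()
}

for name, value in per_million.items():
    print(f"{name}: {value:,.0f} cases per 1 million people")
```

In this made-up example, Region A has more cases in total, but Region B has about three times as many cases per 1 million people.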

COVID-19 incubation period: According to the latest studies, 97.5% of the individuals that show COVID-19 symptoms develop them within 11.5 days. After developing symptoms, individuals might get tested and counted as a case. So, the effects of measures such as lockdowns start appearing in the data about two weeks after they are enforced. This delay was seen in China, Italy, and California: About two weeks after the lockdown was implemented, the rate of growth for cases and deaths started slowing down. So when evaluating the effectiveness of measures, it is essential to take this delay into account.

Healthcare system capacity: In severe cases, COVID-19 causes lung tissue damage, making it difficult for patients to breathe. In addition, SARS-CoV-2 infects our own immune cells, causing them to attack our healthy tissues. Therefore, patients whose bodies cannot withstand these symptoms need to be admitted to critical care, requiring a bed in the ICU and, in most cases, ventilators to keep breathing. When an outbreak spreads too quickly, too many patients might require these facilities at the same time, overloading the capacities of hospitals. To explain this with the numbers from New York at the time of writing: It was reported that there were 2,200 ventilators in the state’s stockpile. Every day 350 new patients were admitted for critical care, needing ventilators. This means all ventilators would be in use in about one week if new admissions stayed constant and no ventilators were freed up in the meantime. If patients need to be connected to ventilators for more than one week, then there will not be any ventilators available for new patients. In a situation like this, the lack of capacity in hospitals causes deaths that might not have happened otherwise, in effect increasing the death rate for the country.
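The ventilator arithmetic above can be checked with a small Python sketch. The stockpile and admission figures come from the text. The lengths of stay (14 and 5 days) are hypothetical values picked for illustration, and the model assumes constant admissions and that every patient uses a ventilator for exactly the same number of days, which real hospitals will not match.

```python
def first_day_of_shortage(stockpile, admissions_per_day, days_on_ventilator, horizon=60):
    """Return the first day on which demand exceeds the stockpile, or None.

    Assumes constant daily admissions and that every patient occupies a
    ventilator for exactly `days_on_ventilator` days.
    """
    for day in range(1, horizon + 1):
        # Patients admitted within the last `days_on_ventilator` days
        # are still connected on this day.
        in_use = admissions_per_day * min(day, days_on_ventilator)
        if in_use > stockpile:
            return day
    return None


STOCKPILE = 2200           # ventilators reported in the stockpile
ADMISSIONS_PER_DAY = 350   # new critical-care patients per day

# Simple division: how many days of admissions the stockpile covers.
days_covered = STOCKPILE / ADMISSIONS_PER_DAY
print(f"Stockpile covers about {days_covered:.1f} days of admissions")

# Hypothetical stay lengths, assumed for illustration only.
long_stay = first_day_of_shortage(STOCKPILE, ADMISSIONS_PER_DAY, days_on_ventilator=14)
short_stay = first_day_of_shortage(STOCKPILE, ADMISSIONS_PER_DAY, days_on_ventilator=5)
print(long_stay, short_stay)
```

Under these assumptions, a two-week stay exhausts the stockpile within the first week, while a five-day stay never does. That is why the length of time patients spend on ventilators matters as much as the number of new admissions.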

Age distribution and health: Lastly, existing data suggests that COVID-19 is more deadly for older individuals. There might be a couple of reasons for this: First, older individuals’ bodies might be worse at withstanding the physical stresses caused by COVID-19. Second, when hospitals are overloaded and ventilators become scarce, younger patients are often prioritized, increasing the death rate for older patients. As a result, the age distribution of a country can affect the death rate that is attributed to that country. Furthermore, for younger individuals, there might be many factors, such as the amount of virus they were exposed to, their overall health, and their genetic composition, that might determine the outcome of COVID-19 for them. As of right now, we do not have enough knowledge to understand the importance of these factors. So if younger individuals in a country are more vulnerable due to a factor that we do not know, this would not be predicted by looking at the numbers from other countries.

When you analyze the data from different countries and try to reach a conclusion about the state of things in your own area, it is very important to keep these details in mind.

GENETIC DATA

SARS-CoV-2 continues to spread rapidly around the world. At the time of writing, the number of cases surpassed 1 million. Labs around the world are collecting genetic data about SARS-CoV-2 and there are publicly available datasets and graphs where people can examine those. Similar to the numbers collected about the spread of COVID-19, it is very important to understand this genetic data before drawing conclusions.

To give an example: There is data showing that during its spread, SARS-CoV-2 is undergoing genetic mutations, and some articles suggest that these mutations might change the infectiousness and deadliness of the virus, as well as rendering immunity gained against the virus useless. The map below shows the journey of SARS-CoV-2 using samples from around the world.

World map visualizing the transmission of SARS-CoV-2. Source: Nextstrain

Before reaching conclusions and getting anxious because of a map, it is important to learn the details. Genetic data about SARS-CoV-2 is created by sequencing the genetic codes present in the viruses that are taken from individuals. The RNA molecule in SARS-CoV-2 carries the viral genome it uses to reproduce. The RNA molecule is surrounded by a nucleocapsid protein layer, which is covered by a lipid membrane, which is studded by spike proteins. (This is why washing our hands with soap protects us against COVID-19: Soap breaks the lipid membrane apart, destroying the virus.) SARS-CoV-2 uses these spike proteins to bind to epithelial cells in our lungs, infecting them and using them to copy itself.

Structure of SARS-CoV-2. Source: Economist. Illustrator: Manuel Bortoletti

An individual acquires immunity against SARS-CoV-2 when the antibodies produced by their immune system learn to bind to the spike proteins of the virus. When the spike proteins get covered by antibodies, the virus cannot infect lung cells and copy itself. Therefore, for SARS-CoV-2 to escape immunity, it would need to mutate in a way that alters its spike proteins. Is this possible? To understand that, we need to understand how viruses mutate.

At a high level: Once a virus infects a cell, it starts replicating virus parts using the cell as a factory. The blueprints for its parts are encoded in the RNA it carries. The replication process is error-prone, especially when RNA is used. These random errors that happen during replication might have some or no effect on the structure or function of a newly created virus. If the errors do have an effect, it can be negative or positive for the virus. Some errors can change its structure or function, preventing the new virus from spreading further or helping it to escape immunity. It is important to note that, because these errors happen randomly, it is more likely that a given error will not result in a significant change.

When these details about how viruses behave are taken into account, and they are not even close to being complete as far as knowledge on viruses goes, the world map above starts looking less scary. In the map, different samples of SARS-CoV-2 are shown as distinct lines. However, the distinct lines do not imply a difference in the structure or behavior of the SARS-CoV-2 in these samples. In fact, in the original version of the diagram below, which uses the same data, you can hover over the dots and see if different samples had any nucleotide mutations.

Diagram visualizing different SARS-CoV-2 samples. Source: Nextstrain.

Going back to the question about whether SARS-CoV-2 can mutate to escape immunity or become more deadly: The answer seems to be yes, these mutations are possible. However, the colorful charts don’t imply that they have already happened, and it is essential to understand how likely they are to happen, since this ability changes from virus to virus. For what it’s worth, SARS-CoV-2 might be very bad at mutating its spike protein. To get definitive answers, we will have to wait for new research to come out. And for this, we need more time.

RESEARCH AND TIME

The COVID-19 outbreak is a fast-evolving, complicated, and unprecedented situation. To be able to reason about complex situations like this pandemic, we rely on scientific research. In all fields that are relevant to COVID-19, such as mathematics, statistics, economics, data science, machine learning, biology, genetics, virology, chemistry, and medicine, experts use the scientific method to find answers to open questions. As an example, when a virology research team wants to find the answer to a question (“How is SARS-CoV-2 affected by temperature?”), they conduct many experiments in their lab and keep a record of the outcomes. A lot of experiments are repeated multiple times, and with different parameters, to increase confidence in the collected data. In the end, researchers write an article outlining the techniques they used, the data they collected, and the conclusions they derived from their data. Before the article can get published, other experts in the field review it, examining whether the researchers used proper techniques, conducted experiments correctly, and interpreted the data accurately. If approved by the reviewers, the article gets published, becoming part of public knowledge.

Because we have to understand and fight COVID-19 quickly, research that examines COVID-19’s spread and studies SARS-CoV-2 progresses faster than usual. This speed has advantages and disadvantages.

Research that is conducted and published quickly is helping us control the COVID-19 outbreak. There are some great examples of how agile research can help. SARS-CoV-2 tests that were swiftly developed helped us identify and isolate cases. Models built to predict the effects of social distancing helped us slow down the spread. New treatments are helping ease the symptoms for patients. And rapid research on vaccines is giving us hope about eradicating COVID-19.

On the other hand, articles that are published quickly and without review by other experts can include mistakes. Research that normally relies on data from multiple experiments can use data from a single or just a couple of experiments, making it more likely that the conclusions are incorrect. As a result, it is very important to continue questioning and reviewing these articles after they become part of public knowledge. Moreover, it is very important to keep in mind that research experiments are done under specific lab settings; therefore context and expertise are required to draw the correct conclusions from research articles. News articles and other content that gets created without consulting experts can reach incorrect conclusions, misleading the public.

The best example of this is the article that examines the stability of SARS-CoV-2 in aerosols and on various surfaces. To examine the stability in the air, the researchers injected an aerosol containing SARS-CoV-2 into a Goldberg drum, which is basically a closed 40-liter barrel, and expressed the amount of virus as a 50% tissue-culture infectious dose (TCID₅₀). Then they measured how much of the SARS-CoV-2 remained viable over time.

Viability of SARS-CoV-1 and SARS-CoV-2 in Aerosols and on Various Surfaces. Source: NEJM

Some news articles reported that SARS-CoV-2 stays viable in air for 3 hours, using this article as a source. This conclusion is not correct. The most we can say by looking at this research is: “In a 40-liter closed drum that was injected with SARS-CoV-2 aerosol, the number of viruses decreased by ten times during the 3 hours the experiment was conducted. The number of viruses did not decrease to 0.” It is not correct to draw the conclusion that SARS-CoV-2 stays viable in the open air after 3 hours by looking at this article. To be able to say that, we would need to conduct experiments in the open air under various conditions.

This, of course, does not mean that this research article is incorrect or insignificant. On the contrary, this research is essential to understand the stability of SARS-CoV-2 on different surfaces. As an outcome of this research, there may be plans to use copper more often for surfaces in hospitals. The important point is, it is vital to draw the correct conclusions from research articles and divert our limited time and resources towards the right problems.

CONCLUSION

In this article, we covered what kind of deficiencies exist in COVID-19 and SARS-CoV-2 data and what we need to pay attention to when we are examining this data. Understanding how data is collected and what timeframe it refers to is essential for correct use. Articles that are written without care for details might draw wrong conclusions and spread incorrect information to the public.

The difficulty in using COVID-19 data should not deter us from investigating, analyzing, and reporting. It is essential that we keep doing research to keep the outbreak under control, to develop treatments and vaccines. Acknowledging the deficiencies in COVID-19 data and the nuances required to work with it would allow us to do our part more skillfully.

So what should you do?

The best thing you can do is to recognize the complexity of the situation we are dealing with and the fact that we do not have all the data to reason about it yet. Therefore, it can be vital for you and the people around you that you listen to the advice from experts, choose to stay on the safe side, and sometimes do things that might seem like overkill. In addition, if you are creating content, it would be wise to constantly question the data and the conclusions you are deriving from it. To give some specific suggestions:

If you are a reporter and you want to cover a research article, you can get more context and learn about the important details to underline by talking to an expert in the field.
Before you share content created by others, you can check if it was created by consulting an expert or whether it is supported by experts.
If you are younger, you might be thinking that COVID-19 is not dangerous for you. Taking the deficiencies in available data into account, you might want to rethink that assumption and take precautions that might save your life.
SARS-CoV-2 can last on various surfaces for some period of time. Even if there is no exact data about the surfaces you interact with in daily life, it is pretty conclusive that SARS-CoV-2 stays viable on surfaces. This is an example of a case where we do not need the most accurate data. We can lower our risk of getting infected by simply washing our hands with soap more regularly.

To conclude, during these unprecedented and highly complex times, questioning the data you are looking at can save your life. In a fast-evolving situation that will take time to understand fully, being cautious and safe might be the right choice. So, be cautious, question the things you read, question your own thoughts. Wash your hands. Stay home and stay safe.

I wrote this article to improve data literacy around COVID-19. In addition to listening to experts, we as individuals have to be able to reason about this situation that is affecting our lives to protect ourselves. This can mean life or death for many people. To help spread literacy around COVID-19, please share. This article has been translated into the following languages as well:

Turkish (written by the author)
German (proofread by the author)
Italian (layout and images provided by the author)

You can contact me at contact@denizaltinbuken.com if you would like to translate it into another language.

Data sources

If you are interested in analyzing the data for yourself, you can use publicly available data to create your graphs and models. This exercise will show you how difficult it is to use COVID-19 data and how easy it is to create unreadable graphs like the one below. Below is a list of sources I use that are available on GitHub:

World: https://github.com/CSSEGISandData/COVID-19
The United States: https://github.com/nytimes/covid-19-data

Graph showing the number of cases for every country on earth using unstructured data. Data source: JHU CSSE. Graph by Deniz Altinbuken
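As a starting point for working with such a file, here is a minimal sketch using only Python's standard library. The column layout (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date with cumulative counts) is an assumption about how the time-series files in the JHU CSSE repository are organized, and the rows below are invented sample data, not real counts. Check the repository for the current file names and format before relying on it.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in the assumed wide layout; not real data.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,4/2/20,4/3/20,4/4/20
,Country A,0.0,0.0,100,150,225
North,Country B,1.0,1.0,10,20,40
South,Country B,2.0,2.0,5,5,10
"""

META_COLUMNS = {"Province/State", "Country/Region", "Lat", "Long"}


def cumulative_by_country(csv_text):
    """Sum cumulative counts over provinces, per country and date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    dates = [c for c in reader.fieldnames if c not in META_COLUMNS]
    totals = defaultdict(lambda: dict.fromkeys(dates, 0))
    for row in reader:
        for date in dates:
            totals[row["Country/Region"]][date] += int(row[date])
    return dict(totals)


def daily_new(cumulative):
    """Turn a cumulative series into day-over-day increases."""
    values = list(cumulative.values())
    return [later - earlier for earlier, later in zip(values, values[1:])]


totals = cumulative_by_country(SAMPLE_CSV)
print(totals["Country B"])
print(daily_new(totals["Country A"]))
```

Summing provinces into one row per country and converting cumulative counts into daily increases are usually the first two steps before any plotting. Keep in mind that these are counts by reporting date, with all the caveats discussed in this article.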

Disclaimer

I’m not an expert in virology, medicine, biology, genetics, economics, sociology, politics, or statistics. I have a Ph.D. in distributed computer systems, and I’m an expert on building, understanding, simplifying, and improving complex systems using coding, data analysis, and testing. I have been doing research and working with machine learning and data analysis for many years. This article was not written to critique any specific person or institution. It is intended to be objective and to underline the complexity of the system we are dealing with and the nuances in available data. It is written using data and sources available on April 4, 2020. If you find a mistake, you can send me an e-mail at contact@denizaltinbuken.com.

Sources

The content of this article relies on hundreds of articles I read in the past months and the publicly available data I analyzed for many hours. I wrote this article as a research paper. I’ve listed the sources that I remember reading here, but this is not an exhaustive list.