How Is COVID-19 Case Data Collected?

The journey from a test swab to a record in a database

Amanda Makulec
Jul 17, 2020

This week, health informatics became a hot topic in the US as the responsibility for collecting COVID-19 case data shifted from the CDC to the US Department of Health and Human Services.

I have worked on health information system strengthening projects in other countries. In the past, I’ve been met with blank stares when talking about the importance of robust, interoperable information systems with friends (it’s not exactly cocktail party banter), but perhaps the focus on COVID-19 data can kick off these conversations around the range of processes and systems used to collect health information.

Despite the wide demand for daily case-count updates shared across hundreds of global, country, and local dashboards, the cases you see reported today reflect, even in the best-case scenarios, people who were infected (and potentially infectious to others) days or even more than a week earlier. When analyzing COVID-19 case data, we can better make sense of the noise in daily case counts, or of how new cases lag behind tests administered, by understanding how the data is collected.

In this essay, we’ll look at the data collection timeline and processes through a patient’s perspective and in journey maps from three testing sites around the world in the United States, Canada, and South Africa.

From a Patient’s Perspective

Let’s look at an illustrative timeline from a COVID-19 exposure to receiving a test result in Washington, DC in mid-June.

In late May, two different kinds of gatherings were widely discussed: (1) Memorial Day picnics brought together people in environments where wearing a mask and maintaining distance may be challenging, and (2) Black Lives Matter protests advocating for racial justice and equity grew in scope and scale by early June in the wake of the murder of George Floyd.

Those who attended such gatherings were advised to seek a COVID-19 viral test 5–7 days after the event, after which they would wait for 3–5 days for results. DC significantly scaled up testing sites to meet demand, but even with those efforts many waited in long lines to be tested.

New confirmed cases were logged on the day results were received, which, in this best case without further delays, meant test results logged on day 11 after initial exposure. This timeline leaves a window for further community spread unless the patient has been diligent about self-quarantine.
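To make the arithmetic behind that timeline explicit, here is a minimal sketch that adds up the two delays described above (a test 5–7 days after exposure, then a 3–5 day wait for results). The function name and the exposure date are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta


def result_window(exposure, test_delay=(5, 7), result_delay=(3, 5)):
    """Return the (earliest, latest) dates a result could be logged.

    test_delay and result_delay are (min, max) day ranges taken from the
    guidance described in the text; everything else is an assumption.
    """
    earliest = exposure + timedelta(days=test_delay[0] + result_delay[0])
    latest = exposure + timedelta(days=test_delay[1] + result_delay[1])
    return earliest, latest


# Hypothetical exposure date, used only to show the calculation.
exposure = date(2020, 6, 6)
earliest, latest = result_window(exposure)
print((earliest - exposure).days, (latest - exposure).days)  # 8 12
```

Under these ranges a result lands somewhere between day 8 and day 12 after exposure, so the day 11 in the illustrative timeline sits inside that window. Any additional delay in the lab or in reporting pushes the date later.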

In March, NPR wrote about why it takes so long to get COVID results back. While innovations like pooled sample testing can address issues of lab capacity, in many cases our fragmented information systems further complicate the speed of sharing information.

As reports in the U.S. emerge of test results taking days or even more than a week, today’s “real-time” case counts represent tests from days past. With tests recommended 5–7 days after exposure, the patient is uninformed about their status in the critical 4–6 day window after exposure when viral load may peak. Staying home during that window is highly recommended to prevent possible community spread, even without a confirmed positive test result.

These testing timelines and information system nuances may feel overly technical, but any time we work with a data set as data visualization professionals, we should spend time interrogating and understanding how the data was collected.

Data about COVID-19 is no exception, and even if we’re not creating charts and dashboards with this data, understanding how the data is collected can help us be more informed readers of charts and graphs about the pandemic.

How is COVID-19 case data collected?

To understand how case data is collected, we need to understand what steps happen between the time someone arrives for a viral test and the time their result appears in a database. Who is involved? How many different systems does testing data touch? How many different opportunities are there for delays?

To answer these questions, I interviewed three public health experts involved in COVID-19 testing and reporting to learn about the data collection process in early May. I spoke to experts in locations across three different countries to learn about the differences in how the process works in countries with highly decentralized public health information systems (the United States), primarily provincial public health information systems (Canada), and national health information systems (South Africa).

The information gathered represents illustrative information flows for each location based on one key informant interview, and should not be generalized as representing all sites throughout the given country, state, or province. Many people, policies, and technology innovations can impact the processes and timelines. Even in the two months since these conversations, processes and timelines have likely changed.

Instead of providing generalizable processes, these journey maps give us a glimpse into the work that public health, medical, and laboratory professionals do daily to track the spread of COVID-19, and a better understanding of why the COVID-19 case data can be so messy.

About the Journey Maps

The phases of collecting COVID-19 case data are similar across locations: (1) test swab taken, (2) laboratory processes run, and (3) data making its way into a central database that reports into national systems.

Within those phases, the complexity of the process and the number of people and platforms involved vary across counties, states/provinces, and countries.
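One way to think about these phases when working with the data is as timestamps on a single record. The sketch below is only an illustration: the class and field names are hypothetical and do not describe any of the systems covered in this essay. It shows how the time spent between phases could be measured if each step were timestamped.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CaseRecord:
    swab_taken: datetime   # phase 1: test swab taken
    lab_result: datetime   # phase 2: laboratory processes complete
    in_database: datetime  # phase 3: record reaches the central database

    def phase_hours(self):
        """Hours elapsed between consecutive phases."""
        def hours(delta):
            return delta.total_seconds() / 3600

        return {
            "swab_to_lab_result": hours(self.lab_result - self.swab_taken),
            "lab_result_to_database": hours(self.in_database - self.lab_result),
        }


# Made-up timestamps for illustration only.
record = CaseRecord(
    swab_taken=datetime(2020, 5, 1, 9, 0),
    lab_result=datetime(2020, 5, 3, 9, 0),
    in_database=datetime(2020, 5, 3, 15, 0),
)
print(record.phase_hours())
```

In practice, many of the handoffs described in the journey maps are manual, so these timestamps may not exist in any single system.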

You may be surprised to see words like “spreadsheet” and “fax,” but the reality is that many public health information systems still rely on these technologies and have significant reliance on public health, laboratory, and medical professionals for data entry and transmission.

The journey maps reflect the process for late April and early May, approximately three months after the first cases were confirmed in the U.S. and Canada and less than two months from the first case in South Africa.

Interviews were conducted through a convenience sample, focused on individuals I knew who were working on the response. Note that in the early days and weeks, these processes were far less efficient as testing and data capture processes were put into place, and over the last two months since the interviews were conducted, the processes and timelines may have evolved.

The number of days to render lab results or notify a patient can vary significantly based on test volume, so timing estimates are presented in the illustrations only when the duration was somewhat consistent at the end of April.

Rockford, Illinois, United States of America

Population: 282,572 (2019 estimate, US Census Bureau)

Key attributes: Rockford, IL is located in Winnebago County, on the border of Illinois and Wisconsin, and had two drive-through testing sites as of early May. The city is around 90 miles northwest of Chicago.

Background: In the U.S., the health information systems are highly decentralized, managed at the county and state level. The decentralized systems are one reason compiling and comparing data across US states can be challenging. While the U.S. has a robust system for influenza surveillance and a National Notifiable Diseases Surveillance System that is undergoing a multi-year modernization, and has seen widespread adoption of electronic medical records, we still operate with many information silos.

The Auburn High School drive-through testing site operated by Crusader Clinic was set up under the direction of the state governor. A second testing site, at the UIC School of Medicine, was the first in Rockford, and is staffed by the National Guard. The journey map here represents the Auburn High School drive-through site.

Notable nuances: In May, approximately half of the patients being tested at the testing site were residents of other states. Many were Wisconsin residents who cross the state border daily for work and came on their lunch break for testing. For patients with an out-of-state address, the test results need to be manually shared with the county of residence for the purpose of contact tracing and follow-up, which adds steps to the data collection and reporting process.

As of early May 2020, the team was conducting 1,100 tests per week. As noted in the journey map, samples are not tested locally and instead are sent to a lab in Springfield daily. Samples include a label with unique patient identifiers but are not barcoded. When the site first opened, it took 10 days to get a result from the state lab. By early May, the time had been reduced to two to three days, though it can vary depending on lab capacity.

Kingston, Ontario, Canada

Population: 117,660 (2016 Census, Statistics Canada)

Key attributes: Kingston has one of the largest prison populations in the province, but at the time of the interview had not had a large outbreak within the prison. In late June, two correctional officers tested positive, but rapid contact tracing and isolation helped minimize the risk to the wider prison population. In addition, the city has a large university.

Background: In Canada, COVID-19 was added as a reportable disease in January 2020 and added to iPHIS, a health information reporting system created after the SARS outbreak in 2003.

“SARS fundamentally changed how we do public health in Canada,” said Allison Maier MPH. Toronto had one of the largest SARS case counts in the world outside of Hong Kong. Following the SARS outbreak, the Public Health Agency of Canada was created, including the PHAC National Laboratory in Winnipeg, concentrating public health leadership at the provincial level.

As a result, Canada operates with 13 separate health care systems, one for each province and territory. The Federal government leads care related to the Military and Indigenous populations. The iPHIS platform is used as the infectious disease information system for all reportable diseases across provinces, and works well for case management and as a centralized source of all cases and exposures, but not for contact tracing, which remains a primarily paper-based exercise.

At the beginning of the COVID-19 response, from January to March, all samples were tested at the National Laboratory in Winnipeg. Ontario and then British Columbia were the first allowed to run the tests at the provincial level, increasing laboratory capacity.

Notable nuances: In Kingston, a key to their success in collecting quality data was a training nurse who has used iPHIS for over a decade and is passionate about data quality. People, not technology, were critical to early success in collecting and reporting timely, accurate, complete data. The public health officials also partnered closely with the paramedics, and worked to manage personal protective equipment consistently.

Testing criteria in Canada limited who could be tested from January to March in the early days of the pandemic. This same challenge is felt across countries looking to understand the initial spread of infection, as availability of test kits was limited. In early April, Canada expanded the criteria for who could be tested to include behavioral exposures, including health care work, long-term care facilities or similar settings, close contact with a case, and travel risk.

Gauteng Province, South Africa

Population: 15.2 million (2019 estimates, Statistics South Africa)

Key attributes: Gauteng is the most populous province in South Africa, home to both Johannesburg, the commercial hub, and Pretoria, the national capital.

Background: In South Africa, COVID-19 case data is collected and managed through a mobile application integrated with the national health information system. The timeline from a community health worker arriving on your doorstep to having a test result in a national database was reported to be less than 48 hours. In countries like the US with more fragmented, siloed systems, information moves through more manual flows.

The mobile data collection application is built on top of the DHIS2 COVID-19 Surveillance Digital Data package, which has also been adapted and deployed in 27 other countries (with 22 other countries in development). DHIS2 is also the open source information system platform used for managing routine health information and has been adapted for other disease specific purposes.

Notable nuances: Investments in the National Health Information System support rapid response through technology. Having a DHIS2 system in place before COVID-19 created an enabling environment for digital data capture.

In South Africa, the COVID-19 response focuses on testing in hotspots across the country and community outreach. Gauteng had done more tests than any other province at the time of the interview, but the largest hotspots as of early May were in the Western Cape Province, where Cape Town is located. As of July 17, Gauteng province — the commercial and government center of the country — has the highest case count, recently surpassing Western Cape.

The response relies heavily on outreach through community health workers, who are a critical part of the health systems in many countries. In the face of COVID-19, community health worker corps can help countries with less resilient health systems rapidly test and trace infected persons’ contacts. You can learn more about the response in South Africa in the DVS Viz Responsibly interview with Yazabantu Soldati.

What can we learn from mapping these data collection processes?

The variance in information systems even across these three examples is striking. In 2006, Larry Brilliant, who had been on the front lines of smallpox eradication, stepped onto the TED stage and shared a vision for how timely reporting of case data about a new disease could stop a pandemic.

Some of that dream has been realized where there are centralized digital public health information systems: South Africa deployed a mobile application for case management in less than two months. Global disease surveillance systems have been built by WHO. But when the United States — the country with the leading case count globally — has major data delays and information silos, that hampers the timeliness and completeness of information about the pandemic.

Three journey maps from a convenience sample don’t define a generalizable process, which would require many more interviews and follow-ups over time. But the illustrations can help us better understand how COVID-19 case data is collected, with a glimpse into the many people and information systems that are involved.

Understanding who is involved hopefully engenders greater empathy and understanding for the people — the community health workers, public health professionals, lab workers, transporters, and others — who facilitate the process of gathering this information.

As you monitor dashboards and news about COVID-19 cases, remember:

  • Real-time case counts lag behind the current number of infected persons, due to the delays between exposure, onset of symptoms (if any), being tested, and receiving test results. This lag is in addition to the reality that case counts are a function of how many tests are done, which is why metrics like test positivity give us important information on whether we’re testing enough people. Minimizing those gaps where possible and creating efficiencies gets us closer to real-time information. The best thing to do as an individual is to continue to take preventive measures after any exposure risk: stay home, practice physical distancing if out of the house, wash your hands, and wear a mask.
  • Differences in how data is collected can create challenges when comparing outbreaks in different locations, including across states or provinces within a country. In the US, at the time of writing this article, self-reported lag times between having a sample taken and receiving a result varied from less than 24 hours to 14 days. Quest Diagnostics, a leading provider of COVID-19 tests, reported a 7+ day wait for test results unless the person is considered a priority patient. This is one reason why rolling 7-day averages may give more comparable trajectories of COVID-19 than daily case counts.
  • Many public health informatics systems rely on a range of different platforms and tools, as illustrated in these journey maps. How people use the tools and manage the processes impacts the timeliness of the data. Manual data capture and transfer processes also introduce opportunities for data quality issues. Understanding what happens before the case counts appear on COVID Act Now or in the New York Times repository can help us ask better questions of the data.
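Two of the metrics mentioned in the list above, the rolling 7-day average and test positivity, are simple to compute. The sketch below is a minimal illustration in plain Python; the function names are hypothetical and the daily counts are made up to mimic a weekend reporting dip, not drawn from any real data set.

```python
def rolling_mean(values, window=7):
    """Trailing mean over `window` days; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def positivity(positives, tests):
    """Share of tests that came back positive; None when no tests were run."""
    return positives / tests if tests else None


# Made-up daily case counts with a weekend dip (illustrative only).
daily_cases = [120, 130, 125, 140, 135, 60, 55, 128, 138]
print(rolling_mean(daily_cases))
print(positivity(positives=50, tests=1000))  # 0.05
```

The trailing average smooths out day-of-week reporting artifacts, but it does not remove the lag between exposure and reporting described above. It only makes trajectories easier to compare.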

The data is still valuable — lagging data is certainly better than no data at all as we work to combat this pandemic. COVID-19 and the need for rapid information is one of the greatest cases we could make for investments in robust health informatics systems to remove many of the manual steps in sharing data on reportable diseases in many countries, including the United States.

Amanda Makulec is the Senior Data Visualization Lead at Excella and holds a Master of Public Health from the Boston University School of Public Health. She worked with data in global health programs for eight years before joining Excella, where she leads teams and develops user-centered data visualization products for federal, non-profit, and private sector clients. Amanda volunteers as the Operations Director for the Data Visualization Society and is a co-organizer for Data Visualization DC. Find her on Twitter at @abmakulec.

Katherine Haugh is an evaluator and graphic recorder based in Washington, DC.


The Journal of the Data Visualization Society