Availability of COVID Testing Data: The Technical Challenges Ahead

Stephanie Yang
Published in Atlas Insights
Aug 5, 2020

As of August 1st, 2020, total confirmed COVID-19 cases in the U.S. had soared to 4.75 million, with over 150 thousand deaths. Yet these striking numbers do not reveal the full picture, since the count of confirmed cases depends on how many lab tests are performed. In last week’s blog post, “COVID Testing in the US: the Danger Ahead” (Jul 20, 2020), COVID Atlas fellow Ryan Wang emphasized that testing remains an integral part of estimating and controlling the spread of COVID-19, and that the COVID Atlas Team is working on building a comprehensive county-level testing database.

This post provides an update on the construction of the database and discusses several challenges we are facing.

County-level Testing Data Sources

As noted in last week’s blog post, there is neither a unified definition of testing nor a comprehensive source of testing data at the county level. To construct a more comprehensive database, we rely on data from three sources: Corona Data Scraper, Worldometer, and web crawlers.

Corona Data Scraper provides time-series lab testing data by county in CSV format and refreshes daily. Currently, it updates testing data for twelve states, including Arkansas, California, Florida, Illinois, Louisiana, Nebraska, Nevada, North Dakota, Oregon, Tennessee, and Wisconsin. Worldometer provides cumulative testing data as of the current date and covers New Jersey, New York, Pennsylvania, Texas, and Washington.

In addition to these two sources, we are implementing web crawlers that scrape data from state public health department websites. So far, scripts for Indiana and New Mexico are in the testing phase. Five other states (colored dark grey in Plot 1) report data through interactive visualization tools built on Tableau or Power BI.

Plot 1. Sources of Testing Data by State

Issues in Data Sources and Challenges in the Cleaning Process

Beyond their apparent incompleteness, the existing data have three noteworthy issues that need to be addressed before further research can proceed.

First, the large proportion of unassigned cases raises questions about the reliability of each county’s reporting. The Worldometer data frame has an additional “unassigned cases” row for each state, covering all tests within the state that cannot be assigned to a particular county. These tests may come from counties that do not report testing data, but they may also be values missing from counties that do publish testing data.

For states with an overwhelming proportion of unassigned cases, it is unclear to what extent the reported county data reflect reality. For example, Ohio had reported a cumulative 1.55 million tests as of yesterday, of which 1.5 million are unassigned; the county with the most tests, Lucas, reports 37,182 in total. Ohio’s statewide ratio of total confirmed cases to total tests is 6.13%, while that of Lucas is 13.54%. The situation in Lucas may indeed be far more critical than the state average, but it is also possible that Lucas’s ratio is inflated by an underestimated denominator. If so, estimates of test positivity based on these data become less informative. (Meanwhile, Corona Data Scraper does not include unassigned cases. How it handles them is worth studying as well.)

Second, data retrieved from Corona Data Scraper switch between cumulative test counts and single-day increases, which calls for substantial human intervention at this stage. Checking whether a series is monotonically increasing can help identify the problem, although recounts sometimes cause slight drops in a cumulative series as well. A longer observation period is needed before we can implement fully automated error handling.
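A minimal sketch of such a monotonicity check follows. It is not the team’s actual cleaning code: the function name, the 5% `recount_tolerance` threshold, and the sample series are all assumptions made for illustration, and flagged points would still need human review.

```python
def flag_drops(values, recount_tolerance=0.05):
    """Label each day-over-day decrease in a series expected to be cumulative.

    Returns (index, label) pairs. A drop of at most recount_tolerance
    (relative to the previous value) is labeled "recount"; a larger drop is
    labeled "suspect", since it may mark a switch to single-day counts.
    Assumes non-negative counts, so the previous value is positive whenever
    a drop occurs.
    """
    flags = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if cur < prev:
            relative_drop = (prev - cur) / prev
            label = "recount" if relative_drop <= recount_tolerance else "suspect"
            flags.append((i, label))
    return flags


# Made-up series: a running total with one small recount (index 3) that
# switches to single-day counts from index 5 onward.
series = [100, 250, 400, 395, 600, 180, 210, 190]
print(flag_drops(series))
```

A check like this only catches switches that produce a visible drop; a series that reports daily increases from the start, or daily counts that happen to keep rising, would pass unnoticed.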

Lastly, the definition of “testing” remains inconsistent among states. Last week’s blog post illustrated this inconsistency with examples; below we list five different scenarios.

1. Total number of tests performed. For example, New Mexico is reporting a “total tests” number.

2. Serology/antibody and PCR tests. States such as Michigan and Georgia are reporting both PCR and antibody testing numbers.

3. PCR tests only. New Jersey and other states do not report antibody testing data and label their testing numbers as PCR tests only.

4. Government lab testing and commercial testing. Some states also separate tests by the type of institution that performed them.

5. “Tested persons.” States may also report the total number of persons tested over a given period. All tests on the same person are counted as one.
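One way to keep these scenarios from being mixed silently is to tag each state’s counts with the definition it uses before any comparison. The sketch below is an assumption about how that could look, not the schema of the COVID Atlas database. It labels only the states named in the list above, and it treats the five scenarios as mutually exclusive, which is a simplification.

```python
from enum import Enum


class CountDefinition(Enum):
    TOTAL_TESTS = "total tests performed"
    PCR_AND_ANTIBODY = "PCR and serology/antibody tests"
    PCR_ONLY = "PCR tests only"
    BY_LAB_TYPE = "government and commercial lab tests reported separately"
    PERSONS_TESTED = "unique persons tested"


# Only the states named in the list above; all others are left unlabeled.
STATE_DEFINITIONS = {
    "New Mexico": CountDefinition.TOTAL_TESTS,
    "Michigan": CountDefinition.PCR_AND_ANTIBODY,
    "Georgia": CountDefinition.PCR_AND_ANTIBODY,
    "New Jersey": CountDefinition.PCR_ONLY,
}


def comparable(state_a, state_b):
    """True only when both states are labeled with the same definition."""
    def_a = STATE_DEFINITIONS.get(state_a)
    def_b = STATE_DEFINITIONS.get(state_b)
    return def_a is not None and def_a == def_b


print(comparable("Michigan", "Georgia"))     # True
print(comparable("Michigan", "New Jersey"))  # False
```

Returning `False` for unlabeled states is a deliberately conservative choice: a state whose definition has not been checked is treated as not comparable to anything.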

Some states are transparent about how “testing” data are defined. Yet the variability across definitions, states, and third-party data scrapers poses a clear challenge to the data standardization that future work will require.
