Taking a systematic approach to data integrity

How do we identify irregularities across multiple sources of data?

Qiwei Lin
Atlas Insights
4 min read · Dec 14, 2021

As a data engineering research assistant at the Healthy Regions & Policies Lab, I have been involved in maintaining and developing the data processing infrastructure for the US Covid Atlas. In this post, I will discuss some ongoing work on data integrity checks for the Atlas.

Overview of the Data-Pulling Pipeline in the US Covid Atlas

The US Covid Atlas now has an automatic data-pulling pipeline that updates 31 datasets daily from multiple sources, such as the CDC, USA Facts, The New York Times, and 1Point3Acres. This collection of frequently updated data includes county- or state-level counts of confirmed Covid cases, deaths, tests, and vaccinations. Currently, these datasets are stored as CSV (Comma-Separated Values) files. The CSV files are all in wide format: each row represents the time series of a variable observed for a locality (a state or a county), uniquely identified by its FIPS code.
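
To make that structure concrete, here is a minimal sketch of loading one of these wide-format CSVs with pandas and reshaping it into long format for analysis. The file name and column names are illustrative assumptions, not the Atlas's actual schema.

```python
import pandas as pd

# Each row is one locality keyed by FIPS code; the remaining columns are dates.
# The file and column names below are assumptions for illustration.
wide = pd.read_csv("covid_confirmed_usafacts.csv", dtype={"countyFIPS": str})

# Reshape to long format: one (fips, date, value) observation per row.
long = wide.melt(id_vars=["countyFIPS"], var_name="date", value_name="confirmed")
long["date"] = pd.to_datetime(long["date"], errors="coerce")   # non-date columns become NaT
long = long.dropna(subset=["date"]).sort_values(["countyFIPS", "date"])
```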

Data Integrity Check

The goal of our data integrity check is to develop automatic data monitoring processes that generate time-series plots for quick exploratory data analysis and alert developers to data irregularities. Data irregularities can include non-monotonic trends in cumulative count data, as well as temporal and spatial outliers. To do this, we built an interactive dashboard using Dash and Plotly in Python, and we are also exploring other data visualization options, such as Observable notebooks (see examples of how we’ve used Observable in the past here). While still in development, the screenshot below shows the prototype of our interactive dashboard for data integrity checks.
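
As a rough sketch of what such a dashboard looks like in code, the Dash skeleton below wires up the main controls and panels. The component IDs and options are placeholders, not the production layout.

```python
from dash import Dash, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    html.Div([
        # Variable selector; the option list here is just an example.
        dcc.Dropdown(id="variable",
                     options=[{"label": v, "value": v} for v in ["confirmed", "deaths"]],
                     value="confirmed"),
        dcc.Dropdown(id="fips"),                          # locality options filled in by a callback
        dcc.Graph(id="seven-day-average"),                # 7-day average line chart
        dcc.RangeSlider(id="time-range", min=0, max=1),   # reset to the dataset's date range
    ]),
    html.Div(id="diagnostics"),                           # monotonicity and outlier results
])

if __name__ == "__main__":
    app.run_server(debug=True)
```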

Users can select a variable of interest and a specific locality (using state or county FIPS codes) to investigate. The time range slider at the bottom of the left panel also lets users zoom into a narrower time window. When users select a new variable, the corresponding dataset is loaded, and the locality options and time slider are updated automatically as well. These three parameters are then used to subset the data from which we generate summary results and diagnostics.
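
A hedged sketch of that interaction, continuing the skeleton above: selecting a new variable reloads the corresponding dataset and refreshes the locality options and the time slider. The load_long_format helper is a hypothetical stand-in for however the selected CSV is read and reshaped.

```python
from dash import Input, Output

@app.callback(
    Output("fips", "options"),
    Output("time-range", "min"),
    Output("time-range", "max"),
    Input("variable", "value"),
)
def update_controls(variable):
    # load_long_format() is a hypothetical helper that reads and reshapes the
    # CSV for the selected variable (see the pandas sketch earlier).
    df = load_long_format(variable)
    fips_options = [{"label": f, "value": f} for f in sorted(df["countyFIPS"].unique())]
    # Expose dates as ordinal positions so the RangeSlider can index them.
    return fips_options, 0, df["date"].nunique() - 1
```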

By default, the dashboard presents a 7-day average time series line chart on the left panel and a set of diagnostics on the right panel. The 7-day average line chart summarizes trends in the time series of interest. The results of monotonicity and outlier detection inform us of irregularities that we need to investigate further.
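
For reference, a 7-day average for a single locality can be computed from the long-format frame in the first sketch with a simple rolling mean. The version below smooths the daily differences of a cumulative series; the FIPS code is just an example, and whether the chart uses the raw or differenced series depends on the variable.

```python
# 7-day rolling average of daily changes for one county (Cook County, IL).
cook = long[long["countyFIPS"] == "17031"].set_index("date")["confirmed"]
seven_day_avg = cook.diff().rolling(window=7).mean()
```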

We use lagged differences to detect non-monotonicity in datasets of cumulative counts. Our initial check suggests that non-monotonicity is common. For example, the time series of all localities exhibit non-monotonicity in the cumulative Covid cases data from The New York Times. These irregularities are not always indications of errors in our data infrastructure: a decrease in a cumulative count can occur when upstream data curators issue corrections. That said, we still need to properly document all these irregularities to support the development of our data infrastructure and the distribution of data.
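
In pandas, the lagged-difference check reduces to flagging dates where the day-over-day difference of the cumulative series is negative, roughly like this (building on the single-locality series from the previous sketch):

```python
# Non-monotonic change points: dates where the cumulative count decreased.
diffs = cook.diff()
non_monotonic_dates = diffs[diffs < 0].index
print(f"{len(non_monotonic_dates)} decreases found in the cumulative series")
```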

Outlier detection is conducted on the lagged-difference time series and implemented using Kats, a Python toolkit for analyzing time-series data developed by Facebook’s Infrastructure Data Science team. Kats supports a seasonal decomposition of the input time series, with additive decomposition as the default. A residual time series is produced by removing either the trend alone or both the trend and the seasonality when the seasonality is strong. Observations falling outside 3 times the interquartile range are classified as outliers. The Outlier Detection Table at the bottom of the right panel (also attached below) presents a subset of the outliers detected in the selected time series. In the future, we plan to include more options for outlier detection, such as identifying outliers in a user-specified time window or in the cross-sectional distribution.
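
The snippet below sketches how the Kats outlier step can be invoked on a lagged-difference series. It follows the OutlierDetector interface as documented for Kats, but the column names and the exact series used are assumptions for illustration.

```python
import pandas as pd
from kats.consts import TimeSeriesData
from kats.detectors.outlier import OutlierDetector

# Build a Kats TimeSeriesData object from the lagged-difference series.
diff_df = pd.DataFrame({"time": cook.index, "value": cook.diff()}).dropna()
ts = TimeSeriesData(diff_df)

# Additive decomposition by default; residuals more than 3x the IQR away
# from the quartiles are flagged as outliers.
detector = OutlierDetector(ts, decomp="additive", iqr_mult=3.0)
detector.detector()
print(detector.outliers)  # one list of outlier timestamps per value column
```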

Localities in different datasets have varying numbers of non-monotonic change points and outliers. As a result, it is hard to design a database with a fixed schema to store this information. JSON (JavaScript Object Notation) is a common storage format for this type of data: it stores data as key-value pairs and requires no fixed schema. One of our next steps is to automatically detect all these irregularities and store them in JSON for further analysis.
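
A sketch of what that JSON output might look like, keyed by dataset and FIPS code; the structure shown here is illustrative, not a final schema.

```python
import json

# Illustrative structure: one entry per dataset, then per locality.
irregularities = {
    "nyt_cumulative_cases": {
        "17031": {
            "non_monotonic_dates": [d.strftime("%Y-%m-%d") for d in non_monotonic_dates],
            "outliers": [str(t) for t in detector.outliers[0]],
        },
    },
}

with open("data_irregularities.json", "w") as f:
    json.dump(irregularities, f, indent=2)
```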

Next Steps

This is just the beginning of our work on data integrity checks. In the future, we plan to include more features in the automatic data integrity check pipeline and integrate the pipeline into our production environment. We also look forward to sharing the Data Integrity Check dashboard in an open format when finalized. If you are interested in understanding the data on the US Covid Atlas in greater detail, please check the Data Documentation page or download data for further exploration and analysis at the Data Download page.

Qiwei Lin is a Research Assistant at the Healthy Regions & Policies Lab, focusing on data engineering for the US Covid Atlas. He is pursuing his Master’s in Computational Analysis and Public Policy (MSCAPP) at the Harris School of Public Policy at the University of Chicago.
