Innovations for COVID-19

Published in

Engineering Dstilled

6 min read · Jun 1, 2020

As the COVID-19 pandemic began to unfold, our customers turned to us to help understand how this pandemic was impacting their audiences. We took the opportunity to dedicate a special team to focus our efforts on the immediate needs of our customers.

It is a priority to create tools that help brands effectively understand how COVID-19 changes how customers interact with their brands. We want brands to be able to visualize their data against COVID-19 data to understand where the pandemic is having a negative impact on their audiences, and where there is an opportunity to build market share by serving customers in a time of need.

Maps in Studio with the COVID-19 data layer

Maps Layer with COVID-19 Data

Dstillery offers an existing product, Audience Studio, which has an interactive maps section that projects audiences on a geographical display.

The two main challenges we faced were:

Identify a reliable geospatial data source for COVID-19 infections
Integrate the data into our Studio application

How Studio’s Maps Section Works

The Maps application is powered by a robust ETL pipeline, which takes the raw device data, transforms it, distills it, and outputs it conveniently as longitude and latitude coordinates (also known as geospatial data). The pipeline includes a plethora of big data tools and writes to a Postgres database. We use an in-house instance of Geoserver, which queries Postgres for geospatial coordinates and serves them as tiled images. On the front end, we use a library called OpenLayers, which integrates with Geoserver and draws these tiles on the map.

Existing ETL Pipeline for Studio’s map section

Reliable Data Source

Several geo data sources are available online for COVID-19 cases. Many local governments have infection data available to the public on their websites, while some universities and organizations release aggregated data repositories to the public as well for research purposes. The main requirements we have for a data source are:

It must be regularly updated with new cases on a daily cadence
It must be a reliable source
It must be freely available for commercial use

The first source we looked at was the Johns Hopkins University data repository, which powers their own web heatmap. This data source meets all of our criteria except the last one: it clearly states that commercial use is prohibited. Other sources we considered either had the same restriction, seemed unreliable, or didn’t have enough geographical coverage. Finally, we landed on the Corona Data Scraper project, which meets all of our requirements.

The Corona Data Scraper project is open source and released under a BSD 2-Clause “Simplified” License, which allows commercial use. The project has a good number of “stars” and open source contributors on GitHub, and its website provides a comprehensive list of the underlying sources used, with a reliability rating for each one. The data is scraped, aggregated, and updated daily, which fits our needs perfectly. It is provided as a CSV and already includes aggregations at different levels of geospatial granularity.

Consuming Data and Integrating Into Studio

With a reliable data source in hand, we need an efficient way to integrate it into our Studio application. The requirements for this are:

Usability: We want the COVID-19 maps layer to be available alongside our standard audience layers so that we can inform clients of how their audiences geographically align with infection rates
Speed: We need to get the feature into production FAST. We want to make this information available to our clients during the time when brands are most in the dark on how COVID-19 may affect their customers

Any option involving our existing data pipeline would demand major effort to adapt the external data to the formats expected by the Geoserver queries. That conflicts with our need to ship quickly, and it risks disrupting systems that already work.

This requires some creative thinking. Why do we need to involve our data pipeline at all? Our ETL jobs do a lot to cross-reference, aggregate, and distill device data, but in this case, the data is already aggregated and available as geographic coordinates. Including Postgres and Geoserver in fetching the data would be redundant: we discovered that, in addition to its Geoserver support, the OpenLayers library we use for the web map can also load simple GeoJSON files. We can do minimal data processing and aggregation to transform the daily CSV from the Corona Data Scraper project into a simple GeoJSON file. Since the data is quite small (around 500 KB), the application can easily cache it in memory and then refresh the cache whenever new data is available.
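To make the caching idea concrete, here is a minimal sketch of an in-memory holder for the GeoJSON payload. The class name `GeoJsonCache` and its methods are hypothetical and are not taken from the Studio codebase. The sketch only assumes that the file is small enough to read in one go and that request threads may read while another thread swaps in fresh data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory holder for the small GeoJSON payload (illustrative only). */
public class GeoJsonCache {

    // AtomicReference lets request threads keep reading the old payload
    // until a refresh swaps in the new one in a single step.
    private final AtomicReference<String> current = new AtomicReference<>("");

    /** Reads the whole file and replaces the cached copy. */
    public void reload(Path file) throws IOException {
        String json = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        current.set(json);
    }

    /** Returns the most recently loaded payload, or an empty string before the first load. */
    public String get() {
        return current.get();
    }
}
```

A controller could return `cache.get()` directly to the front end, while whatever detects a new file calls `cache.reload(path)`.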

From here, it is fast, smooth sailing. We have a data-pull Python script that downloads the CSV from Corona Data Scraper and applies minimal business logic to keep only US data at the county level, using fairly standard libraries such as Pandas. After that, the script extracts the coordinates and county names from the filtered CSV and writes them out as a GeoJSON file.

Studio’s COVID-19 layer geo data process

In the application, we set up a simple file listener using java.nio’s WatchService, which runs on a thread separate from the rest of the Spring application. What’s ideal about WatchService (introduced in Java 7) is that our code doesn’t have to periodically poll the file system for updates. While waiting for notifications, the listener thread simply blocks and consumes essentially no CPU.

WatchService code snippet from Studio backend. First, we register the target directory’s Path object with a WatchService object, to get notified whenever a new entry is created. Then, the `watchService.take()` call blocks until such an event occurs.
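The register-then-block pattern described here can be sketched as follows. This is not the Studio backend code. The class name `NewFileWatcher`, the thread name, and the `onNewFile` callback are assumptions made for illustration. Only the `java.nio.file` calls themselves (`newWatchService`, `register`, `take`, `pollEvents`, `reset`) are the standard API.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.function.Consumer;

/** Hypothetical directory listener built on java.nio's WatchService (illustrative only). */
public class NewFileWatcher {

    /**
     * Blocks until the next batch of events arrives, then hands each newly
     * created file to the callback. Returns false once the key is no longer valid.
     */
    static boolean awaitNewFiles(WatchService watchService, Path dir, Consumer<Path> onNewFile)
            throws InterruptedException {
        WatchKey key = watchService.take(); // blocks until an event is signalled
        for (WatchEvent<?> event : key.pollEvents()) {
            // OVERFLOW events carry no path, so only ENTRY_CREATE is handled.
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                Path relative = (Path) event.context(); // file name relative to dir
                onNewFile.accept(dir.resolve(relative));
            }
        }
        return key.reset(); // re-arm the key so later events are delivered
    }

    /** Registers the directory and starts a daemon thread that waits for new files. */
    public static Thread start(Path dir, Consumer<Path> onNewFile) throws IOException {
        WatchService watchService = FileSystems.getDefault().newWatchService();
        dir.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);
        Thread thread = new Thread(() -> {
            try {
                while (awaitNewFiles(watchService, dir, onNewFile)) {
                    // keep waiting for the next batch of events
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop quietly when interrupted
            }
        }, "covid-geojson-watcher");
        thread.setDaemon(true); // do not keep the JVM alive just for the watcher
        thread.start();
        return thread;
    }
}
```

Running the loop on its own daemon thread keeps the blocking `take()` call away from request-handling threads. The callback is where a cache refresh would be triggered.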

Visual Features

For the user interface, we have a simple toggle button similar to our existing night mode toggle. This button enables or disables the COVID-19 map layer. It highlights the COVID-19 layer feature as an added dimension of valuable metadata to existing audience offerings. As an unexpected bonus, toggling both the new COVID-19 layer and our existing night mode on makes things look way cool!

After the initial release, we added alerting and failsafe mechanisms against script failures and data outages by monitoring file sizes and timestamps. We also set up a scheduled cleanup script so that old data is kept only for a predefined retention period.
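As a rough illustration of a size-and-timestamp check, the sketch below flags a data file that is missing, suspiciously small, or older than expected. The class name `DataFileHealth`, the method `needsAlert`, and the thresholds passed to it are all hypothetical. The real monitoring setup and its limits are not described in this post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical health check for the generated data file (illustrative only). */
public class DataFileHealth {

    /**
     * Returns true when the file is missing, smaller than minBytes,
     * or was last modified more than maxAge before the given instant.
     */
    public static boolean needsAlert(Path file, long minBytes, Duration maxAge, Instant now)
            throws IOException {
        if (!Files.isRegularFile(file)) {
            return true; // the script never produced a file
        }
        if (Files.size(file) < minBytes) {
            return true; // truncated or empty output
        }
        Instant modified = Files.getLastModifiedTime(file).toInstant();
        return modified.isBefore(now.minus(maxAge)); // stale data
    }
}
```

For a daily feed, a caller might pass a `maxAge` somewhat longer than 24 hours so that a slightly late update does not raise a false alarm.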

Closing Thoughts and Next Steps

The COVID-19 crisis reinforces how quickly we must adapt to help our customers through changing situations. It is an excellent reminder that software engineering projects should always strive for simplicity, speed, and practicality. As seen with this project, it’s not always practical to do things “as we’ve always done.” It’s also not necessarily easier to build on existing software; in some cases, it’s beneficial to write new, independent processes that can be significantly simpler. Had we forced the COVID-19 data through our existing geo data pipeline, we would have ended up with a much larger feature, a longer development process, and likely quite a bit of custom logic somewhere along the existing pipeline. Instead, we were able to produce a quick solution to an ephemeral problem, with a minimal footprint on our codebase.

At the time of writing, we are still adapting to the physical, mental, sociological, economic, and technical challenges created by the COVID-19 situation. If we continue to work together to simplify, solve, and improve upon every challenge we face, we will persevere. See you all, IN PERSON, on the other side!