Visualizing COVID-19 beyond counties
To explore more data on COVID-19, please go to covid19.topos.com
One of the many barriers to understanding the impact of COVID-19 in the United States is that most data is reported by states at the county or city level. There are plenty of good reasons for this (availability, privacy, etc.), but when the goal is to show the geographic distribution of the outbreak, some visual approaches can mislead or encourage conclusions that the available information does not support.
There are two primary methods in use for displaying absolute counts of cases and deaths on a map. The first is the choropleth map, where county geographies are shaded evenly to represent the underlying data. This is the technique we use for our Covid Compiler project.
Every display choice involves tradeoffs, and several are made here. One issue specific to choropleths is that county area is essentially arbitrary yet carries outsized visual weight in the representation. For example, the large counties of the southwest account for much of the deepest red shading on the map even though we know much of that area consists of unpopulated national park land.
A second method is the proportional symbol map, where instead of shading an area, an icon is placed at the centroid of each county or city and scaled to represent that location’s value.
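As a rough illustration of how such a symbol might be scaled, the sketch below sizes a circle so that its area (rather than its radius) is proportional to the value. This is a common convention for proportional symbols and an assumption on our part here, not a description of any particular map; the function name and the pixel sizes are hypothetical.

```python
import math

def symbol_radius(value, max_value, max_radius):
    """Return a circle radius whose area is proportional to `value`.

    `max_radius` is the radius given to the largest value on the map.
    Scaling by the square root keeps area, not radius, proportional.
    """
    if max_value <= 0 or value <= 0:
        return 0.0
    return max_radius * math.sqrt(value / max_value)

# Hypothetical counts: a county with a quarter of the maximum count
# gets a circle with half the radius, and so a quarter of the area.
largest = symbol_radius(100, 100, 40)
smaller = symbol_radius(25, 100, 40)
```

Scaling the radius directly would exaggerate large values, since a circle with twice the radius covers four times the area.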
Again, there are tradeoffs being made to accomplish different goals. One difference from the choropleth is the visibility of the case distribution. It’s easier in this representation to make out detailed differences between counties in the south that have low counts while still getting a feel for how much greater the counts are in and around the heavily impacted New York City area.
However, as in the choropleth, the arbitrary nature of geographic boundaries has a serious impact on how this map is read. For instance, if the New York City count were broken up into its five constituent counties, the resulting symbols might blend in with those of other locations in the northeast corridor, obscuring the scale of the outbreak there. Likewise, if locations elsewhere were aggregated into fewer counties, their counts might appear misleadingly severe.
In this article we address the issue of county boundaries by breaking apart the underlying absolute counts and putting together a map that better represents the impact COVID-19 is having across the country (while still making some tradeoffs).
To accomplish this, we first disaggregate county totals by distributing individual cases or deaths within the geographical boundaries of each county. These locations are chosen at random: first uniformly, and then with weights based on correlations with metrics available at a more granular level (census tract, raster cell, etc.).
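The uniform version of this disaggregation step can be sketched with rejection sampling: draw random points in the county's bounding box and keep those that fall inside the boundary. This is a minimal, standard-library sketch under several assumptions: the county is a single simple polygon, coordinates are treated as planar (which ignores the distortion of longitude/latitude degrees), and the polygon, case count, and function names are all hypothetical.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon`,
    given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def scatter_uniform(polygon, count, rng):
    """Place `count` points uniformly at random inside `polygon`
    by sampling the bounding box and rejecting points outside it."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < count:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

# Hypothetical L-shaped "county" and a made-up total of 500 cases.
county = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0), (1.0, 2.0), (0.0, 2.0)]
cases = scatter_uniform(county, 500, random.Random(42))
```

Each county's reported total becomes that many individual points, none of which carries any real information about where within the county the case occurred.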
In a second step, we reaggregate those individual values using a discrete global grid. In this article, we use S2 cells, which recursively divide the globe into approximately equal-area pixels, allowing us to choose the level of precision in our visualization.
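The reaggregation step amounts to counting points per grid cell. The sketch below uses a plain longitude/latitude grid as a stand-in for S2 cells, purely to keep the example free of dependencies; a real S2 index is built differently and keeps cell areas far more even. The cell size, coordinates, and function names are hypothetical.

```python
import math
from collections import Counter

def grid_cell(lng, lat, cell_size_deg):
    """Return the integer (column, row) index of the cell containing a point."""
    return (math.floor(lng / cell_size_deg), math.floor(lat / cell_size_deg))

def reaggregate(points, cell_size_deg):
    """Count how many (lng, lat) points fall in each grid cell."""
    return Counter(grid_cell(lng, lat, cell_size_deg) for lng, lat in points)

# Made-up disaggregated case locations as (lng, lat) pairs.
points = [
    (-73.95, 40.65),
    (-73.94, 40.66),
    (-73.82, 40.72),
    (-73.81, 40.71),
    (-73.83, 40.73),
]
counts = reaggregate(points, 0.1)
```

A smaller `cell_size_deg` plays the role of a deeper grid level: it gives finer pixels, at the cost of fewer points per cell.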
Map 1: Distributing cases without weights
For our first attempt, we chose to distribute cases uniformly at random. The visual effect is similar to that of the choropleth map.
Notice the county boundary between Brooklyn and Queens. Intuitively, there’s no reason to believe that the spread of the virus has somehow come to a halt between Bushwick and Ridgewood, but visually this strategy creates a threshold at the arbitrary boundary between the two counties.
Furthermore, we know from news reports and higher resolution information provided by the City of New York that Queens has seen one of the more significant outbreaks in the city. Here, however, it appears to have a lower intensity than its neighboring borough, because the disparate county sizes visually dilute the represented value.
We can address these issues by leveraging a more detailed quantitative understanding of the underlying geographies within these counties.
Map 2: Using population data to distribute cases
In our next attempt we apply weights to determine how cases are distributed within counties. The most intuitive metric to use here is population, based on the assumption that case counts of COVID-19 are proportional to the number of people living within a given area. Like many metrics provided by the American Community Survey (ACS), population data is available by census block group, a considerably more granular unit than county.
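One way to sketch this weighting is to assign each case in a county to a block group with probability proportional to that block group's population. The block group names, populations, and case total below are invented for illustration, and the function is a hypothetical helper rather than the code behind the maps.

```python
import random
from collections import Counter

def allocate_by_weight(total_cases, weights, rng):
    """Assign each case to a sub-area with probability proportional
    to its weight. `weights` maps sub-area name to a non-negative
    number (here, population); at least one weight must be positive."""
    names = list(weights)
    draws = rng.choices(names, weights=[weights[n] for n in names], k=total_cases)
    return Counter(draws)

# Hypothetical block groups within one county and their populations.
block_group_population = {"bg_a": 1200, "bg_b": 300, "bg_c": 0}
allocation = allocate_by_weight(1000, block_group_population, random.Random(7))
```

Once each case has a block group, it can be placed uniformly at random inside that block group's boundary, so unpopulated areas such as `bg_c` receive no cases.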
Looking again at New York, we see that the arbitrary threshold discussed above has dissolved and that the magnitude of the impact on Queens is now more clearly visible.
This map shows the same data, but without the two visually misleading effects present in our first map.
Problems that remain and where to go from here
The assumption made earlier, that people within a given area are equally likely to have contracted COVID-19, may hold at a very high level but is problematic both intuitively and empirically. We’ve seen significant outbreaks in high-risk workplaces, disproportionate deaths in historically disadvantaged communities, and case counts that correlate negatively with density.
That’s why it’s critical, when using the mapping method described in this article (or any visualization), to understand the tradeoffs involved and to work to ensure that the overall representation does not create a sense of false precision or suggest conclusions that are unwarranted by the available data. Our goal has been to add incremental improvements to the geographic methods used to understand COVID-19.
One application of this methodology that could help clarify existing geographic reporting is to use metrics known to correlate with infection and death from COVID-19 as weights in the distribution, a technique we’ll be exploring in later posts.