COVID-19 Could Hit Low Income Areas Hardest. Here’s What’s Happening Right Now.

Published in The Startup, Apr 3, 2020 (8 min read)

As news of COVID-19 hitting the US has come out over the past couple of weeks, I’ve been wondering whether COVID-19 is affecting people in different income brackets differently. Our healthcare system depends so heavily on how much money people make (think privatized hospitals and privatized insurance) that it seems likely income would play a part in how people fare during this pandemic. Here are the results I found.

Data Collection Process

I first decided to work with data at the county level; states are too big to reflect differences in income levels, while towns are likely too small to provide meaningful data at this point. I then collected median income and population data from the US Department of Agriculture’s Economic Research Service, and have been working with two COVID-19 datasets: one from the New York Times and one from USAFacts. Both COVID-19 datasets report confirmed cases and confirmed deaths broken down by county.

Notes on the Data

Before delving too deeply into what this data shows, it’s important to note a few caveats about these datasets. Here are some of the most important ones:

The population data and median income data are both from 2018, which means two things. One, the data is a bit outdated, and two, any county whose FIPS code changed between 2018 and now was excluded from this study.
The New York Times dataset and the USAFacts dataset do not reflect exactly the same data. The New York Times dataset does not include counties with 0 cases and 0 deaths, while the USAFacts dataset does. In some of the graphs I’ve excluded those counties from the USAFacts data as well for scaling purposes, but in all the statistical analysis I left those counties in. This means that the New York Times dataset (2410 counties) is smaller than the USAFacts dataset (3132 counties). The U.S. has between 3,007 and 3,142 counties, depending on what you count.
I’ve excluded all 4 areas from the New York Times dataset on this list except for New York City. The New York Times dataset includes all of New York City as one datapoint, while the USAFacts dataset has divided the New York City data by county.
This data does not include Washington D.C. or U.S. Territories.
The data should be relatively up to date. I’ll try to update the graphs & statistical outputs every couple of days, and there’s a note at the top of the data section about how recent the data is.
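To make the county matching described in these notes concrete, here is a minimal sketch in Python. It is illustrative only and not necessarily how the analysis here was coded: the function name, field names, FIPS codes, and numbers are all made up. It joins income and population records to case counts on FIPS code, and either drops counties that are missing from the COVID-19 data or fills them in with zeros.

```python
# All values below are made up for illustration; "00101" etc. are not real FIPS codes.
# FIPS codes are kept as strings so that leading zeros are not lost.
demographics_2018 = {
    "00101": {"median_income": 52000, "population": 40000},
    "00102": {"median_income": 71000, "population": 250000},
    "00103": {"median_income": 38000, "population": 9000},
}

# A dataset that, like the New York Times one, omits counties with no reports.
covid_counts = {
    "00101": {"cases": 12, "deaths": 0},
    "00102": {"cases": 340, "deaths": 7},
    "00999": {"cases": 5, "deaths": 0},  # no 2018 match, so it will be dropped
}


def join_by_fips(demographics, counts, keep_zero_counties):
    """Attach case and death counts to each county's income and population.

    Counties present in `counts` but not in `demographics` (for example,
    because their FIPS code changed) are silently excluded.
    """
    rows = []
    for fips, demo in demographics.items():
        county_counts = counts.get(fips)
        if county_counts is None:
            if not keep_zero_counties:
                continue  # county absent from the COVID-19 data: leave it out
            county_counts = {"cases": 0, "deaths": 0}  # assume absence means zero
        rows.append({"fips": fips, **demo, **county_counts})
    return rows


without_zeros = join_by_fips(demographics_2018, covid_counts, keep_zero_counties=False)
with_zeros = join_by_fips(demographics_2018, covid_counts, keep_zero_counties=True)
print(len(without_zeros), len(with_zeros))
```

Whether a missing county is dropped or counted as zero changes the sample size, which is one reason the two datasets give somewhat different statistics.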

Fun with Stats

Here’s a quick explanation of how to interpret the numbers below — feel free to skip this if you have a basic understanding of stats (or skip to the Data Summary section if you want a description with no stats at all).

For each graph below, I calculated two values: r values and p values.

R values measure how closely correlated two variables are: the closer to 1 or -1, the stronger the correlation, while values near 0 mean little or no correlation. In psychology, Cohen’s effect size conventions are often used, which say that an R value (ignoring its sign) between .1 and .3 is small; between .3 and .5 is medium; and greater than .5 is large. I’ll be using those cutoffs here, though this “study” doesn’t fit neatly into psychological research, and I’m open to suggestions of more accurate standards to use in the comments.
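For readers who want to see the arithmetic, here is a minimal sketch of a correlation coefficient and the cutoffs just described. It assumes the Pearson formula (the most common r, though the text does not say which was used), and the function names and the handling of values exactly on a cutoff are my own choices.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and the variation of each on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable never changes.
    return co_variation / math.sqrt(variation_x * variation_y)


def cohen_label(r):
    """Label an r value using the .1 / .3 / .5 cutoffs, ignoring its sign."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "below small"


print(pearson_r([1, 2, 3], [2, 4, 6]))
print(cohen_label(0.218), "|", cohen_label(0.052))
```

By these cutoffs, an r of 0.218 counts as small and an r of 0.052 falls below even the small threshold.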

P values measure how surprising a correlation would be if the two variables were actually unrelated. A p value of .05 means that, if there were truly no relationship, you would see a correlation at least this strong about 5% of the time just by chance (.05 is the cutoff generally used in social science research). Here, I’ll consider any p value of less than .05 “statistically significant,” i.e., a correlation this strong would be unlikely to show up by accident alone. Note that a small p value is not the same thing as the probability that the correlation is real, and it says nothing about how large or important the correlation is.
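One way to build intuition for p values is a permutation test: shuffle one variable many times, which breaks any real relationship, and count how often the shuffled data produce a correlation at least as strong as the real one. The sketch below is a different technique from the standard formula-based p value and is not necessarily how the p values here were computed; the function names, the number of shuffles, and the toy data are assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # pairs each x with a random y
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # The +1 counts the observed arrangement itself, so p is never exactly 0.
    return (at_least_as_strong + 1) / (n_shuffles + 1)


# Toy data: y rises steadily with x, so shuffling almost never matches it.
toy_x = list(range(20))
toy_y = [2 * x + (x % 3) for x in toy_x]
print(permutation_p_value(toy_x, toy_y))
```

A small result means shuffled (unrelated) data rarely look as correlated as the real data, which is the sense in which a low p value counts as "significant."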

The Data

Here are all the graphs and stats I have based on these datasets. The data in both sets goes through April 5th. Significance is indicated with * for p <= .05, ** for p <= .01, and *** for p <= .001.
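The star notation can be written as a small lookup. This is only a sketch of the convention stated above, with a hypothetical function name.

```python
def significance_stars(p):
    """Return the star marker for a p value, using the cutoffs stated above."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""  # not significant at the .05 level


for p in (5.581e-09, 0.008, 0.0112, 0.161):
    print(p, significance_stars(p))
```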

Median County Income vs Reported Cases

From New York Times dataset; r=0.118***, p=5.581x10^-09

From USAFacts dataset; r=0.175***, p=6.092x10^-23

Median County Income vs Reported Deaths

From New York Times dataset; r=0.084***, p=4.028x10^-05

From USAFacts dataset; r=0.125***, p=2.206x10^-12

Median County Income vs Cases Per Capita

From New York Times dataset; r=0.179***, p=8.168x10^-19

From USAFacts dataset; r=0.218***, p=6.130x10^-35

Median County Income vs Deaths Per Capita

From New York Times dataset; r=0.052*, p=0.0112

From USAFacts dataset; r=0.047**, p=0.008

Median County Income vs Deaths Per Case

From New York Times dataset; r=-0.029, p=0.161

Data Summary (or, What Do The Graphs & Stats Show?)

I used these datasets to look at 5 different things:

Median county income vs reported cases
Median county income vs reported deaths
Median county income vs reported cases per capita
Median county income vs reported deaths per capita
Median county income vs reported deaths per case
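The five outcomes above can be derived from three raw numbers per county. Here is a minimal sketch, with a hypothetical function name and made-up inputs. Note that deaths per case is undefined for a county with zero reported cases, so this sketch returns None there rather than guessing.

```python
def derived_metrics(cases, deaths, population):
    """Compute the five outcome variables for one county."""
    return {
        "cases": cases,
        "deaths": deaths,
        "cases_per_capita": cases / population,
        "deaths_per_capita": deaths / population,
        # Undefined when there are no reported cases to divide by.
        "deaths_per_case": deaths / cases if cases > 0 else None,
    }


# Made-up county: 50 reported cases and 2 reported deaths among 10,000 people.
example = derived_metrics(cases=50, deaths=2, population=10000)
print(example)

# A county with no reported cases has no deaths-per-case value at all.
print(derived_metrics(cases=0, deaths=0, population=10000)["deaths_per_case"])
```

Counties with a None value would have to be left out of any deaths-per-case correlation.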

Though their specific results were not exactly the same, both datasets showed significant positive correlations for the first four but not for the last one (deaths per case, reported above for the New York Times dataset only). All four are small at most by the cutoffs above: r ranged from about .05 to .22, and several values, including both deaths-per-capita correlations, fall below the .1 threshold for even a “small” effect. In other words, counties with higher median incomes tended to have slightly more reported cases, reported deaths, reported cases per capita, and reported deaths per capita. Keep in mind, though, that the data on cases doesn’t directly reflect how many people are infected; rather, it reflects how many people tested positive. The correlations among cases may therefore point to testing being more accessible for people in higher income counties.

What Does It All Mean — And What Does It Not Mean?

As I’m sure we’ve all heard before, correlation is not causation, and this is no exception. We’ve found a few small, significant correlations here, but this data says absolutely nothing about what’s causing those correlations. Maybe it’s just that lower income areas have less well-funded hospitals and less testing available. But maybe those are also areas where more “essential workers” live, people who have no choice but to leave their houses for work every day and don’t have enough time or money to get tested (the New York Times just published an article on income and people’s ability to stay at home). Maybe it’s related to the quality of public education and the general public’s understanding of how COVID-19 spreads and the importance of tracking it. Maybe it has to do with something totally different. This data does not give us any insight into which of these factors are causing these correlations.

I also want to stress that although the effects found here are statistically significant, the effect sizes are small, and it is very possible that their significance is being driven by how large the sample size is rather than by a meaningful trend in the data. Because p values (what we use to decide if data is significant) are calculated partly based on how many data points we started with, it’s much easier to find p values that show significance when you have large datasets like the ones we worked with here. With roughly 2,400 to 3,100 counties, a correlation as small as about .04 is enough to reach p < .05. That fact, combined with the fact that the effects found here were fairly small, means the effects may well be technically statistically significant without reflecting meaningful issues or trends in the real world.
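The link between sample size and significance can be checked with a few lines of Python. This sketch uses the Fisher z-transformation, a large-sample approximation rather than the exact test, so its results will differ slightly from the p values reported above. It also assumes the sample sizes are the dataset sizes quoted earlier (2410 and 3132 counties); the function names are hypothetical.

```python
import math

Z_CRIT_05 = 1.959964  # two-sided 5% cutoff of the standard normal distribution


def approx_p_value(r, n):
    """Approximate two-sided p value for a correlation r from n data points."""
    z = math.atanh(r) * math.sqrt(n - 3)  # Fisher z-transformation
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


def smallest_significant_r(n, z_crit=Z_CRIT_05):
    """Smallest |r| that reaches p < .05 (approximately) with n data points."""
    return math.tanh(z_crit / math.sqrt(n - 3))


for n in (30, 300, 2410, 3132):
    print(n, round(smallest_significant_r(n), 3))

# The same r is far "more significant" with more counties behind it.
print(approx_p_value(0.118, 30), approx_p_value(0.118, 2410))
```

With only 30 data points a correlation needs to be fairly strong to count as significant, while with a few thousand counties a correlation of about .04 is enough, which is why significance alone says little about whether an effect matters.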

That being said, here we specifically looked at county-level datasets. Even if the data here don’t mean much for the real world, that doesn’t mean that lower income people aren’t being disproportionately affected on an individual or institutional basis (again, see the recent New York Times article on income and people’s ability to stay home, and another article on low income areas in New York City being hit hardest). I wouldn’t be surprised if they are, but it’s difficult to examine because information about individual patients is so limited. If anyone has thoughts about other ways this could be done, I’d love to hear them in the comments.

Where To Go from Here

Though this data doesn’t tell us where these issues are coming from, we can be sure that they are intertwined with already existing structures of inequality in the US. The very fact that there’s no one clear cause of this shows just how widespread the inequality causing these correlations is. That structural inequality will be hard to dismantle overnight, though we certainly can — and should — try.

In the shorter term, I’d be curious to know what’s happening in other countries with differently structured healthcare systems. What do the numbers look like in countries with universal healthcare coverage? Where is this not happening? And what are the healthcare policies that are preventing this from happening in places where we don’t see similar effects?

On that note, all of the data I’ve used is available publicly, listed below, so that you can do your own statistical explorations of what’s going on. I’ve also posted all of my code to GitHub and would be more than happy for others to use and build on it.

If you have questions/feedback/ideas for future directions to take this, please feel free to drop them in the comments below! This is an ongoing project and I’d love to hear thoughts on revisions to make and directions to explore more.

Publicly available datasets I used:

https://www.ers.usda.gov/data-products/county-level-data-sets/
https://github.com/nytimes/covid-19-data (or, see the visualization of this data at https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html)
https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

Other potentially useful datasets I’ve run across:

https://www.bing.com/covid/graphdata