Predicting COVID-19 With Tax Returns

To explore more data on COVID-19, please go to

Jun 10, 2020 · 7 min read

Income disparities and COVID-19

The relationship between income inequality and COVID-19 has been widely covered over the last 60 days. Findings show that deaths and hospitalizations are much higher in low-income neighborhoods and in cities with high levels of inequality.[1] Most of these studies rely on income data provided by the US Census, which is self-reported, often extrapolated from relatively small samples (as in the ACS[2]), and fairly simple in the way income is measured (i.e., “Median Household Income”). A more nuanced and complete view of income can be gleaned from IRS income tax data. Tax data is publicly available only at the zip code level, but it provides a highly detailed economic portrait of neighborhoods, particularly in the types of deductions that are claimed (dependents, capital gains, education credits, etc.). And rather than being voluntarily self-reported (as is the case with the Census/ACS), tax returns are mandated by law, with consequences for false reporting.

Table showing correlations between cases per capita and various IRS income metrics
Bars indicate the average child tax credit amount per capita, while the red lines indicate cases per capita
Total deaths related to COVID-19 for zip codes in the top and bottom quartiles by proportion of tax returns claiming capital gains

Predicting Cases

With so many strong linear relationships between per capita COVID-19 cases and IRS income metrics in New York, we decided to build a regression model to see whether we could predict cases[3] within a zip code using only tax data.[4]

Predicted vs. actual cumulative cases per capita in NYC. R² = 0.88, RMSE = 9%
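To make the two reported metrics concrete, here is a minimal sketch of the evaluation step. It is not the model used in this article: it fits a one-variable least-squares line to a square-root-transformed target, and it assumes the RMSE is normalized by the range of the actual values, which is only one common convention. All numbers and variable names are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actual values (an assumed convention)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (max(actual) - min(actual))

# Made-up illustrative values, not data from the article.
child_credit_per_capita = [120.0, 340.0, 560.0, 610.0, 900.0]
cases_per_capita = [0.010, 0.018, 0.027, 0.031, 0.044]

# Square-root transform of the right-skewed target.
target = [math.sqrt(c) for c in cases_per_capita]

slope, intercept = fit_line(child_credit_per_capita, target)
predicted = [slope * x + intercept for x in child_credit_per_capita]

r2 = r_squared(target, predicted)
nrmse = normalized_rmse(target, predicted)
print(f"R^2 = {r2:.2f}, NRMSE = {nrmse:.1%}")
```

Because the target is transformed, both metrics here are computed on the square-root scale; reporting them on the original scale would require squaring the predictions first.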

Extensibility to other geographies

With these strong results in hand, we wanted to see how well a model trained on NYC tax data extends to other geographies. Tax and COVID-19 data are highly skewed: NYC has far higher incomes and case counts than anywhere else in the country. To mitigate this, we looked at tax and COVID-19 data normalized relative to each region. We gathered data for four additional cities (Chicago, Baltimore, San Francisco, and Richmond) and ran the model to predict normalized cumulative cases for each city. Model accuracy is shown in the table below:
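As an illustration of normalizing relative to the region, the sketch below converts a per-zip-code value into a z-score computed separately within each city. The exact normalization used in this analysis is not specified, so the z-score, the field names, and the sample rows are all assumptions.

```python
from statistics import mean, pstdev

def normalize_within_region(rows, field):
    """Return z-scores of `field`, computed separately for each region."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row[field])
    # Per-region mean and population standard deviation.
    stats = {region: (mean(vals), pstdev(vals)) for region, vals in by_region.items()}
    result = []
    for row in rows:
        mu, sigma = stats[row["region"]]
        # Guard against a region where every value is identical.
        result.append((row[field] - mu) / sigma if sigma else 0.0)
    return result

# Made-up rows with placeholder zip labels, not data from the article.
rows = [
    {"region": "NYC", "zip": "A", "cases_per_capita": 0.040},
    {"region": "NYC", "zip": "B", "cases_per_capita": 0.020},
    {"region": "Chicago", "zip": "C", "cases_per_capita": 0.012},
    {"region": "Chicago", "zip": "D", "cases_per_capita": 0.006},
]

z_scores = normalize_within_region(rows, "cases_per_capita")
print(z_scores)
```

After this step, a zip code with twice the NYC average and one with twice the Chicago average land on a comparable scale, even though their raw case counts differ widely.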

Model performance in other geographies. Cosine Distance measures similarity to NYC, smaller => more similar
Left: choropleth of model performance by zip code in Chicago. Right: choropleth of model performance by zip code in Baltimore. Darker => model is more accurate
Model performance across various geographical training sets
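For readers unfamiliar with the similarity measure in the table caption, cosine distance is one minus the cosine of the angle between two vectors. The sketch below shows the calculation; the feature vectors are made-up placeholders, and how each city was actually summarized into a vector is an assumption not covered here.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical region-level feature vectors (made-up values).
nyc = [0.8, 0.3, 0.5]
other_city = [0.6, 0.4, 0.5]

distance_to_nyc = cosine_distance(nyc, other_city)
print(distance_to_nyc)
```

Cosine distance ignores vector magnitude, so two cities with the same mix of tax characteristics score as similar even if one is much larger or wealthier overall.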
  1. The ACS is sent to approximately 295,000 addresses monthly (or 3.5 million per year).
  2. Of course, the number of cases in New York is growing by the day, so for this exercise we are predicting the relative/normalized number of cumulative cases in New York for a particular date (May 30, 2020).
  3. As the data is highly right-skewed (meaning just a few zip codes have very large case counts, while many others have relatively low counts), we transformed the data by taking the square root to obtain a more nearly normal distribution.
  4. Extending the built-in list of kernels in scikit-learn to include a Power kernel (also known as the unrectified triangular kernel).
  5. Normalized root mean squared error.
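As a rough illustration of the Power kernel mentioned in the notes above, the sketch below uses its usual definition, k(x, y) = -||x - y||^d, and builds the pairwise kernel (Gram) matrix. Which scikit-learn estimator was extended and what degree was used are not stated, so the function names and the default `degree` are assumptions.

```python
import math

def power_kernel(x, y, degree=1.0):
    """Power kernel: k(x, y) = -||x - y||^degree (Euclidean norm)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return -(dist ** degree)

def gram_matrix(X, Y, degree=1.0):
    """Pairwise kernel values between the rows of X and the rows of Y."""
    return [[power_kernel(x, y, degree) for y in Y] for x in X]

# Made-up feature rows for illustration.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
K = gram_matrix(X, X)
for row in K:
    print(row)
```

scikit-learn's support vector estimators accept a callable as the `kernel` argument, provided it returns the Gram matrix for two data matrices, so a vectorized version of `gram_matrix` is one plausible way to plug a custom kernel in.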
