Can US County Features Help Predict the Spread of a Future Pandemic?

Arman Berek
Jul 21, 2020

The coronavirus took a huge toll on the United States and showed that the country was not prepared for such an event. I wondered whether data could be used to find counties that are more vulnerable to the spread of a pandemic than others.

Using county data collected from The New York Times, the Census, the CDC, and Google, I created a dataset to predict whether a particular county would have 0.1%, 0.2%, or 1% of its population infected with COVID-19.
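To make the three prediction targets concrete, here is a minimal sketch of how the infected share of a county could be turned into yes/no labels. The function name, the example numbers, and the choice of "at least" (rather than "more than") for each threshold are assumptions for illustration, not necessarily how the dataset was built.

```python
# Thresholds from the post: 0.1%, 0.2%, and 1% of the population.
THRESHOLDS = (0.001, 0.002, 0.01)


def infection_labels(cases, population, thresholds=THRESHOLDS):
    """Return {threshold: bool} telling whether the infected share
    of the population meets each threshold (assumed: share >= threshold)."""
    if population <= 0:
        raise ValueError("population must be positive")
    share = cases / population
    return {t: share >= t for t in thresholds}


# Hypothetical county: 150 confirmed cases among 100,000 residents (0.15%).
labels = infection_labels(cases=150, population=100_000)
print(labels)
```

Each threshold then becomes its own binary classification target, which matches the idea of predicting the 0.1%, 0.2%, and 1% levels separately.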

Map of counties with 1% of the population infected vs. counties predicted to have 1% of the population infected.

The dashboard above shows how well the predictions held up: only 37 of the 2,713 recorded counties were mispredicted at the 1% infection threshold. So what makes the results as good as they are?
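As a quick check on what 37 mispredictions out of 2,713 counties means, the error rate and accuracy can be worked out directly from those two numbers. The variable names are just for illustration.

```python
mispredicted = 37       # counties the model got wrong at the 1% threshold
total_counties = 2713   # recorded counties in the analysis

error_rate = mispredicted / total_counties
accuracy = 1 - error_rate

print(f"error rate: {error_rate:.2%}, accuracy: {accuracy:.2%}")
# prints "error rate: 1.36%, accuracy: 98.64%"
```

Accuracy on its own does not show how many counties actually reached the 1% level, so it is worth reading next to the map rather than in isolation.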

What information can help distinguish vulnerability to pandemic spread?

Using a variety of county data, I pinpointed the top 6 features my model used. Those features are:

  1. Estimated proportion of population with limited English speakers.
  2. Estimated proportion of minority population.
  3. Estimated proportion of population with no high school diploma.
  4. Estimated number of mobile homes in the county.
  5. Estimated number of people with no vehicle.
  6. Estimated number of housing units in structures with at least 10 units.
Top 6 most predictive features.

Another feature worth looking at is the land area of each county. Counties that are smaller in area are more likely to be predicted to have faster spread, possibly because they have a higher population density than larger counties.
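The link between county size and density is simple arithmetic: the same population in a smaller area gives a higher density. A small sketch, with made-up population and area figures and a hypothetical helper name:

```python
def population_density(population, land_area_sq_mi):
    """People per square mile; land area must be positive."""
    if land_area_sq_mi <= 0:
        raise ValueError("land area must be positive")
    return population / land_area_sq_mi


# Two hypothetical counties with the same population but different sizes.
small_county = population_density(500_000, 50)      # 10000.0 people per sq mi
large_county = population_density(500_000, 5_000)   # 100.0 people per sq mi
print(small_county, large_county)
```

A small county is only denser when its population is comparable, so density itself may be a more direct feature than area alone.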

What insights can be gathered from this analysis?

One thing I noticed in my results is that counties with a large number of mobile homes are less likely to see the virus spread within the county. This may be due to the social distancing that comes with living in a mobile home.

Scatter plot of Estimated mobile homes vs percentage of population infected.

Another interesting insight is that counties with a higher proportion of limited English speakers also tend to be predicted to have more spread within the county. However, I have not yet found other data or an explanation to support this result.

Scatter plot of Estimated percent of limited English speaking population vs percentage of population infected.
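One way to put a number on the trends in the two scatter plots above is a Pearson correlation coefficient between a feature and the percentage of the population infected. The sketch below uses only the standard library, and the sample values are invented purely to show the calculation; they are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: more mobile homes paired with a lower infected percentage.
mobile_homes = [100, 500, 1000, 2000, 4000]
pct_infected = [1.2, 1.0, 0.7, 0.5, 0.2]

r = pearson_r(mobile_homes, pct_infected)
print(f"r = {r:.2f}")  # negative here, since the made-up trend is downward
```

A value near -1 or +1 indicates a strong linear relationship and a value near 0 a weak one. Either way, a correlation describes association and does not explain why the pattern exists.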

How can these results be used in the future?

I believe the distance between a county and the nearest major city can help identify vulnerability. Consider cities such as New York City, Chicago, San Francisco, Los Angeles, Seattle, and Jacksonville. Looking at the map below, all of those major cities, and others, are located in counties with a higher percentage of their population infected with COVID-19 than other counties. I believe that counties farther from major cities are less vulnerable to the spread of COVID-19.

Map of counties with 0.2% infected population.
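A distance-to-nearest-major-city feature like the one proposed above could be sketched with the haversine (great-circle) formula. The city coordinates below are approximate, and the function names and the example county location are assumptions for illustration; this is one possible way to build the feature, not part of the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate (latitude, longitude) of the cities named in the post.
MAJOR_CITIES = {
    "New York City": (40.7128, -74.0060),
    "Chicago": (41.8781, -87.6298),
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
    "Seattle": (47.6062, -122.3321),
    "Jacksonville": (30.3322, -81.6557),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_major_city(lat, lon, cities=MAJOR_CITIES):
    """Return (city name, distance in km) for the closest listed city."""
    return min(
        ((name, haversine_km(lat, lon, c_lat, c_lon))
         for name, (c_lat, c_lon) in cities.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical county centroid in southeastern Pennsylvania.
city, distance_km = nearest_major_city(40.0, -75.0)
print(city, round(distance_km))
```

The resulting distance could be added as one more column per county, using each county's centroid as its location.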

Next steps

The next step in my analysis is to use the Descartes Labs Mobility Index to see how changes in population movement affect the spread of COVID-19. I would also like to see whether my model performs just as well on other pandemics. This would test the quality of my data and help predict what the effect of a future pandemic would look like.

Thank you for reading my post, and please check out my full analysis on my GitHub. Please leave a comment with suggestions and thoughts.
