How does Covid-19 affect socially vulnerable populations in the US?

Maximilian
May 3, 2020

The entire world is suffering from the spread of the Covid-19 pandemic. The USA holds a sad first place, with more than 1.1 million confirmed cases and a death toll of over 66 thousand (as of May 3).

Intrigued by Kaggle's Uncover Covid-19 Challenge, I started to wonder what role social vulnerability plays in Covid-19 infections and deaths.

I. Introduction and research questions

The US Centers for Disease Control and Prevention (CDC) provides a Social Vulnerability Index (SVI) that aims to help communities prepare their populations for disasters such as Covid-19. The SVI includes many indicators, such as Poverty, Age Over 65, Minority, Speaks English “less than well”, and No High School Diploma.

Combining the information of Covid-19 confirmed cases and deaths in the USA with the SVI, I looked for answers to the following questions:

  1. Which US counties are most affected by Covid-19 regarding infections and deaths?
  2. Is there a correlation between specific social vulnerability indicators and Covid-19 cases as well as deaths?
  3. Is it possible to build a simple Linear Regression Model that predicts Covid-19 cases and deaths based on specific social vulnerability indicators?

II. Which data entered my analysis?

The data about Covid-19 cases and deaths can be retrieved from the Kaggle challenge mentioned above (the original source is USAFacts), and the SVI data from the SVI website. My findings are based on the following datasets:

  • CSV with confirmed Covid-19 cases in the US by state and county as of April 8
  • CSV with confirmed Covid-19 deaths in the US by state and county as of April 8
  • CSV with SVI for the US by state and county from 2018

All datasets can be found, together with my code, on GitHub.
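Joining the case counts to the SVI table could look roughly like the sketch below. It is an illustration with made-up numbers, not the code from the repository. The key columns `countyFIPS` and `FIPS` and the date column `4/8/20` are assumptions about the file layouts, so adjust them to the actual CSV headers.

```python
import pandas as pd

# Toy stand-in for the cases CSV (all counts are made up).
cases = pd.DataFrame({
    "countyFIPS": [0, 36081, 36047, 17031],
    "County Name": ["Statewide Unallocated", "Queens County",
                    "Kings County", "Cook County"],
    "State": ["NY", "NY", "NY", "IL"],
    "4/8/20": [5, 100, 90, 50],
})

# Toy stand-in for the SVI CSV (all percentages are made up).
svi = pd.DataFrame({
    "FIPS": [36081, 36047, 17031],
    "EP_POV": [12.0, 20.0, 15.0],
    "EP_CROWD": [9.0, 10.0, 4.0],
})

# An inner join keeps only counties present in both tables, so rows
# without a real county code (FIPS 0 here) drop out.
merged = cases.merge(
    svi,
    left_on="countyFIPS",
    right_on="FIPS",
    how="inner",
    validate="one_to_one",  # fail loudly if a county appears twice
)
print(merged[["County Name", "4/8/20", "EP_POV", "EP_CROWD"]])
```

Joining on a numeric county code rather than on county names avoids mismatches caused by spelling differences between the two sources.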

The SVI file has about 124 columns and provides a lot of information. I centered my investigation on the following 14 indicators:

14 selected social vulnerability indicators including explanation.

Some rows in the SVI file contained null values (coded as -999), which I dropped. Every indicator also has its own margin of error (MOE), which I did not take into account in my analysis.
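Dropping the rows that carry the -999 placeholder can be sketched as follows. The data frame is a made-up stand-in, and only two of the indicator columns are shown; this is one possible way to do it, not necessarily how the code in the repository does it.

```python
import pandas as pd

# Two of the indicator columns named in this article; values are made up.
indicators = ["EP_POV", "EP_CROWD"]
svi = pd.DataFrame({
    "FIPS": [1, 2, 3],
    "EP_POV": [12.0, -999.0, 15.0],
    "EP_CROWD": [9.0, 10.0, 4.0],
})

# Mark every row where at least one selected indicator holds the placeholder.
has_placeholder = svi[indicators].eq(-999).any(axis=1)

# Keep only the rows with real values in all selected indicators.
clean = svi[~has_placeholder].reset_index(drop=True)
print(clean)
```

Restricting the check to the selected indicators avoids discarding a county just because an unrelated column is missing.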

III. Which US counties are most affected by Covid-19, regarding infections and deaths?

If you have followed the media at all in the last six weeks, you will know that the State of New York is by far the most affected state in the US, with more than 120k infections.

Map of confirmed cases in US states.

Therefore, looking at the ten most affected US counties, it is no surprise that seven of those belong to this state. The other three counties are from the states of Illinois, Michigan and New Jersey:

The ten US counties with most confirmed cases of Covid-19
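Ranking the counties by their latest cumulative count takes one call in pandas. A minimal sketch with made-up numbers follows; the column names are assumptions, and the toy table keeps the top two rather than the top ten.

```python
import pandas as pd

# Made-up cumulative case counts per county for the latest date.
cases = pd.DataFrame({
    "County Name": ["County A", "County B", "County C", "County D"],
    "State": ["NY", "NY", "IL", "WA"],
    "4/8/20": [120, 300, 80, 210],
})

# nlargest sorts by the given column and keeps the first n rows.
top = cases.nlargest(2, "4/8/20").reset_index(drop=True)
print(top)
```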

The timeline for these ten counties shows the by now well-known exponential growth. Queens reported the most cases, followed by Kings and Nassau:

Timeline for confirmed cases of Covid-19 in the ten most affected US counties.

Regarding the death toll, the US map is very similar to the one shown above. At the county level, however, eight of the ten most affected counties are in the State of New York. Besides the New York counties, King County (Washington) and Bergen County (New Jersey) appear on this list:

Deaths caused by Covid-19 in the ten most affected US counties.

The timeline shows a drastic increase in deaths from March 22 onward in all counties (except King County), with nearly 300 new deaths in Kings County from April 7 to April 8:

Timeline for deaths caused by Covid-19 in the ten most affected US counties as of April 8, 2020.
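The datasets report cumulative counts, so a day-over-day figure such as the number of new deaths between two dates is the difference of consecutive values. A small sketch with made-up numbers for a single county:

```python
import pandas as pd

# Made-up cumulative death counts for one county, indexed by date.
cumulative = pd.Series(
    [10, 25, 45, 80],
    index=pd.to_datetime(
        ["2020-04-05", "2020-04-06", "2020-04-07", "2020-04-08"]
    ),
    name="deaths",
)

# diff() subtracts the previous day's total; the first day has no
# predecessor, so its NaN is dropped before casting back to integers.
daily_new = cumulative.diff().dropna().astype(int)
print(daily_new)
```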

IV. Is there a correlation between specific social vulnerability indicators and Covid-19 cases as well as deaths?

Looking at the correlation between Covid-19 cases and the 14 aforementioned social vulnerability indicators on a nationwide level, we see that the highest correlation is between confirmed cases and the percentage of “housing in structures with 10 or more units” (EP_MUNIT), at 0.382. The next highest correlations with confirmed cases are “households with no vehicle available” (EP_NOVEH), at 0.327, and “speaks English ‘less than well’” (EP_LIMENG), at 0.227.

Correlation between Covid-19 confirmed cases and selected SVI, US wide
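A table like this can be produced by correlating every indicator column with the case count. The sketch below assumes the Pearson coefficient (the pandas default) and uses made-up values for three of the indicators, so its numbers have nothing to do with the results reported in this article.

```python
import pandas as pd

# Made-up merged table: one row per county.
df = pd.DataFrame({
    "confirmed": [5, 20, 40, 90, 150],
    "EP_MUNIT": [1.0, 4.0, 10.0, 22.0, 35.0],
    "EP_NOVEH": [3.0, 2.0, 8.0, 9.0, 20.0],
    "EP_LIMENG": [2.0, 1.0, 3.0, 2.5, 6.0],
})
indicators = ["EP_MUNIT", "EP_NOVEH", "EP_LIMENG"]

# corrwith computes the Pearson correlation of each indicator column
# with the confirmed-cases column; sorting puts the strongest first.
corr = df[indicators].corrwith(df["confirmed"]).sort_values(ascending=False)
print(corr)

# The same method compares two indicators with each other.
pair = df["EP_NOVEH"].corr(df["EP_LIMENG"])
print(pair)
```

Restricting `df` to the ten most affected counties before calling `corrwith` would give the second kind of table discussed in this section.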

Knowing the correlation on a nationwide level, let's see how it changes when looking only at the ten counties with the most confirmed cases:

Correlation between Covid-19 confirmed cases and selected SVI, ten most affected US counties

In the ten most affected counties (Queens County, Kings County etc.), the housing structure and the availability of a vehicle correlate less with confirmed cases than on a nationwide level. This makes sense, considering that in New York City most housing structures probably have more than ten units and owning a car might be less important.

Here, other indicators correlate with the confirmed cases of Covid-19: living in housing units with more people than rooms (EP_CROWD, 0.649) and speaking English “less than well” (EP_LIMENG, 0.603). This sounds plausible, considering how much more easily infections spread in crowded living conditions. It is also interesting that speaking English “less than well” is very highly correlated with living in crowded conditions (0.931), which is why these two indicators appear together in the top positions.

However, taking into account the overall SVI ranking variable (RPL_THEMES), we see that the counties with the most confirmed cases are not the counties with the highest social vulnerability:

SVI overall ranking variable (RPL_THEMES) vs. confirmed cases in the ten most affected US counties.

For example, the Bronx has a very high overall ranking value but fewer confirmed cases than Queens, Kings and Westchester; Westchester's overall ranking value is only half as high.

Now looking at the deaths caused by Covid-19 on a nationwide level and the correlation with the selected social vulnerability indicators, we see the following table:

Correlation between deaths caused by Covid-19 and selected SVI, US wide

Just like with confirmed cases, the main correlation exists with multi-unit housing structures (EP_MUNIT, 0.356) and no vehicle available (EP_NOVEH, 0.350).

When looking at the ten most affected US counties, EP_CROWD (0.872) and EP_LIMENG (0.795) have an impressively high correlation, followed by Minority (EP_MINRTY, 0.650) and No High School Diploma (EP_NOHSDP, 0.650).

Correlation between deaths caused by Covid-19 and selected SVI, ten most affected US counties

While the indicator for Poverty (EP_POV, 0.344) has only a low correlation, both top indicators, EP_CROWD and EP_LIMENG, suggest social poverty, e.g. less access to education and fewer options on the housing market.

The high correlation of EP_LIMENG with deaths caused by Covid-19 may further imply that many Covid-19 patients seek medical help too late, due to language barriers or because of their residency status.

Looking now at the overall ranking variable for the ten most affected counties, we see a different picture than for the confirmed cases:

SVI overall ranking variable (RPL_THEMES) and deaths in the ten most affected US counties.

Here it is clear that the counties with the highest overall ranking variable (RPL_THEMES) suffer the most deaths caused by Covid-19. Wayne County (Michigan) and Cook County (Illinois) also have higher values while suffering fewer deaths caused by Covid-19.

This could also be due to the fact that the counties in the State of New York were the first to be affected by Covid-19, whereas other states and counties were able to take preventive measures.

Furthermore, the situation in Queens and Kings County could be a look into the future of other counties with similar social vulnerability indicators if not enough preventive measures are taken or if they are suspended too early.

V. Is it possible to build a simple Linear Regression Model that predicts Covid-19 cases and deaths based on specific social vulnerability indicators?

Taking into account the social vulnerability indicators, I built some simple multiple linear regression models to see if I could fit a line for confirmed cases and deaths.

For all four models, I dropped RPL_THEMES (without a larger effect). For the model of confirmed cases in the ten most affected counties, I also dropped the Poverty variable (EP_POV), which resulted in a very high score:

Results for different multiple linear regression models
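For readers who want to see what such a model involves, here is a minimal least-squares sketch using NumPy. It assumes that the score is the coefficient of determination R² (which is what scikit-learn's `LinearRegression.score` reports); the data are made up, and the models in the repository may be built differently.

```python
import numpy as np

# Made-up toy data: one row per county, two indicator columns.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y = np.array([9.5, 7.5, 19.0, 18.5, 28.5])  # e.g. confirmed cases

# Prepend a column of ones so the model can learn an intercept.
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise the squared prediction error.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("intercept and coefficients:", coef)
print("R^2:", r2)
```

With only ten counties and more than ten indicators, a least-squares fit can match the training data almost perfectly, so a score computed on such a small group should be read with caution.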

The low score of my model with all counties could be attributed to the fact that the ten most affected counties are outliers in a group that mostly has only a few confirmed cases or deaths.

However, my approach shows that for the ten most affected counties the score improves, as the data could be more comparable within this specific group.

All models and their coefficients can be found on my GitHub.

VI. Conclusions and more questions

As (nearly) always, looking for answers generates more questions. My short analysis of the confirmed cases and deaths by Covid-19 in the USA and the corresponding social vulnerability indicators showed the following results:

  • Specific social vulnerability indicators have a higher correlation with confirmed cases and deaths in the ten most affected counties than on a nationwide level.
  • These indicators are “living in housing units with more people than rooms” and “speaking English ‘less than well’” for the confirmed cases and deaths.
  • Both indicators imply social poverty and marginalization, which may lead people to seek medical help only when it is already too late.

In my opinion, there are a lot more interesting variables that could be taken into account for a deeper analysis. These variables are missing in the SVI:

  • Gender: In what way are women more affected than men? In what way is there an intersection between social vulnerability and gender?
  • Race: The SVI has the variable “minority”. However, it would be interesting to look at a broader database to identify discrimination and the combination of social vulnerability indicators and race.
  • Health insurance: In what way does the lack of health insurance affect the spread of Covid-19?

Furthermore, I have many ideas to improve my code:

  • Take into account the margin of error.
  • See how the results differ if we rely on absolute values and less on the percentage values.
  • Check whether other ML models could improve the prediction for confirmed cases and deaths caused by Covid-19.

In the meantime, thanks for reading.

Maximilian

Working as a Business Development Manager with strong interest in Big Data, Machine Learning, movies and opera