Row houses in Oella, MD, retrieved from Wikimedia Commons

COVID-19: Impact on Housing Security Across the U.S.

Jbochenek · Published in The Startup · Dec 10, 2020 · 14 min read

Housing is essential, but not guaranteed. This has never been more obvious than since the COVID-19 lockdowns cut Americans off from their jobs, and thus their incomes. Without income, paying routine and necessary bills such as food and housing can become a struggle. Housing insecurity is certainly not new in America, but for the first time we have week-by-week data on how it has affected households across the country.

Starting in April, the U.S. Census Bureau began a new project, the Household Pulse Survey, with the goal of measuring the social and economic impacts of COVID-19 on the American populace. Phase one ran from April 23rd to July 21st, and this analysis examines those 12 weeks (the calendar-savvy will notice that this is in fact 13 weeks; more on that below).

The Household Pulse Survey phase one results are available as Public Use Files (PUF), where each row is a response. However, for privacy reasons, the PUF does not include the location indicators we wanted for this analysis. Instead, we used the summarized data, slightly edited to deal with nested headers. The file we used is available here.
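For anyone working from the raw export instead, here is a hedged sketch of how a two-row nested header can be read and flattened with pandas; the file name below is hypothetical, not the file linked above.

import pandas as pd

# Hypothetical file whose header spans the first two rows
nested_file = 'pulse_survey_nested_headers.xlsx'
raw = pd.read_excel(nested_file, header=[0, 1])
# Flatten the resulting MultiIndex columns into single strings
raw.columns = [' '.join(str(part) for part in col).strip() for col in raw.columns]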

For this project, we worked in Google Colab for easier code sharing across the team. First we imported the necessary packages.

from google.colab import drive
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly
from sklearn import preprocessing
from urllib.request import urlopen
import json
# This will prompt for authorization.
drive.mount('/content/drive')

Then we imported the data:

Household = '/content/drive/My Drive/Data/Housing/Household Pulse Survey/phase-one-household-pulse-survey-tool overall.xlsx'
Phase1 = pd.read_excel(Household, sheet_name='Data')

The data has three different levels of location: nationwide, state level, and the top 15 largest metro areas. It was important to separate these out, as we wanted to make comparisons within these location groups, not between them. We grabbed only the rows we wanted into three different datasets:

State = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware',
         'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
         'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
         'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey',
         'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
         'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah',
         'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
US = ['United States']
Metros = ['Atlanta-Sandy Springs-Alpharetta, GA Metro Area', 'Boston-Cambridge-Newton, MA-NH Metro Area',
          'Chicago-Naperville-Elgin, IL-IN-WI Metro Area', 'Dallas-Fort Worth-Arlington, TX Metro Area',
          'Detroit-Warren-Dearborn, MI Metro Area', 'Houston-The Woodlands-Sugar Land, TX Metro Area',
          'Los Angeles-Long Beach-Anaheim, CA Metro Area', 'Miami-Fort Lauderdale-Pompano Beach, FL Metro Area',
          'New York-Newark-Jersey City, NY-NJ-PA Metro Area', 'Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area',
          'Phoenix-Mesa-Chandler, AZ Metro Area', 'Riverside-San Bernardino-Ontario, CA Metro Area',
          'San Francisco-Oakland-Berkeley, CA Metro Area', 'Seattle-Tacoma-Bellevue, WA Metro Area',
          'Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area']
StatesP1 = Phase1[Phase1['Geography (State or Metropolitan Area)'].isin(State)]
USP1 = Phase1[Phase1['Geography (State or Metropolitan Area)'].isin(US)]
MetroP1 = Phase1[Phase1['Geography (State or Metropolitan Area)'].isin(Metros)]

It soon became obvious that 50 states (plus the District of Columbia) was a lot to handle in a single visualization, so we added another level to the state data: Divisions. The U.S. Census Bureau defines several geographic levels; one of the most familiar is Regions (Midwest, Northeast, South, and West). Divisions split those regions into smaller groups. Figure 1 below shows the breakdown of Regions into Divisions.

Figure 1. Regions and Divisions of the United States

We used a dictionary to add that to the data. I'm including it here so that, hopefully, no one else has to write this code again.

Divisions = {'Alabama': 'East South Central',
             'Alaska': 'Pacific',
             'Arizona': 'Mountain',
             'Arkansas': 'West South Central',
             'California': 'Pacific',
             'Colorado': 'Mountain',
             'Connecticut': 'New England',
             'Delaware': 'South Atlantic',
             'District of Columbia': 'South Atlantic',
             'Florida': 'South Atlantic',
             'Georgia': 'South Atlantic',
             'Hawaii': 'Pacific',
             'Idaho': 'Mountain',
             'Illinois': 'East North Central',
             'Indiana': 'East North Central',
             'Iowa': 'West North Central',
             'Kansas': 'West North Central',
             'Kentucky': 'East South Central',
             'Louisiana': 'West South Central',
             'Maine': 'New England',
             'Maryland': 'South Atlantic',
             'Massachusetts': 'New England',
             'Michigan': 'East North Central',
             'Minnesota': 'West North Central',
             'Mississippi': 'East South Central',
             'Missouri': 'West North Central',
             'Montana': 'Mountain',
             'Nebraska': 'West North Central',
             'Nevada': 'Mountain',
             'New Hampshire': 'New England',
             'New Jersey': 'Middle Atlantic',
             'New Mexico': 'Mountain',
             'New York': 'Middle Atlantic',
             'North Carolina': 'South Atlantic',
             'North Dakota': 'West North Central',
             'Ohio': 'East North Central',
             'Oklahoma': 'West South Central',
             'Oregon': 'Pacific',
             'Pennsylvania': 'Middle Atlantic',
             'Rhode Island': 'New England',
             'South Carolina': 'South Atlantic',
             'South Dakota': 'West North Central',
             'Tennessee': 'East South Central',
             'Texas': 'West South Central',
             'Utah': 'Mountain',
             'Vermont': 'New England',
             'Virginia': 'South Atlantic',
             'Washington': 'Pacific',
             'West Virginia': 'South Atlantic',
             'Wisconsin': 'East North Central',
             'Wyoming': 'Mountain'}
StatesP1['State'] = StatesP1['Geography (State or Metropolitan Area)'].astype('category')
StatesP1['Division'] = StatesP1['State'].map(Divisions)

We needed to do some exploratory data analysis to determine the quality of the data and any adjustments that would need to be made.

sns.displot(StatesP1, x="Housing Insecurity Percent", element="step", col="Division", col_wrap=3)
g = sns.boxplot(x="Division", y="Housing Insecurity Percent",
                # hue="Selected Horizontal Dimension",
                data=StatesP1, palette="Set3")
g.set(xlabel='Division', ylabel='Housing Insecurity (%)')
g.set_xticklabels(g.get_xticklabels(), rotation=45, ha="right")
Figure 2. Histogram of Housing Insecurity Percent from the Household Pulse Survey by Census division from April 2020 — July 2020
Figure 3. Boxplot of Housing Insecurity Percent from the Household Pulse Survey by Census division from April 2020 — July 2020

Overall, we were very pleased with the distribution of the data: the histograms look relatively normal, and the boxplots show only one true outlier. For the purposes of this analysis, we kept that outlier because it carried important trend information.

We also wanted a first look at the actual question of interest: how did housing insecurity change over the 12-week period across the US?

g = sns.relplot(kind='line', data=StatesP1, y='Housing Insecurity Percent', x='Week Number')
g = sns.relplot(kind='line', col='Division', col_wrap=5,
                col_order=['Pacific', 'West North Central', 'East North Central', 'Middle Atlantic', 'New England',
                           'Mountain', 'West South Central', 'East South Central', 'South Atlantic'],
                data=StatesP1, y='Housing Insecurity Percent', x='Week Number')
Figure 4. Housing insecurity as a percent of households across the US over 12 weeks from April — July 2020
Figure 5. Housing insecurity as a percent of households across the US over 12 weeks from April — July 2020 by division and organized roughly into geographic shape of the US

This is a good time to explain the week 1 data. Week 1 actually encompasses two weeks: the response rate to the first week of survey deployment was low, which necessitated a second week of data collection, so week 1 runs from April 23rd to May 5th. In phases 2 and 3, data collection moves to a two-week cadence instead of weekly.

In the graphs above (figures 4 and 5), we can see that the impact changed over the weeks, and figure 5 shows that it changed differently depending on the division. It is important to keep in mind that data collection started at the end of April, so there are social and economic impacts of COVID-19 that predate this data. The week 1 percentages are not a pre-COVID-19 baseline (in fact, one major limitation is that there is no pre-COVID-19 data point available for such comparisons).

Another way to visualize this is to plot it on a map of the United States, with time as the animation dimension. Plotly choropleth maps require FIPS codes or state abbreviations rather than state names, so we added those first (not shown here).
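The abbreviation step is a simple lookup; here is a hedged sketch of what it might look like (only a handful of entries shown, and the new column name 'state' matches the one used in the plotting code below).

# Map full state names to two-letter postal abbreviations
# (a few entries shown; the real dictionary covers all 50 states plus DC).
state_abbrev = {'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ',
                'Wisconsin': 'WI', 'Wyoming': 'WY'}
StatesP1['state'] = StatesP1['State'].map(state_abbrev)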

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
fig = px.choropleth(StatesP1,
                    locationmode="USA-states",
                    locations="state",
                    scope="usa",
                    color="Housing Insecurity Percent",
                    color_continuous_scale="Viridis",
                    range_color=(0, 40),
                    hover_name="State",
                    animation_frame="Week Number",
                    category_orders={"Week Number": [1,2,3,4,5,6,7,8,9,10,11,12]})
fig.show()

Using the open-source ScreenToGif software, we screen-captured the resulting animation to include it in reports.
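If an interactive version is preferred over a GIF, a Plotly figure can also be saved as a standalone HTML file; the output path below is only an example.

# Save the animated choropleth as standalone interactive HTML (example path; adjust to your own Drive folder)
fig.write_html('/content/drive/My Drive/Data/Housing/housing_insecurity_weekly.html')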

Figure 6. Weekly housing insecurity across the US by percent of households from April — July, 2020

Looking at the image above, there does not seem to be a clear trend; the increases and decreases look like random noise at this point. The next step was to add the COVID-19 data to the mix to see whether it would add helpful information.

We used the New York Times' COVID-19 dataset (available here) and added population figures to compute percentage-based infection rates. The altered dataset can be found here.

covid19 = '/content/drive/MyDrive/Data/COVID-19/New York Times/all-states-history.csv'
COVID19 = pd.read_csv(covid19)

We then mapped both Regions and Divisions onto the data, and adjusted positive cases, new positive cases, deaths, and hospitalizations by population, as shown in the snippets below.
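The Regions lookup itself is not shown in the original notebook; here is a hedged sketch of what it could look like, assuming the COVID-19 file identifies states by their two-letter postal abbreviations (only a few entries shown).

# Census Region lookup keyed by state abbreviation
# (a few entries shown; the full dictionary covers all 50 states plus DC).
Regions = {'AL': 'South', 'AK': 'West', 'AZ': 'West',
           'IL': 'Midwest', 'NY': 'Northeast', 'WY': 'West'}

With the geography lookups in place, the population adjustment follows: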

COVID19['Positive Increase %'] = (COVID19['positiveIncrease'] / COVID19['Population']) * 100
COVID19['Positive %'] = (COVID19['positive'] / COVID19['Population']) * 100
COVID19['Death %'] = (COVID19['death'] / COVID19['Population']) * 100
COVID19['Hospitalized %'] = (COVID19['hospitalized'] / COVID19['Population']) * 100

The next step was to change the day-level COVID-19 data to week-level data to match the Housing Insecurity data. For reasons known only to the U.S. Census Bureau, the survey weeks do not follow calendar weeks, so we binned the daily data into weeks running Thursday through the following Wednesday to line up with the survey's week numbers. Luckily, pandas' PeriodIndex can anchor each weekly period to end on a specific weekday.

COVID19.date = pd.to_datetime(COVID19['date'], errors='coerce')
COVID19['Week'] = pd.PeriodIndex(COVID19.date, freq="W-WED")
# Both of the first two seven-day bins fall inside survey week 1 (April 23rd to May 5th)
Week_No = {'2020-04-23/2020-04-29': 1,
           '2020-04-30/2020-05-06': 1,
           '2020-05-07/2020-05-13': 2,
           '2020-05-14/2020-05-20': 3,
           '2020-05-21/2020-05-27': 4,
           '2020-05-28/2020-06-03': 5,
           '2020-06-04/2020-06-10': 6,
           '2020-06-11/2020-06-17': 7,
           '2020-06-18/2020-06-24': 8,
           '2020-06-25/2020-07-01': 9,
           '2020-07-02/2020-07-08': 10,
           '2020-07-09/2020-07-15': 11,
           '2020-07-16/2020-07-22': 12}

Since the data is now weekly, we needed to adjust the cumulative infection rate so that summing seven daily values does not inflate it; dividing the weekly sum by 7 gives a 7-day average.

covidTemp = COVID19.groupby(['state', 'Week'])[['Positive %', 'Positive Increase %']].sum().reset_index()
covidTemp['Cumulative Positive 7 Day Ave (%)'] = covidTemp['Positive %'] / 7
covidTemp['Weeks'] = covidTemp['Week'].apply(str)
covidTemp['Regions'] = covidTemp['state'].map(Regions)
covidTemp['Week Number'] = covidTemp['Weeks'].map(Week_No)

Mapping COVID-19 infection rates across the US looks something like this (the week of 11/12-11/18 was removed because it was not a complete week). Both graphs were made with the following code, adjusted only for the target variable.

covidTemp['Weeks'] = covidTemp['Week'].apply(str)
covidTemp = covidTemp[(covidTemp['Weeks'] != '2020-11-12/2020-11-18')]
fig2 = px.choropleth(covidTemp,
locationmode ="USA-states",
locations="state",
scope="usa",
color='Cumulative Positive 7 Day Ave (%)',
color_continuous_scale="Viridis",
range_color=(0, 8),
hover_name="state",
animation_frame="Weeks",
category_orders={'Weeks':['2020-01-16/2020-01-22', '2020-01-23/2020-01-29', '2020-01-30/2020-02-05', '2020-02-06/2020-02-12', '2020-02-13/2020-02-19','2020-02-20/2020-02-26', '2020-02-27/2020-03-04', '2020-03-05/2020-03-11', '2020-03-12/2020-03-18', '2020-03-19/2020-03-25', '2020-03-26/2020-04-01','2020-04-02/2020-04-08', '2020-04-09/2020-04-15', '2020-04-16/2020-04-22', '2020-04-23/2020-04-29', '2020-04-30/2020-05-06', '2020-05-07/2020-05-13','2020-05-14/2020-05-20', '2020-05-21/2020-05-27', '2020-05-28/2020-06-03', '2020-06-04/2020-06-10', '2020-06-11/2020-06-17', '2020-06-18/2020-06-24','2020-06-25/2020-07-01', '2020-07-02/2020-07-08', '2020-07-09/2020-07-15', '2020-07-16/2020-07-22', '2020-07-23/2020-07-29', '2020-07-30/2020-08-05','2020-08-06/2020-08-12', '2020-08-13/2020-08-19', '2020-08-20/2020-08-26', '2020-08-27/2020-09-02', '2020-09-03/2020-09-09', '2020-09-10/2020-09-16','2020-09-17/2020-09-23', '2020-09-24/2020-09-30', '2020-10-01/2020-10-07', '2020-10-08/2020-10-14', '2020-10-15/2020-10-21', '2020-10-22/2020-10-28','2020-10-29/2020-11-04', '2020-11-05/2020-11-11', '2020-11-12/2020-11-18']})
fig2.show()
Figure 7. Cumulative 7 day average cases of COVID-19 by population across the US
Figure 8. Cumulative positive increase of COVID-19 cases adjusted for population

Although not particularly germane to the overall topic, we found the graphs above really helpful for getting a fuller understanding of the spread of COVID-19 across the US over time. The jump from its first publicized cases in NY, WA, and FL to sudden spread across the country in a matter of weeks is very stark in this visualization.

At this point we also started identifying important variables and merged the housing data with the COVID-19 infection rates. Below are the results of a detailed EDA of the four target variables: housing insecurity, income insecurity, the COVID-19 cumulative 7-day infection rate, and the increase in COVID-19 infections.

DF = pd.merge(covidTemp, StatesP1, on=['Week Number', 'state'], how='inner')
DF2 = DF[['Experienced Loss of Employment Income Percent', 'Housing Insecurity Percent',
          'Positive Increase %', 'Cumulative Positive 7 Day Ave (%)']]
# The hue needs the Division column, so pass it alongside the four target variables
g = sns.pairplot(DF2.join(DF['Division']), hue="Division", height=5)
g.map_lower(sns.kdeplot, levels=4, color=".2")
Figure 9. Detailed EDA of the combined Housing Insecurity + COVID-19 data; the variables are Experienced Loss of Employment Income Percent, Housing Insecurity Percent, Positive Increase %, and Cumulative Positive 7 Day Ave (%), separated by Division.

The graph below helps show that housing insecurity is higher where loss of income is higher and as COVID-19 cases rise. States are sorted from highest to lowest by the average of all four variables.

target_cols = ['Experienced Loss of Employment Income Percent', 'Housing Insecurity Percent',
               'Positive Increase %', 'Cumulative Positive 7 Day Ave (%)']
DF3 = DF.groupby(['state'])[target_cols].sum().reset_index()
DF3[target_cols] = DF3[target_cols] / 12
DF3['Average'] = (DF3['Experienced Loss of Employment Income Percent'] + DF3['Housing Insecurity Percent']
                  + DF3['Positive Increase %'] + DF3['Cumulative Positive 7 Day Ave (%)']) / 4
sns.set_theme(style="whitegrid")
g = sns.PairGrid(DF3.sort_values("Average", ascending=False),
                 x_vars=DF3.columns[1:5], y_vars=["state"], height=10, aspect=.5)
g.map(sns.stripplot, size=10, orient="h", palette="flare_r", linewidth=1, edgecolor="w")
g.set(xlabel="", ylabel="")
titles = ['Experienced Loss of Employment Income Percent', 'Housing Insecurity Percent',
          'Positive Increase %', 'Cumulative Positive 7 Day Average %']
for ax, title in zip(g.axes.flat, titles):
    ax.set(title=title)
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
Figure 10. Graphical representation of the relationship between the selected variables from the Household Pulse Survey and the New York Times' COVID-19 data by state.

Our goal was to understand how COVID-19 has affected housing, and we already have some good evidence that as COVID-19 cases increase, so do housing and income insecurity. A more concrete way to quantify this is with correlations.

We looked at correlations between the Household Pulse Survey's housing insecurity measure, its loss-of-income-due-to-COVID measure, and the percent cumulative cases (7-day average) and percent increase calculated from the New York Times' COVID-19 data at the state level. Below are the division-level correlations, which show anywhere from a strong (.8) to a non-existent (near .0) relationship between housing insecurity and positive cases (as a percent of population). The strength of the relationship thus varies by region/division, but this is evidence enough that these variables (loss of income, housing insecurity, and COVID-19 cases) are potentially related.

correlations = DF.groupby(['Division'])[['Experienced Loss of Employment Income Percent', 'Housing Insecurity Percent',
                                         'Positive Increase %', 'Cumulative Positive 7 Day Ave (%)']].corr().unstack().unstack().reset_index()

def facet_heatmap(data, color, **kws):
    data = data.pivot(index="level_0", columns='level_1', values=0)
    sns.heatmap(data, cmap='Blues', **kws, annot=True)

with sns.plotting_context(font_scale=5.5):
    g = sns.FacetGrid(correlations, col="Division", col_wrap=3, height=4, aspect=1)
    cbar_ax = g.fig.add_axes([.92, .3, .02, .4])
    g = g.map_dataframe(facet_heatmap, cbar_ax=cbar_ax, vmin=0, vmax=1)
    g.set_titles(col_template="{col_name}", fontweight='bold', fontsize=18)
    g.fig.subplots_adjust(right=.9)
Figure 11. Correlations between the selected variables from the Household Pulse Survey and the New York Times' COVID-19 data by Division

In order to do clustering, we had to determine how many clusters to use. One method is looking for an 'elbow' in the curve of the sum of squared errors (SSE), as shown below. Because the elbow can be hard to pin down, another method is a silhouette plot, which measures how cohesive and well separated the clusters are; the highest point on that plot is the recommended number of clusters. With the unstandardized (non-normalized) data, this turned out to be two clusters.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumption: X holds the four target variables from DF2; the original post does not show this step.
X = DF2.values

kmeans_kwargs = {"init": "random", "n_init": 10, "max_iter": 300, "random_state": 42}
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()

silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(X)
    score = silhouette_score(X, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
Figure 12. Sum of Squared Errors plot of selected variables from non-normalized data
Figure 13. Silhouette plot of selected variables from non-normalized data

To explain simply, K-Means clustering is an unsupervised learning algorithm used to solve clustering problems. It follows a simple procedure for partitioning a given data set into a number of clusters, k, which is fixed beforehand.

The process is roughly as follows (a small sketch of these steps appears after the list):

1. Determine desired number of clusters (k)

2. Randomly seed the n-dimensional space occupied by the data with points called 'centroids'; these are the centers of the clusters. They can be placed randomly because they are going to move anyway.

3. Find the nearest data to each centroid.

4. Once the data is assigned to a centroid, find the mean of that cluster (the ‘center’) and move the centroid to that position.

5. Repeat steps 3 and 4 until data points are no longer switching clusters (i.e., the centroids stop moving between one repetition of steps 3 and 4 and the next)

6. Profit.
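To make those steps concrete, here is a minimal from-scratch sketch in NumPy. It is purely illustrative (the function name and loop structure are ours, not from the original analysis); in practice we used scikit-learn's KMeans, shown next.

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: seed k centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids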

kmeansall = DF2.copy()  # copy so the cluster labels do not get written back into DF2
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
kmeans.fit(kmeansall)
alllabels = kmeans.predict(kmeansall)
centroids = kmeans.cluster_centers_
kmeansall['labels'] = pd.Series(alllabels, index=kmeansall.index)
kmeansall['state'] = pd.Series(DF['state'], index=kmeansall.index)
kmeansall['Week Number'] = pd.Series(DF['Week Number'], index=kmeansall.index)

Using two clusters, the data split into roughly low/high groups: state-weeks with high housing and income insecurity and high or rising COVID-19 cases fell into Cluster 2, while Cluster 1 (dark purple below) contained areas of the US that experienced more stability. However, this model gave more weight to housing insecurity and income insecurity simply because their percentages are larger (in the 20s-40s, versus the 0-8% range of the COVID-19 infection rates); a toy illustration of this scale effect follows Figure 14.

fig3 = px.choropleth(kmeansall,
locationmode ="USA-states",
locations="state",
scope="usa",
color='labels',
color_continuous_scale="Viridis",
hover_name="state",
animation_frame="Week Number",
category_orders={"Week Number": [1,2,3,4,5,6,7,8,9,10,11,12]})
fig3.show()
Figure 14. Resultant K-Means clusters when k=2 for selected variables on non-normalized data
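Here is a quick toy illustration of that scale effect; the numbers below are made up for illustration, not survey values. The Euclidean distance between two observations is driven almost entirely by the variable measured in tens of percentage points.

import numpy as np

# Two made-up observations: [housing insecurity %, COVID-19 cumulative positive %]
a = np.array([25.0, 1.0])
b = np.array([40.0, 7.0])

diff = a - b
print(np.linalg.norm(diff))           # ~16.2 overall distance
print(diff ** 2 / np.sum(diff ** 2))  # ~[0.86, 0.14]: the housing gap contributes ~86% of the squared distance

Min-max scaling puts every variable on a 0-1 range so that no single variable dominates the distance calculation.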

Thus we also wanted to look at the data after min-max standardizing it. First we standardized the data:

min_max_scaler = preprocessing.MinMaxScaler()
temp1 = DF2.values
temp1_scaled = min_max_scaler.fit_transform(temp1)
DF2minmax = pd.DataFrame(temp1_scaled, columns=DF2.columns, index=DF2.index)

Using the same code as before on the standardized data, the sum of squared errors plot looks similar, but the silhouette plot shows that while 2 is still the best number of clusters, 3 is a close second. Thus we ran two different models, one with 2 clusters and one with 3.

Figure 15. Sum of Squared Errors plot of selected variables from min-max normalized data
Figure 16. Silhouette plot of selected variables from min-max normalized data

In the cluster analysis with k=2, cluster 1 (dark purple) contains states with low COVID-19 numbers and lower housing insecurity rates. Cluster 2 (yellow) is the exact opposite: housing insecurity percentages are higher and COVID-19 numbers are increasing (in proportion to the total population).

Figure 17. Resultant K-Means clusters when k=2 for selected variables on min-max normalized data

The second graph shows the cluster analysis with k=3, where cluster 1 (dark purple) again contains states with low COVID-19 numbers and lower housing insecurity rates. Cluster 3 (yellow) is the exact opposite of cluster 1, while cluster 2 (green) falls in the middle.

Figure 18. Resultant K-Means clusters when k=3 for selected variables on min-max normalized data

We believe that k=3 is more valuable than k=2 because it captures the edge cases.

In conclusion, there is a weak to strong relationship between COVID-19 infection rates and housing insecurity, depending on the region. States can be grouped into two clusters (high/low) or three (high/medium/low) on these variables, with states' membership in each group changing over time. We felt that three groups was more reasonable because it accounted for edge cases, and three clusters was also acceptable according to the silhouette plot. The impact of COVID-19 extends beyond health and is felt across America at both a personal and an economic level.

Work on this project was done by Jennifer Bochenek and Joe Larson, as part of a Capstone project for Drexel’s Master of Science in Data Science.
