Using the Computational Tools of R to Map the County by County Spread of COVID-19 in the United States

Published in Analytics Vidhya | May 1, 2020

David Akinyemi, Nicholas Beati, and Margaret Smith | Bates College

“What are the patterns across counties in the US regarding the spread of coronavirus cases since January 2020?”

COVID-19, the disease caused by the novel coronavirus, has taken the world by storm and, for many people, changed life as we know it. It appeared abruptly and has forced many changes in our daily lives: schools have shut down, companies have closed their doors, and social distancing has begun. These precautionary measures are meant to slow the spread of the virus and give scientists more time to find a cure and end the pandemic. The virus spreads quickly and case counts have been growing rapidly, especially in the United States, which currently leads the world in both confirmed cases and deaths. As a group, we would like to use the computational tools of R and our acquired knowledge of computational networks to simulate the progression of the spread of COVID-19 in the U.S.

We looked at how the virus could have traveled through neighboring counties. We wanted to analyze potential patterns across U.S. counties and identify what could be examined further regarding the number of coronavirus cases, starting in January 2020. Our plan, as demonstrated below, is to simulate the spread of COVID-19 cases over consecutive two-week periods. We chose R to facilitate this process and to represent network structure quantitatively, for example by producing a plot for each time slice of the virus’s movements and calculating the centrality of counties in each time slice. This analysis allowed us to practice connecting theory to data and deriving insights about the world around us from professionally collected data. We use two datasets: county-level infection data from The New York Times and a county adjacency dataset for the United States.

https://www.kansashealthsystem.com/covid19-update

BACKGROUND

The tragic events that consume our current day all lead back to the spread of the COVID-19 pandemic. For over a month now, people have been avoiding travel, avoiding socialization, wearing masks and gloves, working from home, getting takeout instead of going to restaurants, and ordering everything possible from Amazon, all in order to protect themselves and others from this virus. Few expected such rapid spread when it first circulated in Wuhan, China. Unfortunately, it has been swiftly making its way across the world, infecting millions. The questions that many people likely ponder on a daily basis include: why is it traveling so fast, and what makes it spread?

To start, as a New York Times article states, “analysis might determine that, say, a 12-year-old boy living in central Redmond, Wash., near Seattle, will come into regular contact with his parents, his sister, and an average of 20.5 fellow students at his local middle school” (source: nytimes.com). Considering analyses such as this one, which demonstrate just how many people an average person in the U.S. comes in contact with, governments across the U.S. are working hard to implement protocols that limit human contact, reduce the number of cases, and ultimately “flatten the curve”. Some of these protocols include shelter-in-place orders, curfews, cancellation of crowded events, and a six-foot distance guideline for interacting with people.

Despite this, essential businesses must continue to operate in order to fulfill people’s needs. Strict social distancing regulations do not support normal life for citizens or a healthy economy. Additional research on patterns and trends is vital to finding a possible cure and/or vaccine. Although the downsides of the virus’s spread are tremendous, we have the potential to learn a lot from the data on that spread. For example, Linked: How Everything is Connected to Everything Else and What It Means for Business, Science, and Everyday Life, by Albert-László Barabási, discusses how data from epidemics such as AIDS helped make important studies possible, which motivated our research. There is evidence that the virus more seriously affects older people and people with underlying health conditions, but it has also been fatal to people with neither of these characteristics, making it clear that no one, however young or healthy, is safe from getting it and being severely affected by it.

Treatment Measures and Controlling the Spread

At the top of every governor’s list today are the preventative measures necessary to control the spread of the current global pandemic. This year isn’t the only concern, however. Contact tracing and increased testing could be vital in reducing the spread of a possible second wave next winter. Even though contact tracing might not be very effective among young people, who are frequently asymptomatic, recent research has suggested it can make a significant difference for people inside healthcare facilities and nursing homes, two large areas of concern. The biggest challenges in deploying large numbers of contact tracers are funding and scale. On Wednesday, Gov. Andrew Cuomo announced that Mike Bloomberg, the billionaire philanthropist and former mayor of New York City, will help the state develop and implement an aggressive program to test for COVID-19 and trace people who have had contact with infected individuals.

After weeks of social distancing, millions of Americans are extremely antsy to leave their homes and see businesses and schools reopen. In order to do that as safely as possible, systems need to be put in place to identify people who may have been exposed to the virus and support them as they isolate. The Bloomberg School of Public Health at Johns Hopkins will “build an online curriculum and training program for contact tracers,” Bloomberg Philanthropies said in a release, and the organization plans on partnering with the New York Department of Health to recruit “contact tracer candidates” from a variety of state agencies, counties and public universities (source: CNBC). Contact tracing has already been successful in helping mitigate the spread of COVID-19. Several countries, such as Germany, Singapore and South Korea, have used contact tracing effectively, and subsequently have been able to re-open for business while experiencing fewer deaths and lower rates of infection. It is important to note that the results we generate only suggest possible examples of contagion; much remains unknown in this unprecedented situation.

Data Interpretation & Assumptions

We had to make decisions and assumptions regarding our data in order to properly run the analysis that we wanted. The first assumption is that the dataset reports the accurate number of cases for these counties in the U.S. Because testing has been uneven, the number of confirmed cases is difficult to validate. The second assumption concerns reporting: because not every county reports every day, we merged our time frame into 14-day increments, which generally works but caused us to miss some detail in the data. Also, in order to limit extraneous variables, we tracked only domestic movement between adjacent counties. This limited us to the forty-eight contiguous states, leaving out Alaska and Hawaii.

CODE METHODOLOGY: Task Completion, Execution & Professional Practice

The original two datasets were complicated for many reasons, but one crucial aspect to highlight is that a new row is added every time a county reports, even if it is not reporting new cases. This meant that we had to focus on showing how the network changed over time, since static snapshots could be very misleading. This computational process was broken into four critical steps. Each step let us apply different tools to show how the cases have grown over time and put our dataset in the best position to produce informative visuals.

STEP 1

The goal of step one was to produce a data frame containing the total count for each reporting county at the end of each week. This started with cleaning the dataset and working exclusively with the provided dates, the county-identifying Federal Information Processing Standard (FIPS) codes, and the number of cases. Subsequently, we rounded every date up to the nearest Sunday in order to get a holistic weekly view. Keeping only the unique end-of-week dates, we converted them to week numbers. We then subsetted the data frame week by week and grouped each week’s subset by county. This let us store the highest count for each county in each week, thus converting the New York Times COVID data into a data frame with one row per county per week containing the end count.

library(plyr)  # provides rbind.fill()

# Create an empty df to hold the highest count for each county for each week
highest_count <- data.frame()

# Write a for-loop that does the following:
# 1) Subset COVID df into weeks.
# 2) Subset each week subset df into counties
# 3) Keep only the highest count for each county for each week
# 4) Store the resulting df, which should have just the end count for every week for every county, in a new variable
for (w in seq_along(weeks$weeks)) {
    Working_week <- subset(COVID, COVID$week_number == w)
    Working_week <- unique(Working_week)
    # aggregate() names its output columns Group.1 (the fips code) and x (the max count)
    Working_week <- aggregate(Working_week$cases, by = list(Working_week$fips), max)
    Working_week$week_number <- w
    highest_count <- rbind.fill(highest_count, as.data.frame(Working_week))
}
names(highest_count)[1] <- "fips"
names(highest_count)[2] <- "cases"
highest_count <- as.data.frame(highest_count)
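
The listing above starts from a week_number column, so the date handling described in the text is not shown. A minimal base R sketch of one way to do it follows. The toy dates and the names weekday, week_end and week_number are ours for illustration, and this is not necessarily how the original script does it.

```r
# Toy report dates (hypothetical); the real data would use the NYT date column.
dates <- as.Date(c("2020-01-21", "2020-01-26", "2020-01-27", "2020-02-02"))

# Day of the week as a number: 0 = Sunday, ..., 6 = Saturday.
weekday <- as.integer(format(dates, "%w"))

# Move each date forward to the next Sunday; Sundays stay where they are.
week_end <- dates + (7 - weekday) %% 7

# Number the distinct week-ending Sundays in chronological order.
week_number <- match(week_end, sort(unique(week_end)))

data.frame(date = dates, week_end = week_end, week_number = week_number)
```

Note that this numbers only the Sundays that actually occur in the data, so a week with no reports at all would not get its own number.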

STEP 2

In order to track the spread of COVID-19 cases from one county to another, we imported a second dataset covering all the counties in the U.S. Each county is labeled by its identification code and paired with each of its neighbors. The biggest challenge in this step was handling counties that were not included in both datasets. Using R’s %in% operator, we flagged the counties that appeared in both datasets as TRUE and used only those moving forward.

## STEP 2: Creates a data frame containing neighboring county

# Read in neighboring county data
NEIGHBORS <- read.csv("neighborcounties.csv")

# Drop all neighbor pairs where either county of the pair does not appear in the COVID data
NEIGHBORS <- NEIGHBORS[NEIGHBORS$orgfips %in% COVID$fips,]
NEIGHBORS <- NEIGHBORS[NEIGHBORS$adjfips %in% COVID$fips,]
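
To make the %in% filter concrete, here is a small self-contained sketch on toy data. The FIPS values and the names covid_fips, neighbors and neighbors_kept are hypothetical stand-ins for COVID$fips and NEIGHBORS.

```r
# Toy stand-ins (hypothetical values) for the two datasets.
covid_fips <- c(1001, 1003, 1005)
neighbors <- data.frame(
  orgfips = c(1001, 1001, 1003, 1009),
  adjfips = c(1003, 1007, 1005, 1001)
)

# %in% returns one TRUE/FALSE per element of the left-hand vector.
# A pair is kept only if both of its counties appear in the COVID data.
keep <- neighbors$orgfips %in% covid_fips & neighbors$adjfips %in% covid_fips

neighbors_kept <- neighbors[keep, ]
neighbors_kept
```

Combining the two conditions with & gives the same rows as applying the two filters one after the other, as the listing above does.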

STEP 3

Here, we analyzed the rise in cases among all counties. We generated a matrix with 14 columns, one for each week we studied, and 2752 rows, one for each county present in both datasets. First, we defined the total number of weeks, which was 14 as discussed above, to set the width of the matrix. Next, we created another variable that stored all county codes as a list, arranged in ascending order. We then constructed an empty data frame to hold our values and filled it with a loop that matched each week and county to the corresponding number of cases. Finally, because we were interested in the change from week to week, we created a second data frame to hold the differences, comparing the same county across two consecutive weeks. For example, the number of new cases in week 2 is the total number of confirmed positive individuals measured in week 2 minus the number measured in week 1.

# Loop through all of the weeks to count total cases per county each week
# (CC_matrix, the county-by-week data frame described above, is created beforehand)
for (w in 1:totalweeks) {
    # Subset the df produced in step 1 to just one week
    subset_week <- highest_count[(highest_count$week_number == w), ]
    # Loop through the rows in the subset (seq_len() skips weeks with no rows)
    for (r in seq_len(nrow(subset_week))) {
        county_id <- as.numeric(subset_week$fips[r])         # Store the county code in a variable
        number_cases <- as.numeric(subset_week$cases[r])     # Store the number of cases in a variable
        # Store the number of cases at the intersection of the county code and the week
        CC_matrix[which(as.numeric(row.names(CC_matrix)) == county_id), w] <- number_cases
    }
}

# Create a new df that contains only new cases each week
CC_matrix_diff <- matrix(NA, ncol = ncol(CC_matrix), nrow = nrow(CC_matrix), byrow = TRUE)
cols <- ncol(CC_matrix)               # 14 should be the number of weeks
cols_minus_1 <- ncol(CC_matrix) - 1
rows <- nrow(CC_matrix)               # Should be 2thousandish, the amount of counties
rownames(CC_matrix_diff) <- 1:rows
colnames(CC_matrix_diff) <- 1:cols

# Loop through each column of your cumulative totals df to keep only new cases of COVID
# (the number of new cases in week 2 will be the total number measured in week 2, minus the number measured in week 1)
for (i in 1:rows) {
    for (j in 1:cols_minus_1) {
        CC_matrix_diff[i, j + 1] <- as.numeric(CC_matrix[i, j + 1]) - as.numeric(CC_matrix[i, j])
    }
}

# Week 1 has no previous week, so its new cases are its cumulative count
CC_matrix_diff <- as.data.frame(CC_matrix_diff)
CC_matrix_diff$"1" <- CC_matrix$"1"
CC_matrix_diff <- as.matrix(CC_matrix_diff)
rownames(CC_matrix_diff) <- rownames(CC_matrix)
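
The week-to-week subtraction can also be written without nested loops. The sketch below is an alternative on toy data, not the original code. The matrix cumulative and its values are hypothetical, and it assumes every cell holds a number (no missing weeks).

```r
# Toy cumulative counts (hypothetical): rows are counties, columns are weeks.
cumulative <- matrix(
  c(0, 2, 5, 9,
    1, 1, 4, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1001", "1003"), as.character(1:4))
)

# Shift the columns right by one week, padding week 1 with zeros, then subtract.
previous <- cbind(0, cumulative[, -ncol(cumulative), drop = FALSE])
new_cases <- cumulative - previous
new_cases
```

As in the loop version, week 1 keeps its cumulative count because there is no earlier week to subtract.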

STEP 4

The final step, step 4, built on the county-by-week data from the earlier steps: the weekly totals from step one, reshaped in step three into one row per county and one column per week, and then differenced into weekly new cases. This shows the growth in cases explicitly and lets us produce an edge list showing whether transmission happened in a two-week period. For every pair of consecutive weeks, we looped through all pairs of neighboring counties to see which pairs met both of the following conditions:

Whether the original county (orgfips) had an increase in the number of cases of COVID in week 1, the first week of the pair.
Whether the adjacent county (adjfips) had an increase in the number of cases of COVID in week 2, the second week of the pair.

From this, we determined whether transmission had occurred. We then subsetted the data frame, keeping only the edges connected through transmission. This gave us a new edge list for each two-week period and let us observe the spread of cases over time.

# Subset dataframe, keeping only the edges connected by whether there was transmission
COVID_week <- function(first_week, second_week) {

    # New cases for the two weeks being compared
    week_pair <- CC_matrix_diff[, first_week:second_week]

    COVID_pairs <- as.data.frame(NEIGHBORS)
    COVID_pairs$transmission <- NA   # logical flag, filled in below

    for (n in seq_len(nrow(COVID_pairs))) {
        orgfips <- as.numeric(COVID_pairs$orgfips[n])
        new_cases_w1 <- as.numeric(week_pair[which(row.names(week_pair) == orgfips), 1])
        adjfips <- as.numeric(COVID_pairs$adjfips[n])
        new_cases_w2 <- as.numeric(week_pair[which(row.names(week_pair) == adjfips), 2])
        # Transmission: new cases in the original county in the first week
        # and new cases in the adjacent county in the second week
        transmission <- new_cases_w1 > 0 & new_cases_w2 > 0
        COVID_pairs$transmission[n] <- transmission
    }

    # Keep only the pairs flagged as transmission (which() drops NA as well as FALSE)
    COVID_pairs <- COVID_pairs[which(COVID_pairs$transmission), ]
    COVID_pairs <- COVID_pairs[, 1:2]
    COVID_pairs <- as.matrix(COVID_pairs)
    row.names(COVID_pairs) <- NULL
    return(COVID_pairs)
}
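
To see the transmission rule at work on something small, here is a self-contained sketch with toy data. The counties, counts and neighbor pairs are hypothetical, and the row lookup uses match() instead of a loop, which is our simplification rather than the original approach.

```r
# Toy weekly new cases (hypothetical): rows are county FIPS codes, columns are weeks.
new_cases <- matrix(
  c(3, 0,
    0, 2,
    0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1001", "1003", "1005"), c("1", "2"))
)

# Toy neighbor pairs (hypothetical).
neighbors <- data.frame(
  orgfips = c(1001, 1001, 1003),
  adjfips = c(1003, 1005, 1001)
)

# Look up each county's row once instead of searching inside a loop.
org_row <- match(as.character(neighbors$orgfips), rownames(new_cases))
adj_row <- match(as.character(neighbors$adjfips), rownames(new_cases))

# Same rule as in the text: new cases in the original county in the first week
# and new cases in the adjacent county in the second week.
transmission <- new_cases[org_row, 1] > 0 & new_cases[adj_row, 2] > 0

edge_list <- as.matrix(neighbors[which(transmission), ])
rownames(edge_list) <- NULL
edge_list
```

Only the pair 1001 to 1003 survives: county 1001 has new cases in the first week and its neighbor 1003 has new cases in the second.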

We converted this process into a function and looped it over all our two-week intervals. This produced a plot for each time slice, for which we then calculated centrality measures, including betweenness.
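
The article does not say which package computed the network statistics. As a hedged sketch, assuming the igraph package and treating the network as undirected (both are our assumptions), the summary for one time slice could be obtained like this. The toy edge list uses hypothetical FIPS codes.

```r
library(igraph)  # assumption: the article does not name the package used

# Toy edge list (hypothetical FIPS codes), shaped like the output of COVID_week().
edges <- matrix(
  c(1001, 1003,
    1003, 1005,
    1005, 1007,
    1001, 1005),
  ncol = 2, byrow = TRUE
)

# Convert codes to character so igraph treats them as vertex names;
# a numeric matrix would be read as vertex indices instead.
g <- graph_from_edgelist(apply(edges, 2, as.character), directed = FALSE)

summary_stats <- list(
  nodes = vcount(g),
  edges = ecount(g),
  diameter = diameter(g),
  transitivity = transitivity(g),
  top_betweenness = names(which.max(betweenness(g)))
)
summary_stats
```

In this toy network, county 1005 has the highest betweenness because every shortest path to 1007 passes through it.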

RESULTS

Visual showing the progression of our network as cases spread domestically across counties.

We focused on three periods to complete an in-depth network analysis. First, we chose week period 5–6 because it offered a clear starting point for our network. Every period before it contained the same information, while every period after it grew in size with time. Next, we focused on week period 10–11 because we saw it as a good intermediate point for showing the progression of our dataset and how the virus spread and was transmitted. Finally, we looked at week period 13–14, the final period of our dataset. This two-week period was critical for analyzing the effectiveness of stay-at-home orders, which were issued with the goal of drastically reducing weekly new cases. We thought that closely analyzing these three time slices could provide a holistic view of the network’s progression.

Week 5–6

After examining our visualization from week period 5–6, we observed relatively stagnant behavior. These dates correspond to the end of February and early March, a period when the number of known cases was minimal. Though this network was small, we were still able to run an abbreviated summary analysis on it. According to our data frame, only two cases were present, corresponding to the two nodes in the plot above. As a result, we did not obtain very informative results when we ran centrality and betweenness tests.

Week 10–11

Looking at the visualization for week period 10–11, you can see the formation of more compact clusters within the network. This illustrates that more counties were spreading cases to each other across the U.S. This was the last week in March, when cases were beginning to peak throughout the U.S. At this time, evidence of transmission became more visible due to the abundance of cases. Furthermore, the statistical network summary for week period 10–11 produced many revealing conclusions.

The network in this time frame has 8907 edges, or connections, representing spread of the virus among U.S. counties, and 2363 nodes, the counties themselves. Relative to the rest of our dataset, this is a high number of edges, surpassing week period 13–14 (see below). Additionally, this network has a large diameter of 87. This is due to the multitude of clusters in the network: the counties are not all connected to each other because the nodes are still widely spread. This supports the idea that each county’s cases do not directly impact every other county in the country; there may still be independent clusters of counties that spread to each other in different places around the country, even at a peak of the virus. Our transitivity for this time period is about 0.368, meaning that about 37% of the connected triples in the network were closed, that is, formed triangles. Again, this reaffirmed that not every node, or county, is connected to all other nodes through the spread of the virus, but a significant number are. We tested centrality for this network using betweenness and found that Warrick County, Indiana, had the highest betweenness score at this point in the pandemic. Our hypothesis for this finding is that Warrick County is landlocked and centrally located in the country. This makes it a gateway: many counties connect to it on all sides, so the virus can spread through it into these counties from many different directions, extending very far.
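
For readers unfamiliar with transitivity, the figure is three times the number of triangles divided by the number of connected triples. The base R sketch below works this out by hand on a toy undirected network. The four counties A to D and their links are hypothetical and have nothing to do with the real data.

```r
# Toy undirected network (hypothetical): A, B and C form a triangle; D hangs off C.
A <- matrix(0, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
A["A", "B"] <- 1
A["B", "C"] <- 1
A["A", "C"] <- 1
A["C", "D"] <- 1
A <- A + t(A)  # make the adjacency matrix symmetric

degree <- rowSums(A)

# Connected triples: paths of length two, counted at their middle node.
triples <- sum(choose(degree, 2))

# Each triangle appears six times on the diagonal of A %*% A %*% A.
triangles <- sum(diag(A %*% A %*% A)) / 6

transitivity <- 3 * triangles / triples
transitivity
```

Here one triangle among five connected triples gives 3 × 1 / 5 = 0.6; a value of 0.368 would mean a little over a third of such triples close into triangles.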

Week 13–14

From our statistical network summary, we are drawn immediately to the changes in the number of nodes, edges, and diameter of the network from week period 10–11 to week period 13–14. It appears as though the peak of the first novel coronavirus wave took place during week period 10–11 (the last week in March 2020). Nodes, edges, and diameter all decrease significantly over this span, a decrease that can plausibly be attributed to stay-at-home orders. This makes sense, as the majority of the country was in a partial lockdown with only essential businesses open and serious warnings advising individuals to stay at home and minimize travel.

Some states, such as Rhode Island, have even arrested golfers who tried to violate the governor’s order, which states that only citizens of RI are permitted to play golf at the state’s courses. This is all an attempt to minimize the spread of the virus from hotspots at a given time. An interesting illustration of the effectiveness of these measures is the social distancing index, which shows building entry activity by state and industry and lets us track which numbers decreased the most between the middle and end of March (source: openpath.com). In week period 13–14, the top county for betweenness was Wilkinson County, located in central Georgia. Like Warrick County in Indiana, discussed above, Wilkinson County is landlocked, with the potential for easy transmission due to its proximity to numerous neighbors. The number of cases isn’t alarming, but we do see an exponential rise for Wilkinson County.

CONCLUDING REMARKS

The “Network Spread of COVID-19” visual tracks the spread of the virus over our 14-week period. We take you through different stages of the pandemic, providing qualitative and quantitative analysis. The peak appears to take place during weeks 10–12. Our visualizations show that clusters within the network form during the peak of COVID-19 cases. We start to see nodes and edges drop away towards the final week analyzed, which suggests that state closings have worked and proved vital to the lives of millions of Americans.

This experience allowed us to take an in-depth look at a real-world network and how it progressed through time. Using the computational tools of R, we simulated the progression of this virus. Though this simulation comes with many experimental limitations, we believe this analysis provides important and useful information about the virus. Moving forward, we hope to transfer this network to a geospatial map and potentially animate its growth on a map using JavaScript’s D3 library.

Full Code on Github: https://github.com/DAkinyemi/COVID_19_Network_Analysis

REFERENCES

Barabási, Albert-László. Linked: How Everything is Connected to Everything Else and What It Means for Business, Science, and Everyday Life. Perseus, 2002.

Carey, Benedict. “Mapping the Social Network of Coronavirus.” The New York Times, 13 Mar. 2020, www.nytimes.com/2020/03/13/science/coronavirus-social-networks-data.html.

“COVID-19 Social Distancing Index & Response by State and Industry.” Openpath, www.openpath.com/social-distancing-index.

Skinner, Benjamin. “Neighbor Counties.” GitHub, 2020, github.com/btskinner/spatial/blob/master/data/neighborcounties.csv.

Sun, Albert. “New York Times COVID-19 Data.” GitHub, 2020, github.com/nytimes/covid-19-data/blob/master/us-counties.csv.

Feuer, William, and Kevin Breuninger. “Billionaire Mike Bloomberg Will Help New York Develop Coronavirus Test and Trace Program, Gov. Cuomo Says.” CNBC, 22 Apr. 2020, www.cnbc.com/2020/04/22/billionaire-mike-bloomberg-will-help-new-york-develop-coronavirus-test-and-trace-program-gov-cuomo-says.html.