Identifying High Risk Airports for Disease Spread

Daniel Adams
INST414: Data Science Techniques
Apr 29, 2024

Introduction

Hundreds of thousands of international flights depart from the United States every year. This brings the global community closer together and gives travelers the opportunity to immerse themselves in cultures across the globe. Despite these benefits, air travel poses a number of risks. A critical one is the spread of life-threatening diseases, as occurred during the COVID-19 pandemic. By the time the virus's ability to spread and cause harm was recognized as a critical threat, it was too late for restrictions on international travel to prevent its spread. Very little was known about the disease at the time, and any future pandemic-level disease would likely bring the same uncertainty. For that reason, those with authority over international air travel, such as the Federal Aviation Administration (FAA) and the Centers for Disease Control and Prevention (CDC), should know which airports in the United States are used most frequently for international flights.

Question

COVID-19 was the first pandemic of its scale since international air travel became a common mode of transportation. As a result, plans to minimize the spread of disease into the United States had little basis in experience beyond theory. While contingency plans for international travel have most likely been updated since the COVID-19 pandemic, this analysis seeks to identify the most critical points of international air travel in the United States. This raises the question: what are the most critical, or most frequently used, airports for international travel within the United States?

Stakeholders and Approach

The stakeholders for this question are government agencies concerned with creating and executing a contingency plan for a pandemic-level disease: specifically, the FAA, the CDC, and those with the authority to enact the plan. This analysis seeks to determine which airports in the United States are most used for international flights, since those airports carry the greatest risk of spreading disease.

Data

To gather international flight data for the United States, I sourced a dataset from Kaggle. The dataset covers 1990 to 2020 and includes only international flights connected with United States airports. Each row represents the flights between a United States airport and a foreign airport over a one-month period. To begin the analysis, the columns of interest were selected from the full dataset. These included:

  • Year: The year of the flight
  • Usg_apt_id: The id of the United States airport
  • Usg_apt: The three letter abbreviation of the United States airport
  • Fg_apt_id: The id of the foreign airport
  • Fg_apt: The three letter abbreviation of the foreign airport
  • Total: Number of flights between the US Airport and foreign airport in that month

Data Cleaning

Before the analysis could be conducted, the dataset had to be cleaned. The full dataset contains 930,808 rows, each representing a month of flights between a United States airport and a foreign airport. For example, one row may cover Miami (MIA) to Toronto (YYZ) during March 2019. With nearly a million rows, the dataset had to be narrowed down. Narrowing it also made the data more relevant, since improved airplane capabilities now allow for more direct international flights. For example, older data from 1990 could be skewed by short layovers that are no longer necessary. Therefore, the dataset was cut down to include only international flights from 2019 and 2020. The 2019 data gives an up-to-date picture of normal travel. More interestingly, the 2020 data reveals which United States airports are used under the most restricted circumstances. After this step, 48,239 rows remained. Note that an indexing bug can occur during this process when using a pandas DataFrame. Filtering does not renumber the index, so the filtered DataFrame keeps its original row labels, and label-based lookups such as .loc[0] that assume the labels run from 0 to n - 1 can raise a KeyError or return the wrong row. This can be avoided by calling .reset_index(drop=True), which re-indexes the DataFrame from 0 to n - 1.
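The year filter and the index issue can be sketched in a few lines of pandas. This is a minimal illustration, not the code from the linked repository: the rows and flight counts below are made up, and the lowercase column names (usg_apt, fg_apt) are an assumption based on the column list above.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has 930,808 rows.
flights = pd.DataFrame({
    "Year": [1990, 2019, 2005, 2020, 2019],
    "usg_apt": ["MIA", "MIA", "JFK", "LAX", "JFK"],
    "fg_apt": ["YYZ", "YYZ", "LHR", "YVR", "YYZ"],
    "Total": [10, 120, 30, 15, 200],
})

# Keep only 2019 and 2020. The surviving rows keep their old labels (1, 3, 4),
# so filtered.loc[0] would raise a KeyError here.
filtered = flights[flights["Year"].isin([2019, 2020])]

# Renumber the index from 0 to n - 1 and discard the old labels.
recent = filtered.reset_index(drop=True)

print(list(filtered.index))  # labels left over from the original frame
print(list(recent.index))    # contiguous labels after the reset
```

After the reset, recent.loc[0] refers to the first remaining row, which is what most downstream loops expect.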

Data Analysis

Once the dataset was cleaned, the analysis could begin. I created an undirected weighted graph in which nodes represent the United States airports and the foreign airports. An edge between two nodes represents any occurrence of flights between the two airports, and its weight is the sum of the monthly flight totals across 2019 and 2020. This graph is shown below, with nodes colored from blue to red in increasing order of degree centrality and edges colored from blue to red in increasing order of weight. Nodes and edges were filtered by centrality and weight, so that only the most important nodes and edges remain. Finally, the layout was created using the Yifan Hu layout algorithm.
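One way to build such a graph is sketched below. It assumes the networkx library, which the article does not name, and uses made-up monthly rows with assumed lowercase column names. Monthly totals are summed per airport pair first, so each pair becomes a single weighted edge.

```python
import pandas as pd
import networkx as nx

# Hypothetical monthly rows: MIA-YYZ appears twice (two different months).
flights = pd.DataFrame({
    "usg_apt": ["MIA", "MIA", "JFK", "MIA"],
    "fg_apt": ["YYZ", "YYZ", "YYZ", "YVR"],
    "Total": [120, 95, 200, 40],
})

# Sum the monthly totals for each (US airport, foreign airport) pair.
edge_totals = flights.groupby(["usg_apt", "fg_apt"], as_index=False)["Total"].sum()

# Build an undirected graph; each edge stores its summed flights as "Total".
G = nx.from_pandas_edgelist(
    edge_totals, source="usg_apt", target="fg_apt", edge_attr="Total"
)

print(G.number_of_nodes(), G.number_of_edges())
print(G["MIA"]["YYZ"]["Total"])
```

Aggregating before building the graph keeps the edge weight meaningful; adding the raw monthly rows directly would let later rows overwrite earlier ones for the same pair.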

To identify the most central airports for international air travel in the United States, the analysis calculated centrality metrics on the graph. The first metric was degree centrality, reported for the top 20 nodes. This shows the nodes (airports) with the most edges (connections to other airports), regardless of weight (number of flights). As pictured below, the Miami, Toronto, JFK, and Los Angeles international airports had the highest degree centrality. This suggests that stakeholders should prioritize these airports when limiting international travel for disease control. Interestingly, the top 20 also includes Teterboro (TEB), a small relief airport in New Jersey.
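To make the metric concrete, here is a small standard-library sketch of degree centrality using one common normalization: the number of distinct neighbors divided by n - 1, where n is the number of nodes. The edge list and weights are hypothetical, and the article does not state which normalization was used.

```python
from collections import defaultdict

# Hypothetical (US airport, foreign airport, total flights) edges.
edges = [
    ("MIA", "YYZ", 215),
    ("MIA", "YVR", 40),
    ("JFK", "YYZ", 200),
    ("MIA", "LHR", 5),
]

# Record each airport's distinct neighbors; the weight is deliberately ignored.
neighbors = defaultdict(set)
for us_airport, foreign_airport, _total in edges:
    neighbors[us_airport].add(foreign_airport)
    neighbors[foreign_airport].add(us_airport)

n = len(neighbors)
degree_centrality = {
    airport: len(adjacent) / (n - 1) for airport, adjacent in neighbors.items()
}

# Rank airports by centrality and keep at most the top 20.
top_20 = sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)[:20]
print(top_20)
```

Because the weight is ignored, a route flown five times counts the same as one flown hundreds of times, which is why a second metric is useful.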

Although these airports have the highest degree centrality, other centrality metrics can reveal more about the quality of their connections. To do this, betweenness centrality was calculated on the graph. This metric shows which nodes lie between, or bridge, other nodes. It indicates which airports have more control over international flights out of the United States, since more of the shortest paths between other airports pass through nodes with higher betweenness centrality.
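A toy example helps show what betweenness centrality rewards. The sketch below assumes networkx and uses a hypothetical chain of routes, not the real data; the airport in the middle of the chain bridges the most pairs and so scores highest.

```python
import networkx as nx

# Hypothetical chain of routes: JFK - YYZ - MIA - YVR - LAX.
G = nx.Graph()
G.add_edges_from([
    ("JFK", "YYZ"),
    ("YYZ", "MIA"),
    ("MIA", "YVR"),
    ("YVR", "LAX"),
])

# Fraction of shortest paths between other node pairs that pass through each node.
betweenness = nx.betweenness_centrality(G)

ranked = sorted(betweenness.items(), key=lambda item: item[1], reverse=True)
for airport, score in ranked:
    print(airport, round(score, 3))
```

In this chain every airport except the two endpoints has the same degree, yet their betweenness differs, which is how an airport can rank high on one metric and lower on the other.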

An interesting finding here is that JFK and LAX drop significantly when ranked by betweenness centrality. Stakeholders should therefore prioritize flights involving Toronto (YYZ), Vancouver (YVR), and Miami (MIA) during any potential spread of disease, as these three airports have the most routes passing through them. Overall, the key airports stakeholders should prioritize to prevent the spread of disease are Miami, Toronto, Vancouver, and Los Angeles.

Limitations

A key limitation of this analysis stems from the dataset used. While the data gives good insight into international flights departing from the United States, it does not include flights with the United States as the destination. This limitation does not invalidate the above analysis, but it does leave a gap in the information stakeholders need. For example, stakeholders should know which airports are most used by flights arriving in the United States. This would allow them to identify which foreign airports send a high volume of travel to the United States and therefore pose a greater risk of spreading disease.

GitHub Code

https://github.com/dadams16/INST414AdamsModules/tree/main/module2
