US Airports Analysis

Brandon Fung
INST414: Data Science Techniques
5 min read · Feb 24, 2024

In an era where globalization has shrunk distances and interconnectedness defines our world, analyzing air traffic in the United States can yield insights into economic health, regional connectivity, and environmental impact. This vital artery of global commerce and mobility does more than ferry passengers from point A to point B; it serves as a barometer for economic activity, facilitates critical supply chains, and impacts local and national economies. By delving into the nuances of air traffic patterns, we can identify emerging economic trends, understand the resilience of various regions to disruptions, and spotlight areas for infrastructure investment and environmental sustainability initiatives. Moreover, air traffic data can help answer questions in the supply chain industry, such as: which airport is the most critical hub for cargo traffic, and which airport has the most efficient connections in the US? The stakeholders asking these questions are supply chain managers at logistics companies who have to ship cargo as efficiently as possible. With answers to these questions, they can allocate their shipments across airports and optimize the shipment process.

The data was taken from Kaggle and contains all US flights from 1990 to 2008. Fields recorded consist of the origin airport, destination airport, origin city, destination city, passengers, seats, flights, distance, fly date, origin population, destination population, and the coordinates of both the origin and destination airports. With this data, I can analyze air traffic trends and identify the airports with the highest traffic and those with the highest interconnectivity.

Initially, the data was far too large to analyze given my computing constraints, so I had to clean and filter it down. First, I kept only flights from 2008, which left about 300,000 data points. Then, I filtered out all flights that carried zero passengers, which removed about 37,000 data points, as many of these flights were most likely just moving the aircraft to a different location. Although cargo flights typically do not carry passengers, I felt that the data quality for flights with zero passengers was too uncertain to support a confident analysis. Instead, using only flights that carried passengers should give a general idea of the logistics of each airport in the US. Finally, I iterated through each row in the data frame, creating a node for each origin airport and an edge to the destination airport. I used the number of passengers on each flight as the edge weight because it has the closest relationship to flight traffic.
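The filtering and edge-building steps can be sketched in a few lines of plain Python. This is a minimal illustration, not my actual notebook code: the column names (`origin`, `destination`, `passengers`, `fly_date`), the `YYYYMM` date format, and the sample rows are all assumptions made up for the example, and summing passengers over repeated routes is just one reasonable way to get a single weight per edge.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names and made-up rows, for illustration only.
SAMPLE_CSV = """origin,destination,passengers,fly_date
ATL,ORD,120,200801
ORD,ATL,95,200801
ATL,ORD,80,200802
LAX,JFK,0,200803
LAX,JFK,150,199912
"""

def build_edges(rows, year="2008"):
    """Return {(origin, destination): total passengers} for one year."""
    edges = defaultdict(int)
    for row in rows:
        passengers = int(row["passengers"])
        # Keep only the chosen year and drop zero-passenger flights.
        if not row["fly_date"].startswith(year) or passengers == 0:
            continue
        # Assumption: repeated routes are merged by summing passengers.
        edges[(row["origin"], row["destination"])] += passengers
    return dict(edges)

edges = build_edges(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(edges)
```

Each key in `edges` is a directed origin-to-destination pair, which maps directly onto a weighted edge list that a tool like Gephi can import.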

After fully cleaning the data, I could import it into Gephi for analysis. The first thing I wanted to find was which airport had the most departing passengers. This is useful because the airport with the most traffic is likely to be a major hub. Here are my results:

From this graph, we can see that the airports with the highest traffic are divided into two distinct clusters. The clusters are determined by modularity; in the context of air traffic, airports in the same cluster have more traffic with each other than with airports outside the cluster. This interpretation is supported by the airport codes of the nodes: most airports in green are on the West Coast, along with JFK, a very popular destination from California airports like LAX. Most airports in blue, however, are located on the eastern side of the US, like MCO for Orlando and BOS for Boston. The size of each node corresponds to its weighted degree, so the largest nodes are the airports with the most departing flights and passengers. ATL, or Hartsfield–Jackson Atlanta International Airport in Atlanta, GA, had the most departing passengers and flights.
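To make the node-size measure concrete, here is a minimal sketch of a weighted out-degree calculation, which is the total passenger weight on edges leaving each airport. Gephi did this for me; the function name and the toy passenger counts below are invented for illustration.

```python
from collections import defaultdict

def weighted_out_degree(edges):
    """Sum the edge weights leaving each origin airport."""
    totals = defaultdict(int)
    for (origin, _destination), weight in edges.items():
        totals[origin] += weight
    return dict(totals)

# Made-up passenger counts, for illustration only.
toy_edges = {
    ("ATL", "ORD"): 200,
    ("ATL", "MCO"): 150,
    ("ORD", "ATL"): 95,
    ("LAX", "JFK"): 300,
}

out_deg = weighted_out_degree(toy_edges)
busiest = max(out_deg, key=out_deg.get)
print(busiest, out_deg[busiest])
```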

Next, I wanted to find out which airports had the most interconnectivity to other airports. These airports act as a bridge, connecting two different clusters. They facilitate efficient travel between more remote or less connected airports, highlighting their strategic importance in the network. Here are the results:

This graph shows that MSP, or Minneapolis–Saint Paul International Airport, and PHX, or Phoenix Sky Harbor International Airport, lead the way in terms of interconnectivity. To get these results, I calculated the betweenness centrality of each airport and mapped it to the size of the nodes. If we look closely at the edges from these two airports, we see many edges that go to both their own cluster and the other cluster, showing how these airports act as bridges between the two clusters, or regions, of the US.
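For readers curious about what betweenness centrality measures, below is a small sketch of Brandes' algorithm on an unweighted directed graph. It counts, for each airport, how many shortest paths between other airports pass through it. This is an illustration rather than what Gephi runs internally, and the five-airport network is a toy example I made up in which MSP is the only link between a western pair and an eastern pair.

```python
from collections import defaultdict, deque

def betweenness(adj):
    """Brandes' algorithm; adj maps each node to a list of successors."""
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack = []                       # nodes in order of discovery
        preds = defaultdict(list)        # shortest-path predecessors
        sigma = dict.fromkeys(nodes, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:                     # accumulate from farthest node back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Toy network: SFO-LAX in the west, JFK-BOS in the east, MSP in between.
toy_adj = {
    "SFO": ["LAX"],
    "LAX": ["SFO", "MSP"],
    "MSP": ["LAX", "JFK"],
    "JFK": ["MSP", "BOS"],
    "BOS": ["JFK"],
}

scores = betweenness(toy_adj)
print(max(scores, key=scores.get))
```

In this toy network every route between the two coasts has to pass through MSP, so it ends up with the highest score, which is the same "bridge" behavior described above.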

Lastly, and arguably most importantly, I wanted to see which airports had not only the highest connectivity, but the highest connectivity to important airports. These airports are well-connected to other well-connected airports, making them highly important when traveling across the US. Here are my findings:

This graph does not make it as obvious which airport has the most connectivity to other well-connected airports. To construct it, I calculated the eigenvector centrality of each airport and mapped that to the size of the nodes. After going to the Data Laboratory and manually viewing the eigenvector centralities, I found that ORD, or O’Hare International Airport in Chicago, IL, ranked highest with an eigenvector centrality of 1. Gephi scales these scores so that the top-ranked node has a value of 1, so this number marks ORD as the highest-ranked airport rather than signaling an anomaly. ORD's top ranking could be due to factors such as being part of a tightly interconnected group of airports that dominates the entire network in terms of connectivity. Nonetheless, this graph shows that all these airports are very important to the network as a whole and serve as pivotal connection points to the rest of the US.
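The idea behind eigenvector centrality can be sketched with a simple power iteration: each airport's score is repeatedly replaced by the sum of its neighbors' scores, then rescaled so that the largest score is 1. This is a simplified illustration on an undirected toy network that I made up, not Gephi's implementation; adding the node's own score at each step is a common trick I am assuming here to keep the iteration from oscillating.

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration on an undirected graph; the top score is scaled to 1."""
    score = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        # New score = own score + sum of neighbors' scores (avoids oscillation).
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        top = max(new.values())
        score = {v: x / top for v, x in new.items()}
    return score

# Toy network: ORD touches everything, ATL and DFW also touch each other.
toy_adj = {
    "ORD": ["ATL", "DFW", "DEN"],
    "ATL": ["ORD", "DFW"],
    "DFW": ["ORD", "ATL"],
    "DEN": ["ORD"],
}

scores = eigenvector_centrality(toy_adj)
for airport, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(airport, round(value, 3))
```

Because of the rescaling step, the highest-ranked node in this sketch always ends up with a score of exactly 1, and every other score is relative to it.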

To answer my initial questions, Hartsfield–Jackson Atlanta International Airport in Atlanta, GA is the most critical hub due to the sheer volume of traffic that it handles annually. In terms of efficiency, O’Hare International Airport in Chicago, IL comes out on top, being the most connected to other well-connected airports. The next time you track your shipment, you will know why it may travel through these airports.

One limitation to consider, however, is the fact that the data used is from 2008. Air traffic trends could very well have changed since then, so this analysis serves purely as an example and is not meant to be used for real-world applications. I also filtered out flights with zero passengers, which can seem counterintuitive given that cargo flights do not carry passengers. However, many flights with zero passengers may also carry no cargo or very little cargo, so there is no way of ensuring the quality of the data without also collecting this information. Overall, this analysis tells a story of which airports are the most “important” in the US, depending on the context of the question.

The code for my analysis can be found here
