Travel itinerary using graph
Network graph analysis of airports and connections
Watching the number of connections between airports decline during Covid, I wondered whether there was a way to find the shortest route, in both distance and time, to visit my home country, India. I happened to be doing analysis work at the same time, so I decided to apply graph network theory to the question.
The idea is to look at the available routes between airports and find the number of flights between them. More connections between two airports should generally mean lower travel cost and a better chance of getting tickets for a desired date and time.
Though I started with global routes across all airports, I decided to keep my analysis limited to US airports for simplicity and readability of the graphs.
The codebase for this analysis can be found here:
Data source & network
I extracted two sets of data for my analysis:
1. Routes: Gives the IATA codes of the source and destination airports and the number of flights between them.
2. Airports: Gives details about airports.
After searching the web for open APIs, I found very little data, so I decided to fetch the data in JSON format from Travelpayouts using the URL below:
http://api.travelpayouts.com
The analysis models the data as a graph with the following elements:
Nodes: Airports across the USA.
Edges: The routes (inbound and outbound flights) between airports.
The weight of an edge is defined by the number of inbound/outbound connections between the airports (nodes) it joins.
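To make the node, edge, and weight terminology concrete, here is a minimal sketch with NetworkX. The airport codes are real, but the flight counts are made-up sample values, not figures from the dataset.

```python
import networkx as nx

# Hypothetical sample: (origin, destination, number of flights) triples.
sample_routes = [
    ("ORD", "DEN", 6),
    ("ORD", "LAX", 8),
    ("DEN", "LAX", 7),
    ("LAX", "SFO", 9),
]

g = nx.Graph()
# Each airport becomes a node; each route becomes an edge carrying a weight.
g.add_weighted_edges_from(sample_routes)

# Plain degree counts neighbours; weighted degree sums the edge weights.
for airport in sorted(g.nodes()):
    print(airport, g.degree(airport), g.degree(airport, weight="weight"))
```

In this toy graph LAX has three neighbours, and its weighted degree is the sum of the flight counts on its three edges.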
Data preparation
The data is collected in JSON format for both airports and routes. The airport source data is structured as shown below. It covers airports from across the globe and their details (e.g., IATA code, name, city, etc.).
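# Get the airport data dump (the routes dump is fetched separately below)
import json
import urllib.request

url = "http://api.travelpayouts.com/data/airports.json"
with urllib.request.urlopen(url) as response:
    airport_json = json.loads(response.read().decode("utf-8"))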
Output:
[
  {
    "city_code": "VEG",
    "country_code": "GY",
    "name_translations": {
      "en": "Maikwak"
    },
    "time_zone": "America/Guyana",
    "flightable": false,
    "coordinates": {
      "lat": 5.55,
      "lon": -59.283333
    },
    "name": "Maikwak",
    "code": "VEG",
    "iata_type": "airport"
  },
  {
    "city_code": "IOR",
    "country_code": "IE",
    "name_translations": {
      "en": "Kilronan"
    },
    "time_zone": "Europe/Dublin",
    "flightable": false,
    "coordinates": {
      "lat": 53.11667,
      "lon": -9.75
    },
    "name": "Kilronan",
    "code": "IOR",
    "iata_type": "airport"
  }
]
For a focused and targeted approach, I kept only the columns below to reduce the dataset.
import pandas as pd

# Prepare a subset of the data with limited columns
airport_df = pd.DataFrame.from_records(airport_json, columns=['city_code', 'country_code', 'name', 'code'])
Next, I filtered the dataset to include US airports only since my journey will start from the US.
# Filter dataframe to use USA airports only
airport_us = airport_df.query('country_code == "US"')
Next, I collect the IATA codes of the US airports. They are used below to filter the routes dataset, which gives me my nodes (airports) and edges (routes).
# Get the codes of the USA airports for filtering the route data
airport_us_in = airport_us['code']
I will now analyze the route data. The routes dataset consists of IATA codes, the airport connections, flight details, etc.
# Get the routes data dump
url = "http://api.travelpayouts.com/data/routes.json"
with urllib.request.urlopen(url) as response:
    routes_df = json.loads(response.read().decode("utf-8"))  # list of route records
Output:
{
  "airline_iata": "ZL",
  "airline_icao": null,
  "departure_airport_iata": "NTL",
  "departure_airport_icao": null,
  "arrival_airport_iata": "BNK",
  "arrival_airport_icao": null,
  "codeshare": false,
  "transfers": 0,
  "planes": [
    "SF3"
  ]
},
{
  "airline_iata": "ZL",
  "airline_icao": null,
  "departure_airport_iata": "NTL",
  "departure_airport_icao": null,
  "arrival_airport_iata": "SYD",
  "arrival_airport_icao": null,
  "codeshare": false,
  "transfers": 0,
  "planes": [
    "SF3"
  ]
}
Next, I keep only the departure and arrival airports from the routes data.
# Prepare subset of data with limited columns
routes_us = pd.DataFrame.from_records(routes_df, columns=['departure_airport_iata', 'arrival_airport_iata'])
# Add a column with the number of aircraft types listed for each route record
# (computed per record rather than from the first record only)
routes_us['flights'] = [len(route["planes"]) for route in routes_df]
Since I plan to start my travel from the US, I will keep only the routes between US airports.
# Filtered routes for origin and destination airports within USA
routes_us_f = routes_us.loc[
    routes_us['departure_airport_iata'].isin(airport_us_in)
    & routes_us['arrival_airport_iata'].isin(airport_us_in)
]
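The filtering steps so far can be tried offline on a handful of records. This is a minimal sketch: the sample records below are hypothetical stand-ins shaped like the two dumps, and the names `airports_sample`, `routes_sample`, `us_codes`, and `domestic` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample records shaped like the airports and routes dumps.
airports_sample = [
    {"city_code": "CHI", "country_code": "US", "name": "Sample airport A", "code": "ORD"},
    {"city_code": "DEN", "country_code": "US", "name": "Sample airport B", "code": "DEN"},
    {"city_code": "SYD", "country_code": "AU", "name": "Sample airport C", "code": "SYD"},
]
routes_sample = [
    {"departure_airport_iata": "ORD", "arrival_airport_iata": "DEN", "planes": ["738"]},
    {"departure_airport_iata": "ORD", "arrival_airport_iata": "SYD", "planes": ["789"]},
    {"departure_airport_iata": "SYD", "arrival_airport_iata": "DEN", "planes": ["789", "388"]},
]

# Keep only the columns needed and collect the US airport codes.
airports = pd.DataFrame.from_records(airports_sample, columns=["country_code", "code"])
us_codes = set(airports.loc[airports["country_code"] == "US", "code"])

routes = pd.DataFrame.from_records(
    routes_sample, columns=["departure_airport_iata", "arrival_airport_iata"]
)

# A route is kept only when both of its endpoints are US airports.
both_us = routes["departure_airport_iata"].isin(us_codes) & routes["arrival_airport_iata"].isin(us_codes)
domestic = routes[both_us]
print(domestic)
```

Only the ORD to DEN record survives, because the other two touch an airport outside the US.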
Now I will count the routes between each pair of airports. Note that the grouping below is by ordered (departure, arrival) pair, so A to B and B to A are counted separately. In other words, it counts the permutations, not the combinations, of the routes.
# Count the route records for each (departure, arrival) pair
routes_us_g = routes_us_f.groupby(['departure_airport_iata', 'arrival_airport_iata']).size().reset_index(name='counts')
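If the goal is to count combinations, so that a route is tallied the same way in either direction, one option is to sort each pair of codes before grouping. This is a sketch on made-up rows; the column names `airport_a` and `airport_b` are my own and do not appear in the original code.

```python
import pandas as pd

# Hypothetical directed route records (one row per airline and direction).
routes = pd.DataFrame(
    {
        "departure_airport_iata": ["ORD", "DEN", "ORD", "LAX"],
        "arrival_airport_iata": ["DEN", "ORD", "LAX", "ORD"],
    }
)

# Sort each pair alphabetically so ORD->DEN and DEN->ORD share one key.
keys = [
    tuple(sorted(pair))
    for pair in zip(routes["departure_airport_iata"], routes["arrival_airport_iata"])
]
undirected = pd.DataFrame(keys, columns=["airport_a", "airport_b"])

# Count rows per unordered pair.
counts = undirected.groupby(["airport_a", "airport_b"]).size().reset_index(name="counts")
print(counts)
```

The four directed rows collapse into two unordered pairs, each with a count of 2.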
As you can see, this is a lot of routes. I need to narrow the dataset to focus on the better-connected airports, so I’m going to filter it further and keep only the airport pairs with more than 5 connections.
# Filter routes based on number connections more than 5
routes_us_g = routes_us_g[routes_us_g['counts'] > 5]
So, in my network graph, airports are the nodes, and routes between/through them are the edges.
Network graph
Using the data set prepared in the final step of the previous section, we can now plot graphs using networkx and matplotlib.
import matplotlib.pyplot as plt
import networkx as nx

def draw_graph(data):
    plt.figure(figsize=(50, 50))
    # 1. Create the graph
    g = nx.from_pandas_edgelist(data, source='departure_airport_iata', target='arrival_airport_iata')
    # 2. Create a layout for our nodes
    layout = nx.spring_layout(g, iterations=50)
    # 3. Styling
    nx.draw_networkx_edges(g, layout, edge_color='#AAAAAA')
    # Destination airports: light blue, sized by degree
    dest = [node for node in g.nodes() if node in data.arrival_airport_iata.unique()]
    size = [g.degree(node) * 80 for node in dest]
    nx.draw_networkx_nodes(g, layout, nodelist=dest, node_size=size, node_color='lightblue')
    # Origin airports: grey, with origins of degree > 1 overdrawn in orange
    orig = [node for node in g.nodes() if node in data.departure_airport_iata.unique()]
    nx.draw_networkx_nodes(g, layout, nodelist=orig, node_size=100, node_color='#AAAAAA')
    high_degree_orig = [node for node in orig if g.degree(node) > 1]
    nx.draw_networkx_nodes(g, layout, nodelist=high_degree_orig, node_size=100, node_color='#fc8d62')
    orig_dict = dict(zip(orig, orig))
    nx.draw_networkx_labels(g, layout, labels=orig_dict)
    # 4. Plot the graph
    plt.axis('off')
    plt.title("Connections between Airports (US)")
    plt.show()

# Pass this dataframe to draw the network graph of airport connectivities
draw_graph(routes_us_g)
I will now test the graph for various numbers of connections. The objective is to find the airports with the largest number of connections through them.
Starting with a threshold of 3: the graph below considers nodes having more than 3 connections through them. The important nodes that can be seen are the LAX, DEN, and ORD airports.
Notice that the nodes are color coded. Light blue nodes are destination airports, and their size is proportionate to the number of connections originating from or destined to them. Orange nodes are origin airports with more than one connection, and grey nodes are the remaining origins.
Based on the above graph, though many airports have more than 3 connections, only three stand out: ORD, DEN, and LAX. We will keep testing higher thresholds.
When we draw the graph with more than 5 connections, we get a smaller graph:
Here too, the same three airports stand out: LAX, DEN, and ORD.
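The threshold comparison above can also be summarized numerically before plotting. The sketch below uses made-up pair counts shaped like routes_us_g and reports how many nodes and edges remain at each threshold; `pair_counts` and `summary` are names introduced here for illustration.

```python
import networkx as nx
import pandas as pd

# Hypothetical pair counts shaped like routes_us_g.
pair_counts = pd.DataFrame(
    {
        "departure_airport_iata": ["ORD", "ORD", "DEN", "LAX", "SEA"],
        "arrival_airport_iata": ["DEN", "LAX", "LAX", "SFO", "PHX"],
        "counts": [7, 6, 4, 9, 2],
    }
)

summary = {}
for threshold in (1, 3, 5):
    # Keep only the airport pairs with more connections than the threshold.
    kept = pair_counts[pair_counts["counts"] > threshold]
    g = nx.from_pandas_edgelist(
        kept, source="departure_airport_iata", target="arrival_airport_iata"
    )
    summary[threshold] = (g.number_of_nodes(), g.number_of_edges())
    print(threshold, summary[threshold])
```

A higher threshold removes weakly connected pairs, which is why the plotted graph shrinks as the threshold grows.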
Finally, below is the full graph, showing all nodes and edges (connections) across all airports in the USA.
Centrality measures
Centrality measures quantify how important or well connected a node is within a network graph, for example how close it is to the other nodes.
Major centrality measures are as follows:
Degree Centrality
Degree centrality measures the number of connections a node has. In a directed graph, inbound and outbound connections are measured separately; in an undirected graph, it is the number of connections that originate from or are destined for a node.
Formula:
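For a graph with n nodes, the normalized degree centrality of a node v, which is what NetworkX's degree_centrality returns, is commonly written as follows. The symbols C_D, deg, and n are my notation.

```latex
C_D(v) = \frac{\deg(v)}{n - 1}
```

Here deg(v) is the number of edges attached to v, and n - 1 is the largest degree a node can have in a simple graph.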
# prepare graph object using dataset
data = routes_us_g  # assumption: the edge list prepared above
g = nx.from_pandas_edgelist(data, source='departure_airport_iata', target='arrival_airport_iata')
# calculate degree centrality
deg_cen = nx.degree_centrality(g)
# create dataframe from the dict result
data_deg_cen = pd.DataFrame(list(deg_cen.items()))
# keep airports with degree centrality above 0.05
data_deg_cen = data_deg_cen[data_deg_cen[1] > 0.05]
# plot the bar chart
plt.bar(data_deg_cen[0], data_deg_cen[1])
plt.xlabel('Airports')
plt.ylabel('Degree Centrality')
plt.show()
Output:
Closeness Centrality
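It measures how close a node is to all other nodes in the network, based on the shortest-path distances (number of hops) to them. A node that reaches most other nodes directly, without intermediate hops, scores high.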
Formula:
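For a connected graph with n nodes, closeness centrality is commonly written as the reciprocal of the average shortest-path distance, where d(v, u) is the number of hops on the shortest path between v and u. The notation is mine. NetworkX's closeness_centrality additionally rescales the value by default when parts of the graph are unreachable.

```latex
C_C(u) = \frac{n - 1}{\sum_{v \neq u} d(v, u)}
```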
# calculate closeness centrality
cl_cen = nx.closeness_centrality(g)
data_cl_cen = pd.DataFrame(list(cl_cen.items()))
# keep airports with closeness centrality above 0.05
data_cl_cen = data_cl_cen[data_cl_cen[1] > 0.05]
# plot the bar chart
plt.bar(data_cl_cen[0], data_cl_cen[1])
plt.xlabel('Airports')
plt.ylabel('Closeness Centrality')
plt.show()
Output:
Betweenness Centrality
Betweenness measures how many of the shortest routes between other pairs of nodes pass through a given node. That means a node with high betweenness connects ‘n’ pairs of other nodes via itself.
Formula:
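In the usual notation (mine, not the post's), sigma_st is the number of shortest paths between nodes s and t, and sigma_st(v) is the number of those paths that pass through v. For an undirected graph with n nodes, the normalized form that NetworkX's betweenness_centrality returns by default can be written as:

```latex
C_B(v) = \frac{2}{(n - 1)(n - 2)} \sum_{\substack{s < t \\ s \neq v,\; t \neq v}} \frac{\sigma_{st}(v)}{\sigma_{st}}
```

The leading factor divides by the number of node pairs that do not include v, so the value lies between 0 and 1.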
# calculate betweenness centrality
bet_cen = nx.betweenness_centrality(g)
data_bet_cen = pd.DataFrame(list(bet_cen.items()))
# keep airports with betweenness centrality above 0.05
data_bet_cen = data_bet_cen[data_bet_cen[1] > 0.05]
# plot the bar chart
plt.bar(data_bet_cen[0], data_bet_cen[1])
plt.xlabel('Airports')
plt.ylabel('Betweenness Centrality')
plt.show()
Output:
I’m looking for the airports with the largest number of connections through them, i.e., airports that are not only origins and/or destinations themselves but also connect other origins and destinations. Those will be my candidates since I’m looking for an international itinerary.
So, I will look at the betweenness centrality of these airports (nodes). From the above result, the three airports with the highest betweenness centrality are Chicago O’Hare (ORD), Denver (DEN), and Los Angeles (LAX):
'DEN': 0.3406938389233117,
'LAX': 0.2579880831766526,
'ORD': 0.36524506672198287
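To see how the three measures behave side by side, here is a sketch on a four-airport toy graph. The edges are invented for illustration and are not taken from the dataset.

```python
import networkx as nx

# Hypothetical toy network: ORD acts as the hub linking three other airports.
g = nx.Graph([("ORD", "DEN"), ("ORD", "LAX"), ("ORD", "SEA"), ("DEN", "LAX")])

deg = nx.degree_centrality(g)
clo = nx.closeness_centrality(g)
bet = nx.betweenness_centrality(g)

# List airports from highest to lowest betweenness.
for airport in sorted(g.nodes(), key=bet.get, reverse=True):
    print(
        f"{airport}: degree={deg[airport]:.2f} "
        f"closeness={clo[airport]:.2f} betweenness={bet[airport]:.2f}"
    )
```

In this toy graph only ORD has non-zero betweenness, because every shortest path to SEA has to pass through it, while DEN and LAX can reach each other directly.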
Conclusion:
Each airport has zero or more connections to other airports. A connection (edge) means there are flights between the two airports. I selected US airports only and further reduced the list by keeping the airports with more than three connections and, separately, more than five connections.
The two resulting graphs lead to an interesting finding: ORD (Chicago O’Hare), DEN (Denver), and LAX (Los Angeles) stand out as having the largest number of connections through them. They can be categorized as tier one of busy airports.
Tier two consists of DFW, PHX, SFO, and SEA, which also have three or more connections.
As mentioned earlier, the betweenness centrality values also indicate that the three prime candidates for an international itinerary are ORD, LAX, and DEN.
The graph shows the nodes according to their number of connections. Light blue nodes are destination airports, sized by their number of connections; orange nodes are origin airports with more than one connection.
Based on the graph analysis and betweenness centrality, we can infer that flights connecting through ORD, DEN, or LAX are good candidate routes for an itinerary within the U.S. It also suggests that tickets for those connections should offer more options even after a few ad-hoc cancellations during Covid, and that these options may be cost-effective for travelers since supply is plentiful relative to demand.
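Once the graph is built, a candidate itinerary between two airports can be read off with a shortest-path query. The sketch below uses invented edges and counts, and the idea of treating 1/counts as a cost (so that well-served legs are cheaper) is my own illustrative assumption, not part of the original analysis.

```python
import networkx as nx

# Hypothetical edges; "counts" stands for the number of route records per pair.
edges = [
    ("SEA", "DEN", 6), ("DEN", "ORD", 8), ("SEA", "PHX", 2),
    ("PHX", "DFW", 3), ("DFW", "ORD", 4), ("ORD", "LAX", 7),
]
g = nx.Graph()
g.add_weighted_edges_from(edges, weight="counts")

# Fewest hops from SEA to ORD, ignoring the counts.
fewest_hops = nx.shortest_path(g, source="SEA", target="ORD")

# Assumed cost model: a leg with more connections is cheaper to traverse.
for _, _, attrs in g.edges(data=True):
    attrs["cost"] = 1.0 / attrs["counts"]
best_served = nx.shortest_path(g, source="SEA", target="ORD", weight="cost")

print(fewest_hops)
print(best_served)
```

Here both queries return the route through DEN, since it is both the shortest in hops and the best served under the assumed cost.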
Limitations:
- Data is outdated.
- The graphs do not show the actual geographic positions of the airports on the map.
- The graphs also do not show the weight of the edges.
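The last limitation could be addressed by carrying the counts column into the graph as an edge attribute and scaling line widths from it. This is a sketch on made-up counts; the 0.5 scaling factor and the names `pair_counts` and `widths` are arbitrary choices of mine.

```python
import networkx as nx
import pandas as pd

# Hypothetical pair counts shaped like routes_us_g.
pair_counts = pd.DataFrame(
    {
        "departure_airport_iata": ["ORD", "ORD", "DEN"],
        "arrival_airport_iata": ["DEN", "LAX", "LAX"],
        "counts": [7, 6, 4],
    }
)

# edge_attr copies the counts column onto each edge.
g = nx.from_pandas_edgelist(
    pair_counts,
    source="departure_airport_iata",
    target="arrival_airport_iata",
    edge_attr="counts",
)

# One line width per edge, in the same order NetworkX iterates over edges.
widths = [attrs["counts"] * 0.5 for _, _, attrs in g.edges(data=True)]
print(widths)
```

The resulting list can be passed as the width argument of nx.draw_networkx_edges, so that busier routes are drawn with thicker lines.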
Libraries and frameworks:
Language: Python
IDE: PyCharm
Python libraries include:
Graph calculations: NetworkX
Graph plot: Matplotlib
Data manipulation: pandas
Data extraction: urllib, json
References:
You can refer to the code written for this analysis here:
https://www.sciencedirect.com/topics/computer-science/centrality-measure