Travel itinerary using graph

Nayana Kumari
Web Mining [IS688, Spring 2021]
9 min read · Mar 19, 2021

Network graph analysis of airports and connections

Seeing the number of connections between airports decline during Covid, I wondered whether there was a way to find the shortest route, in both distance and time, to visit my home country, India. My analysis work happened to fall in the same period, so I decided to use graph (network) theory to explore the question.

The idea is to look at the available routes between airports and find the number of flights between them. More connections mean less travel cost and a better chance of getting tickets for a desired date/ time.
Though I started with global routes across all airports, I decided to keep my analysis limited to US airports for simplicity and readability of the graphs.

The codebase for this analysis can be found here:

https://github.com/nt27web/WebMining-NetworkGraph

Data source & network

I extracted two sets of data for my analysis-

1. Routes: Gives the IATA codes of the source and destination airports and the flights between them.

2. Airports: Gives details about airports.

After searching the web for open APIs, I found very little usable data, so I decided to fetch the data in JSON format from Travelpayouts using the URL below:

http://api.travelpayouts.com

The analysis will focus on identifying the below aspects using the data-

Nodes: Airports across the USA.

Edges: The number of inbound and outbound flights.

The weight of the edges is defined by the number of inbound/outbound connections with other airports (nodes).
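As a rough illustration of this node/edge/weight model, the following minimal sketch builds a small weighted graph with NetworkX. The three-letter codes and counts are made-up sample values, not figures from the Travelpayouts data, and the column name `counts` simply mirrors the one used later in this article.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample: airport pairs and the number of connections between them
sample_routes = pd.DataFrame({
    "departure_airport_iata": ["AAA", "AAA", "BBB"],
    "arrival_airport_iata": ["BBB", "CCC", "CCC"],
    "counts": [7, 3, 5],
})

# Airports become nodes, routes become edges, and `counts` is stored on each edge
sample_graph = nx.from_pandas_edgelist(
    sample_routes,
    source="departure_airport_iata",
    target="arrival_airport_iata",
    edge_attr="counts",
)

# Weighted degree: total connections touching each airport
weighted_degree = dict(sample_graph.degree(weight="counts"))
print(weighted_degree)
```

Here the weighted degree of a node is the sum of the `counts` values on its edges, which is one way to read "weight" as the number of connections an airport has with its neighbours.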

Data preparation

The data was collected in JSON format for both airports and routes. The airport source data is structured as shown below; it covers airports from across the globe and their details (e.g., IATA code, name, city).

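import json
import urllib.request

# Get the airport data dump (the airports endpoint; the output below is airport data)
url = "http://api.travelpayouts.com/data/airports.json"
with urllib.request.urlopen(url) as response:
    airport_json = json.loads(response.read().decode("utf-8"))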

Output:

[
  {
    "city_code": "VEG",
    "country_code": "GY",
    "name_translations": {
      "en": "Maikwak"
    },
    "time_zone": "America/Guyana",
    "flightable": false,
    "coordinates": {
      "lat": 5.55,
      "lon": -59.283333
    },
    "name": "Maikwak",
    "code": "VEG",
    "iata_type": "airport"
  },
  {
    "city_code": "IOR",
    "country_code": "IE",
    "name_translations": {
      "en": "Kilronan"
    },
    "time_zone": "Europe/Dublin",
    "flightable": false,
    "coordinates": {
      "lat": 53.11667,
      "lon": -9.75
    },
    "name": "Kilronan",
    "code": "IOR",
    "iata_type": "airport"
  }
]

For a focused and targeted approach, I kept only the columns below and reduced the dataset.

import pandas as pd

# Prepare a subset of the data with limited columns
airport_df = pd.DataFrame.from_records(airport_json, columns=['city_code', 'country_code', 'name', 'code'])
Airport data sample
[Raw airport data]

Next, I filtered the dataset to include US airports only since my journey will start from the US.

# Filter dataframe to use USA airports only
airport_us = airport_df.query('country_code == "US"')
[Filtered data with USA airports only]

Now I will combine the airports dataset with the routes dataset to create my nodes (airports) and edges (routes). The first step is to collect the US airport codes, which I will use to filter the route data.

# Get a list of indexes with USA airports for filtering of route data
airport_us_in = airport_us['code']
[Index list containing USA airports]

I will now analyze the route data. The routes dataset consists of airline codes, departure and arrival airport IATA codes, transfer counts, plane types, etc.

# Get routes data dump
url = "http://api.travelpayouts.com/data/routes.json"
with urllib.request.urlopen(url) as response:
    routes_df = json.loads(response.read().decode("utf-8"))

Output:

{
  "airline_iata": "ZL",
  "airline_icao": null,
  "departure_airport_iata": "NTL",
  "departure_airport_icao": null,
  "arrival_airport_iata": "BNK",
  "arrival_airport_icao": null,
  "codeshare": false,
  "transfers": 0,
  "planes": [
    "SF3"
  ]
},
{
  "airline_iata": "ZL",
  "airline_icao": null,
  "departure_airport_iata": "NTL",
  "departure_airport_icao": null,
  "arrival_airport_iata": "SYD",
  "arrival_airport_icao": null,
  "codeshare": false,
  "transfers": 0,
  "planes": [
    "SF3"
  ]
}

Next, I reduce the routes data to the departure and arrival airport columns only.

# Prepare subset of data with limited columns

routes_us = pd.DataFrame.from_records(routes_df, columns=['departure_airport_iata', 'arrival_airport_iata'])
[Route data filtered with only departure and arrival airports]
# Add a column to count flights between two airports,
# using the number of planes listed on each route record
routes_us['flights'] = [len(route.get("planes") or []) for route in routes_df]
[New column added to display the number of flights]

Since I plan to start my travel from the US, I will keep only the routes whose origin and destination are both US airports.

# Filtered routes for origin and destination airports within USA
routes_us_f = routes_us.loc[(routes_us['departure_airport_iata'].isin(airport_us_in))
                            & (routes_us['arrival_airport_iata'].isin(airport_us_in))]
[Filtered routes data using airports index list]

Now I will count the routes between each pair of airports, so that an airport is counted whether it is the origin or the destination. In other words, the goal is to count the combinations (not the permutations) of the routes.

# Calculate the count between two airports in any direction

routes_us_g = pd.DataFrame(routes_us_f.groupby(['departure_airport_iata', 'arrival_airport_iata']).size().reset_index(name='counts'))
[Calculated number of flights for a combination of airports]
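One caveat worth noting: grouping by `departure_airport_iata` and `arrival_airport_iata` counts each ordered pair, so A to B and B to A land in separate rows. If the goal is to count combinations regardless of direction, one possible approach is to sort each pair of codes before grouping. The sketch below runs on made-up sample data and is not the code used for the graphs in this article; the helper name `count_undirected` and the `airport_a`/`airport_b` column names are hypothetical.

```python
import pandas as pd

def count_undirected(routes):
    """Count rows per unordered airport pair (A-B and B-A are merged)."""
    # Sort the two codes in each row so both directions share one key
    keys = [
        tuple(sorted(pair))
        for pair in zip(routes["departure_airport_iata"], routes["arrival_airport_iata"])
    ]
    pairs = pd.DataFrame(keys, columns=["airport_a", "airport_b"])
    return pairs.groupby(["airport_a", "airport_b"]).size().reset_index(name="counts")

# Hypothetical sample routes (codes are made up)
sample = pd.DataFrame({
    "departure_airport_iata": ["AAA", "BBB", "AAA", "CCC", "AAA"],
    "arrival_airport_iata":   ["BBB", "AAA", "CCC", "AAA", "BBB"],
})

undirected_counts = count_undirected(sample)
print(undirected_counts)
```

Because NetworkX builds an undirected graph from the edge list by default, the two directions collapse into one edge when plotting either way; the difference only matters for the `counts` threshold applied before plotting.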

As you can see, this is a lot of routes. To focus on the better-connected airports, I will filter the dataset further and keep only the airport pairs with more than 5 connections.

# Filter routes to keep pairs with more than 5 connections

routes_us_g = routes_us_g[routes_us_g['counts'] > 5]
[Routes data filtered further to keep only connection counts above 5]

So, in my network graph, airports are the nodes, and routes between/through them are the edges.

Network graph

Using the dataset prepared in the final step of the previous section, we can now plot graphs using networkx and matplotlib.

import matplotlib.pyplot as plt
import networkx as nx

def draw_graph(data):
    plt.figure(figsize=(50, 50))

    # 1. Create the graph
    g = nx.from_pandas_edgelist(data, source='departure_airport_iata', target='arrival_airport_iata')

    # 2. Create a layout for our nodes
    layout = nx.spring_layout(g, iterations=50)

    # 3. Styling
    nx.draw_networkx_edges(g, layout, edge_color='#AAAAAA')

    # Arrival airports: light blue, sized by degree
    dest = [node for node in g.nodes() if node in data.arrival_airport_iata.unique()]
    size = [g.degree(node) * 80 for node in dest]
    nx.draw_networkx_nodes(g, layout, nodelist=dest, node_size=size, node_color='lightblue')

    # Departure airports: gray
    orig = [node for node in g.nodes() if node in data.departure_airport_iata.unique()]
    nx.draw_networkx_nodes(g, layout, nodelist=orig, node_size=100, node_color='#AAAAAA')

    # Departure airports with more than one connection: orange
    high_degree_orig = [node for node in g.nodes() if node in data.departure_airport_iata.unique() and g.degree(node) > 1]
    nx.draw_networkx_nodes(g, layout, nodelist=high_degree_orig, node_size=100, node_color='#fc8d62')

    # Label the departure airports with their IATA codes
    orig_dict = dict(zip(orig, orig))
    nx.draw_networkx_labels(g, layout, labels=orig_dict)

    # 4. Plot the graph
    plt.axis('off')
    plt.title("Connections between Airports (US)")
    plt.show()


# Pass this dataframe to draw the network graph of airport connectivities
draw_graph(routes_us_g)

I will now draw the graph for different connection thresholds. The objective is to find the airports with the maximum number of connections through them.

Starting with a threshold of 3: the graph below considers only connections with a count of more than 3. The important nodes that can be seen are the LAX, DEN, and ORD airports.
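The threshold test can be sketched as a small helper that filters the pair counts and ranks airports by degree. This is only an illustration on made-up data; the function name `top_hubs` and its parameters are hypothetical, and the dataframe is assumed to have the same columns as `routes_us_g`.

```python
import networkx as nx
import pandas as pd

def top_hubs(routes, min_count, k=3):
    """Return the k airports with the highest degree among pairs with counts > min_count."""
    kept = routes[routes["counts"] > min_count]
    g = nx.from_pandas_edgelist(
        kept, source="departure_airport_iata", target="arrival_airport_iata"
    )
    # Sort by degree (descending), then by code so ties are ordered deterministically
    ranked = sorted(g.degree(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical sample data (codes and counts are made up)
sample = pd.DataFrame({
    "departure_airport_iata": ["AAA", "AAA", "AAA", "BBB", "CCC"],
    "arrival_airport_iata":   ["BBB", "CCC", "DDD", "CCC", "DDD"],
    "counts":                 [6, 4, 8, 2, 7],
})

for threshold in (3, 5):
    print(threshold, top_hubs(sample, threshold))
```

Raising the threshold removes weaker edges, so an airport that stays at the top across thresholds is a good hub candidate.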

Notice that the nodes are color-coded. In draw_graph, light-blue circles mark arrival airports, and orange markers mark departure airports that have more than one connection. The size of each blue circle is proportional to the number of connections (the degree) of that airport.

Network graph with the number of connections more than 3
Enhanced version for more clarity

Based on the above graphs, although many airports have more than 3 connections, only three stand out: ORD, DEN, and LAX. We will keep testing higher thresholds.

When we draw the graph for connection counts of more than 5, we get a smaller graph-

Network graph with the number of connections more than 5
Enhanced version for more clarity

Here, too, the same three airports stand out: LAX, DEN, and ORD.

Finally, for completeness, below is the full graph showing all nodes and edges (connections) across all airports in the USA.

Network graph for all airports in the USA

Centrality measures

Centrality measures quantify how important or well connected a node is within a network graph.

Major centrality measures are as follows:

Degree Centrality

In a directed graph, it measures the number of inbound and outbound connections separately (in-degree and out-degree). In a non-directed graph, it measures the total number of connections that originate from or are destined for a node.

Formula:

Degree centrality formula
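For reference, the normalized form that NetworkX's `degree_centrality` computes can be written as below, where $\deg(v)$ is the number of edges at node $v$ and $n$ is the number of nodes in the graph. The symbols are chosen here for illustration and may differ from those in the figure.

```latex
C_D(v) = \frac{\deg(v)}{n - 1}
```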
# prepare graph object using dataset
g = nx.from_pandas_edgelist(data, source='departure_airport_iata', target='arrival_airport_iata')
# calculate degree centrality
deg_cen = nx.degree_centrality(g)
# create dataframe from the dict result
data_deg_cen = pd.DataFrame(deg_cen.items())
# keep airports with degree centrality above 0.05
data_deg_cen = data_deg_cen[data_deg_cen[1] > 0.05]
# plot the bar chart
plt.bar(data_deg_cen[0], data_deg_cen[1])
plt.xlabel('Airports')
plt.ylabel('Degree Centrality')
plt.show()

Output:

Distribution of Degree centrality values among the airports having DC > 0.05

Closeness Centrality

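It measures how close a node is to all other nodes in the network, based on the lengths of the shortest paths from that node to every other node. A node with high closeness can reach the rest of the network in few hops.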

Formula:

Closeness centrality formula
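For a connected graph, closeness centrality is commonly written as the reciprocal of the average shortest-path distance, where $d(v, u)$ is the shortest-path distance between $v$ and $u$ and $n$ is the number of nodes. NetworkX's `closeness_centrality` follows this form, with an additional scaling for graphs that are not fully connected. The notation here is illustrative and may differ from the figure.

```latex
C_C(u) = \frac{n - 1}{\sum_{v \neq u} d(v, u)}
```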
# calculate closeness centrality
cl_cen = nx.closeness_centrality(g)
# create dataframe from the dict result
data_cl_cen = pd.DataFrame(cl_cen.items())
# keep airports with closeness centrality above 0.05
data_cl_cen = data_cl_cen[data_cl_cen[1] > 0.05]
# plot the bar chart
plt.bar(data_cl_cen[0], data_cl_cen[1])
plt.xlabel('Airports')
plt.ylabel('Closeness Centrality')
plt.show()

Output:

Distribution of Closeness centrality values among the airports having CC > 0.05

Betweenness Centrality

Betweenness is calculated from the number of shortest routes between other nodes that pass through a given node. A node with high betweenness sits on many of the shortest connections between pairs of other nodes.

Formula:

[Betweenness centrality formula]
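Betweenness centrality is commonly defined as follows, where $\sigma(s, t)$ is the number of shortest paths between $s$ and $t$, and $\sigma(s, t \mid v)$ is the number of those paths that pass through $v$; the sum runs over unordered pairs of nodes other than $v$. By default, NetworkX's `betweenness_centrality` also normalizes the result, which for an undirected graph with $n$ nodes means multiplying by $2 / ((n - 1)(n - 2))$. The notation here is illustrative and may differ from the figure.

```latex
C_B(v) = \sum_{\{s, t\},\; s, t \neq v} \frac{\sigma(s, t \mid v)}{\sigma(s, t)}
```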
# calculate betweenness centrality
bet_cen = nx.betweenness_centrality(g)
# create dataframe from the dict result
data_bet_cen = pd.DataFrame(bet_cen.items())
# keep airports with betweenness centrality above 0.05
data_bet_cen = data_bet_cen[data_bet_cen[1] > 0.05]
# plot the bar chart
plt.bar(data_bet_cen[0], data_bet_cen[1])
plt.xlabel('Airports')
plt.ylabel('Betweenness Centrality')
plt.show()

Output:

Distribution of Betweenness centrality values among the airports having BC > 0.05
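To build some intuition for the three measures, here is a minimal sketch on a toy hub-and-spoke network: one hub connected to four spokes. The node names are made up; the point is only that the hub scores highest on all three measures.

```python
import networkx as nx

# Toy network: "HUB" is directly connected to four spoke airports
toy = nx.Graph()
toy.add_edges_from([("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("HUB", "D")])

deg = nx.degree_centrality(toy)
clo = nx.closeness_centrality(toy)
bet = nx.betweenness_centrality(toy)

# Print degree, closeness, and betweenness centrality for each node
for node in toy.nodes():
    print(node, round(deg[node], 3), round(clo[node], 3), round(bet[node], 3))
```

Every shortest path between two spokes passes through the hub, so its betweenness is the maximum, while the spokes lie on no shortest path between other nodes and score zero.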

I’m looking for the airports with the most connections through them, i.e., airports that are origins and/or destinations and that also connect other origins and destinations. Those will be my candidates, since I’m looking for an international itinerary.

So, I will look at the betweenness centrality of these airports (nodes). From the above result, we can easily see that the three airports with the highest betweenness centrality are Chicago O’Hare, Denver, and Los Angeles-

'DEN': 0.3406938389233117,
'LAX': 0.2579880831766526,
'ORD': 0.36524506672198287
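A ranking like this can be read off the centrality dictionary by sorting it. The sketch below uses a hypothetical helper (`top_n` is not part of NetworkX) applied to the three values quoted above.

```python
def top_n(centrality, n=3):
    """Return the n (node, score) pairs with the highest scores."""
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:n]

# The three betweenness values reported in this article
bet_cen_sample = {
    "DEN": 0.3406938389233117,
    "LAX": 0.2579880831766526,
    "ORD": 0.36524506672198287,
}

ranking = top_n(bet_cen_sample)
print([code for code, _ in ranking])
```

Applied to the full `bet_cen` dictionary, the same helper would list the highest-ranked airports without reading them off the bar chart.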

Conclusion:

Each airport has zero or more connections with other airports. A connection (edge) means there are flights between the two airports. I selected USA airports only and further reduced the list by keeping connections with a count of more than three and, separately, more than five.

There are two graphs, and the findings are interesting. ORD (Chicago O’Hare), DEN (Denver), and LAX (Los Angeles) stand out, as they have the most connections through them. They can be categorized as tier one of the busy airports.

Tier two consists of DFW, PHX, SFO, and SEA. They also have three or more connections.

In line with the above findings, the betweenness centrality values also indicate that the three prime candidates for an international itinerary are ORD, LAX, and DEN.

The graph shows the nodes according to their number of connections. In the plotting code, light blue marks arrival airports, with circle size proportional to the number of connections, and orange marks departure airports with more than one connection.

Based on the graph analysis and betweenness centrality, we can infer that flights connecting through ORD, DEN, or LAX are strong candidate routes for any itinerary within the U.S. It also suggests that tickets for those connections will still offer more options even after a few ad-hoc cancellations during Covid. These options may also be cost-effective for travelers, since more supply is available to meet demand.

Limitations:

  1. Data is outdated.
  2. The graphs do not show the actual geographic positions of the airports on the map.
  3. The graphs also do not show the weight of the edges.
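The second and third limitations could be addressed along the following lines. This is a sketch only, on made-up coordinates and counts: it places nodes by longitude and latitude (as found in the `coordinates` field of the airport data) and scales edge width by `counts`. The helper names `geo_layout`, `edge_widths`, and `draw_geo_graph`, as well as the scaling factor, are assumptions for illustration.

```python
import networkx as nx
import pandas as pd

def geo_layout(airports):
    """Map each airport code to an (x, y) position of (longitude, latitude)."""
    return {row.code: (row.lon, row.lat) for row in airports.itertuples()}

def edge_widths(graph, scale=0.5):
    """Scale each edge's drawing width by its 'counts' attribute."""
    return [data["counts"] * scale for _, _, data in graph.edges(data=True)]

def draw_geo_graph(graph, pos, widths):
    """Draw the graph with geographic positions and weighted edge widths."""
    import matplotlib.pyplot as plt
    nx.draw_networkx(graph, pos, width=widths, node_color="lightblue")
    plt.axis("off")
    plt.show()

# Hypothetical sample data (codes, coordinates, and counts are made up)
airports = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC"],
    "lat": [40.0, 34.0, 41.5],
    "lon": [-105.0, -118.0, -88.0],
})
routes = pd.DataFrame({
    "departure_airport_iata": ["AAA", "AAA"],
    "arrival_airport_iata": ["BBB", "CCC"],
    "counts": [6, 8],
})

g = nx.from_pandas_edgelist(routes, source="departure_airport_iata",
                            target="arrival_airport_iata", edge_attr="counts")
pos = geo_layout(airports)
widths = edge_widths(g)
```

Calling `draw_geo_graph(g, pos, widths)` would then render the plot with thicker lines for busier connections.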

Libraries and frameworks:

Language: Python

IDE: PyCharm

Python Libraries include-

Graph calculations: NetworkX

Graph plot: Matplotlib

Data manipulation: Pandas

Data extraction: urllib, json
