Predicting Airport Cancellations Using GNNs

Zach Witzel
Stanford CS224W GraphML Tutorials
16 min read · May 15, 2023

By Zach Witzel and Xiluo He as part of the Stanford CS 224W final project

Image Source: https://www.forbes.com/sites/suzannerowankelleher/2022/06/30/the-smart-travelers-survival-playbook-for-summer-flight-cancellations/?sh=1c139b073985

Task Description and Motivation:

The aviation industry is a critical component of modern society, connecting people and businesses worldwide. However, airline infrastructure is a fragile system fraught with cancellations and delays. Flight cancellations lead to significant inconvenience for passengers and considerable costs for airlines. At the same time, there are significant amounts of data recorded about flight histories at airports nationwide. Predicting flight cancellations at a given airport can be beneficial for both airlines and passengers, enabling them to make informed decisions and minimize disruptions. Because airports naturally form the vertices of a graph and flights act as edges between them, the problem is an opportune one to solve with graph neural networks. The goal of this project is to create a GNN model that takes in flight cancellation data and associated features for airports in the United States and predicts the expected percentage of flight cancellations at a given airport and year. From there, we demonstrate the usefulness of such predictions by creating a program that finds the route between two airports with the minimum likelihood of cancellation.

Here are links to our project, colabs, and zipped data to more easily explore the material:

Dataset:

Data Sourcing and Features

Our graph is composed of airports as nodes and flights between airports as edges. We use node and edge data from two different data sources, both originating from OpenFlights, an open-source project that contains airport and commercial flight information spanning varying time periods [1].

Airport Location, Operation, Cancellation and Diversion Data from 2004–2014:

US Domestic Flights from 1990 to 2009:

For our node features, we use current information about the airport as well as past information. Current information includes the number of airport diversions, gate information, airport location, flight taxi information, and airport delays. Past information contains the label (cancellation percentage) from each of the previous 5 years, since this information would be accessible during the year of interest.

For our edge/flight features, we included information about the number of passengers per year, the number of flights taken in a year, the number of seats available per year, and the average distance of the flight.

We normalize the node and edge features as well as the node labels using our training set.
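Concretely, this can look like z-scoring every feature using statistics computed on the training nodes only. Here is a minimal sketch; the normalize_with_train_stats helper and the exact choice of statistics are illustrative assumptions, not our exact pipeline:

# Minimal sketch: z-score features/labels with training-set statistics only.
# `data` is the PyG Data object used throughout; `train_mask` comes from the
# split described in the next section.
def normalize_with_train_stats(values, train_rows):
    mean = values[train_rows].mean(dim=0, keepdim=True)
    std = values[train_rows].std(dim=0, keepdim=True).clamp(min=1e-8)
    return (values - mean) / std

data.x = normalize_with_train_stats(data.x, data.train_mask)
data.y = normalize_with_train_stats(data.y.view(-1, 1), data.train_mask).view(-1)
# Edge features have no mask of their own; one option is to normalize them
# with statistics from edges whose endpoints are both training nodes.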

Data Preprocessing

We took the most recent data we had available from our existing datasets and used 2014 airport data with 2009 flight data. While there are 383 airports in the US, we had data for 61 airports (29 large-sized airports, 26 medium-sized airports, and 6 small-sized airports). There were 123,443 flights available. However, keeping every flight would lead to over-smoothing, since virtually every node would be at most 2 hops from every other node in the network. To remedy this, we took the highest-traffic flight route from every departing airport and only considered routes with more than 200,000 annual passengers. We converted every edge into an undirected edge and coalesced duplicate edges. Through our pre-processing, we end up with an undirected, connected graph. Our final graph has 61 nodes, 27 node features, 746 edges, 4 edge features, and an average degree of 12.23.
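The threshold filtering and edge coalescing can be sketched with torch_geometric utilities. This is a hedged sketch: annual_passengers, edge_index, and edge_attr are stand-ins for the arrays built from the OpenFlights data, and the per-airport top-route selection is omitted for brevity:

import torch
from torch_geometric.utils import to_undirected, coalesce

# Keep only routes above the 200,000 annual passenger threshold.
keep = annual_passengers > 200_000
edge_index = edge_index[:, keep]
edge_attr = edge_attr[keep]

# Make every edge undirected, then merge duplicate edges by averaging
# their edge features.
edge_index, edge_attr = to_undirected(edge_index, edge_attr, reduce='mean')
edge_index, edge_attr = coalesce(edge_index, edge_attr, reduce='mean')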

For visualization purposes, we raised the annual passenger threshold to 700,000 in order to reduce the number of edges present. The color of each node represents its label, with darker colors representing a larger percentage of canceled flights. Here is a picture of our graph in which we used the latitude and longitude of each airport to overlay it on top of a Mercator projection map of the United States:

Graph Visualization for the Most Popular Flights
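A figure like the one above can be reproduced with a sketch along these lines; airport_coords, airport_label, and route_pairs are assumed stand-ins for data built earlier, and the Mercator background map is omitted:

import matplotlib.pyplot as plt
import networkx as nx

# Place each airport at its (longitude, latitude) and connect flight routes.
G = nx.Graph()
for airport, (lat, lon) in airport_coords.items():
    G.add_node(airport, pos=(lon, lat))
G.add_edges_from(route_pairs)  # list of (origin, destination) tuples

pos = nx.get_node_attributes(G, 'pos')
colors = [airport_label[a] for a in G.nodes]  # percent of canceled flights
nx.draw_networkx(G, pos, node_color=colors, cmap='Reds', node_size=80,
                 with_labels=False, edge_color='gray')
plt.show()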

Splits

Our dataset contains only a subset of the airport data that has been publicly extracted. By splitting our graph inductively, we aim to build a model that can generalize to US airports that do not appear in our dataset, as well as to international airports.

We use a 70/10/20% split between our training, validation, and test sets. We found that, because we have a small number of nodes, randomly splitting the nodes caused poor training behavior, with the particular split having a large impact on performance. We hypothesized and tested splitting the dataset evenly with respect to airport size (e.g., the training set contains a random 70% of large airports, a random 70% of medium airports, and a random 70% of small airports). The training behavior with this split was much more representative of model capability.

Here is our implementation for splitting evenly with respect to airport size:

# Split evenly wrt airport size

import numpy as np
import torch

# Get indices of large, medium and small airports
larges = []
mediums = []
smalls = []
smallandnonhub_airports = smallhub_airports + nonhub_airports
for airport, airport_id in airport_to_id.items():
    if airport in largehub_airports:
        larges.append(1)
    else:
        larges.append(0)

    if airport in mediumhub_airports:
        mediums.append(1)
    else:
        mediums.append(0)

    if airport in smallandnonhub_airports:
        smalls.append(1)
    else:
        smalls.append(0)
larges = np.array(larges)
mediums = np.array(mediums)
smalls = np.array(smalls)

# argpartition moves the indices of the 1-entries (airports of that size)
# to the end of the array; keep those indices and shuffle them.
larges_inds = np.argpartition(larges, -larges.sum())[-larges.sum():]
np.random.shuffle(larges_inds)
mediums_inds = np.argpartition(mediums, -mediums.sum())[-mediums.sum():]
np.random.shuffle(mediums_inds)
smalls_inds = np.argpartition(smalls, -smalls.sum())[-smalls.sum():]
np.random.shuffle(smalls_inds)

# Split data
train_p = 0.7
val_p = 0.1
test_p = 0.2

larges_train_inds = larges_inds[:int(train_p * len(larges_inds))]
larges_val_inds = larges_inds[int(train_p * len(larges_inds)):int(train_p * len(larges_inds)) + int(val_p * len(larges_inds)) + 1]
larges_test_inds = larges_inds[int(train_p * len(larges_inds)) + int(val_p * len(larges_inds)) + 1:]

mediums_train_inds = mediums_inds[:int(train_p * len(mediums_inds))]
mediums_val_inds = mediums_inds[int(train_p * len(mediums_inds)):int(train_p * len(mediums_inds)) + int(val_p * len(mediums_inds)) + 1]
mediums_test_inds = mediums_inds[int(train_p * len(mediums_inds)) + int(val_p * len(mediums_inds)) + 1:]

smalls_train_inds = smalls_inds[:int(train_p * len(smalls_inds))]
smalls_val_inds = smalls_inds[int(train_p * len(smalls_inds)):int(train_p * len(smalls_inds)) + int(val_p * len(smalls_inds)) + 1]
smalls_test_inds = smalls_inds[int(train_p * len(smalls_inds)) + int(val_p * len(smalls_inds)) + 1:]

train_inds = np.concatenate((larges_train_inds, mediums_train_inds, smalls_train_inds))
val_inds = np.concatenate((larges_val_inds, mediums_val_inds, smalls_val_inds))
test_inds = np.concatenate((larges_test_inds, mediums_test_inds, smalls_test_inds))

# Create split masks
num_nodes = len(train_inds) + len(val_inds) + len(test_inds)
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

for idx in train_inds:
    train_mask[idx] = True
for idx in val_inds:
    val_mask[idx] = True
for idx in test_inds:
    test_mask[idx] = True

data_split = data
data_split.train_mask = train_mask
data_split.val_mask = val_mask
data_split.test_mask = test_mask

We can see our splitting scheme in the following plots. Our graphs are colored by the size of the airport (large, medium, small) with the darkest color representing the large airports:

Training Set Graph
Validation Set Graph
Test Set Graph

Explaining the Model:

Overview of GNNs

Graph neural networks (GNNs) are a class of machine learning models designed specifically for handling data structured as graphs. With numerous applications spanning from biology to finance, GNNs have become increasingly popular in recent years. Recent research developments in this field have made GNNs more reliable and robust to varying graph structures, beating previous state-of-the-art models on many benchmarks. Current models deploy different message passing and aggregation functions to capture both local and global graph structure, allowing GNNs to perform well on node-, edge-, and graph-level prediction tasks. A simple example of a GNN uses an MLP as the message function and mean pooling as the aggregation function.

Image Source: https://snap-stanford.github.io/cs224w-notes/machine-learning-with-networks/graph-neural-networks

For each node in our graph, we get a computation tree consisting of the node's neighbors and those neighbors' neighbors. This computation tree can become arbitrarily large, but a deep computation tree can lead to over-smoothing, a phenomenon where the computation trees of many nodes look similar and thus produce similar embeddings for these nodes.

Pytorch-Geometric (PyG) is a popular Python framework built upon Pytorch for working with graph neural networks [2]. It allows us to fully define a graph and create GNN models, provides existing implementations of well-known GNN models, and supports common graph transformations for end-to-end GNN training.
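For example, our airport graph can be represented as a single PyG Data object. This is a hedged sketch; node_features, edge_pairs, edge_features, and cancellation_rates are stand-ins for the arrays produced by our pre-processing:

import torch
from torch_geometric.data import Data

data = Data(
    x=torch.tensor(node_features, dtype=torch.float),          # [61, num_node_features]
    edge_index=torch.tensor(edge_pairs, dtype=torch.long),     # [2, num_edges]
    edge_attr=torch.tensor(edge_features, dtype=torch.float),  # [num_edges, 4]
    y=torch.tensor(cancellation_rates, dtype=torch.float),     # [61] regression targets
)
print(data.num_nodes, data.num_edges, data.num_node_features)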

The GraphSAGE Model

GraphSAGE [3] is a specific GNN model that is well-suited for this task, as it can effectively learn embeddings for nodes in a graph (in this case, airports) by aggregating information from their local neighborhood (nearby airports and other relevant features).

The GraphSAGE model is appropriate for this task for several reasons:

  1. Spatial Relationships: The performance of airports and flight cancellations can be influenced by spatial factors, such as ground delays or the date of the flight. GraphSAGE can capture these relationships by aggregating information from nearby airports, allowing the model to learn spatial patterns that may be relevant to cancellations.
  2. Scalability: GraphSAGE is an inductive learning algorithm that can generalize to unseen nodes, which means it can handle graphs of varying sizes and structures. This is particularly useful for the flight cancellation prediction task, as new airports may be added over time and the structure of the graph may evolve.
  3. Feature Learning: GraphSAGE can learn node representations that incorporate both local and global information, which is crucial for this task. Local information, such as an airport’s individual attributes, as well as global information, such as relationships between airports, can both impact cancellation rates, and GraphSAGE can learn to balance these factors when making predictions.
GraphSAGE Algorithm [3]

The GraphSAGE algorithm loops for K iterations. In each iteration, it goes through each node and applies a message and aggregation function to that node based on its neighbors, then normalizes the resulting vectors. The one difference between our implementation and the above pseudocode is that we use a mean aggregator in our GraphSAGE model, replacing lines 4 and 5 of the pseudocode. Rather than concatenating the target node's vector with a separately aggregated neighborhood vector, the mean aggregator takes the element-wise mean of the target node's vector together with the vectors of its neighbors, multiplies the result by the weight matrix W, and applies a nonlinearity.

Mean Aggregator Equation [3]
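Written out, the mean aggregator from the GraphSAGE paper [3] is:

$$h_v^k \leftarrow \sigma\left( W \cdot \mathrm{MEAN}\left( \{ h_v^{k-1} \} \cup \{ h_u^{k-1}, \forall u \in \mathcal{N}(v) \} \right) \right)$$

where $h_v^{k}$ is the embedding of node $v$ after layer $k$, $\mathcal{N}(v)$ is its (sampled) neighborhood, and $\sigma$ is a nonlinearity.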

By using the GraphSAGE model to predict the expected flight cancellation percentage for a given airport, we can leverage the power of graph neural networks to analyze the complex relationships between airports and their surrounding environments. This can result in more accurate predictions and help stakeholders in the aviation industry to better anticipate and mitigate the impact of flight cancellations.

We demonstrate our GraphSAGE aggregation implementation:

# Self Created GraphSAGE CONV Model #

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_scatter
from torch_geometric.nn import MessagePassing


class MyGraphSageCONV(MessagePassing):

    def __init__(self, in_channels, out_channels, normalize=True,
                 bias=False, **kwargs):
        super(MyGraphSageCONV, self).__init__(**kwargs)

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.normalize = normalize

        # Layers for message and update functions
        self.lin_l = nn.Linear(in_features=in_channels, out_features=out_channels, bias=True)
        self.lin_r = nn.Linear(in_features=in_channels, out_features=out_channels, bias=True)

        self.reset_parameters()

    def reset_parameters(self):
        self.lin_l.reset_parameters()
        self.lin_r.reset_parameters()

    def forward(self, x, edge_index, size=None):
        # Message passing and post-processing. x may be a single feature matrix
        # (full-graph evaluation) or an (x_source, x_target) tuple when using
        # neighbor sampling.
        if isinstance(x, tuple):
            x_source, x_target = x
        else:
            x_source = x_target = x
        out = self.lin_l(x_target)                   # transform the node itself
        x_prop = self.propagate(edge_index=edge_index, x=(x_source, x_target), size=size)
        x_prop = self.lin_r(x_prop)                  # transform the aggregated neighborhood
        out = out + x_prop
        if self.normalize:
            out = F.normalize(out, p=2)
        return out

    def message(self, x_j):
        # Message function: pass neighbor features through unchanged
        return x_j

    def aggregate(self, inputs, index, dim_size=None):
        # Aggregate function: element-wise mean over each node's neighbors
        node_dim = self.node_dim
        return torch_scatter.scatter(src=inputs, index=index, dim=node_dim,
                                     dim_size=dim_size, reduce='mean')
GraphSAGE Model Visualization: (https://snap.stanford.edu/graphsage/)

Using either the above implementation of GraphSAGE or Pytorch-Geometric's SAGEConv class (with a mean aggregator), we can build a GNN by stacking two SAGEConv layers followed by a 2-layer MLP, deepening the model without over-smoothing. Our GraphSAGE GNN also uses neighbor sampling [3] (using a (15, 5) sampling scheme across the two GraphSAGE layers) to further prevent over-smoothing, and regularization (L2 weight decay and dropout [4]) to prevent overfitting.

Here is the code for our GraphSAGE implementation:

class GraphSAGENeighborSampling(torch.nn.Module):
    """GraphSAGE Model with Neighbor Sampling."""

    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.1, weight_decay=5e-3, lr=1e-3):
        super().__init__()
        self.dropout = dropout

        # GraphSAGE layers
        self.conv1 = MyGraphSageCONV(in_dim, hidden_dim)
        self.conv2 = MyGraphSageCONV(hidden_dim, hidden_dim)
        # self.conv3 = SAGEConv(hidden_dim, hidden_dim)  # uncomment for 3-layer GraphSAGE

        self.convs = [self.conv1, self.conv2]  # add self.conv3 for 3-layer GraphSAGE

        # Post-processing MLP layers
        self.post_mp = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Linear(hidden_dim, out_dim))

        self.optimizer = torch.optim.Adam(self.parameters(),
                                          lr=lr,
                                          weight_decay=weight_decay)

    def forward(self, x, adjs, train=True, return_embedding=False):
        # Use neighbor sampling for training: `adjs` is the list of bipartite
        # (edge_index, e_id, size) tuples produced by NeighborSampler.
        if train:
            for i, (edge_index, _, size) in enumerate(adjs):
                x_target = x[:size[1]]  # target nodes are listed first
                x = self.convs[i]((x, x_target), edge_index)
                x = F.elu(x)
                x = F.dropout(x, p=self.dropout, training=self.training)
        # Evaluate on the entire graph: `adjs` is the full edge_index.
        else:
            x = self.conv1(x, adjs)
            x = F.elu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)

            x = self.conv2(x, adjs)
            x = F.elu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)

        # Save GraphSAGE embedding
        embedding = x

        x = self.post_mp(x)

        if return_embedding:
            return x, F.log_softmax(x, dim=-1), embedding
        else:
            return x, F.log_softmax(x, dim=-1)

We start with 31 features for each of the 61 nodes using a full batch. Here we can see a visualization of the resulting dimensions after each step:
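As a rough guide (assuming the 31-dimensional inputs mentioned above and the hidden size of 64 used when we instantiate the model later), the shapes at each step of a full-batch forward pass look like:

# x (input node features):   [61, 31]
# conv1 + ELU + dropout:     [61, 64]
# conv2 + ELU + dropout:     [61, 64]
# post_mp MLP:               [61, 64] -> [61, 1]  (predicted cancellation rate per airport)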

Why Not Use a GCN or GAT?

Our task requires inductive generalization (as explained in the dataset section), and a graph convolutional network (GCN) [5] will not perform well because it requires seeing every node during training. We did not consider using a graph attention network (GAT) [6] because we want to explicitly tell the model which nodes are more important through the edge features, rather than having the model implicitly learn attention weights.

Evaluation

Training

We use PyG's NeighborSampler class as our data loader. NeighborSampler allows us to randomly sample different numbers of neighbors at different layers of the model. To evaluate the performance of the model, we use the coefficient of determination (R²) between our predicted cancellation rates and the true values, given by the following formula:
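In standard form, with $y_i$ the true cancellation rates, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the true values:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$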

Our training pipeline can be seen below where we use an Adam optimizer (defined in the model), mean squared error loss, weight decay, dropout, and early stopping (via recording the best model based on validation r²).

import copy

import torch
from torch_geometric.loader import NeighborSampler

train_loader = NeighborSampler(data.edge_index, node_idx=data.train_mask,
                               sizes=[10, 5], batch_size=64,
                               shuffle=True, num_workers=1)

def train_gsns(model, data, nepochs=1000):
    """Train a GraphSAGE model with Neighbor Sampling and return the trained
    final model, best model, and recorded metrics."""
    metrics = {"epoch": [], "train_r2": [], "train_loss": [], "val_r2": [], "val_loss": []}

    criterion = torch.nn.MSELoss()
    optimizer = model.optimizer
    epochs = nepochs

    best_val_r2 = -10
    best_model = None

    model.train()
    for epoch in range(epochs + 1):

        ys = []
        preds = []
        total_loss = total_examples = 0.0

        # Training
        for batch_size, n_id, adjs in train_loader:
            adjs = [adj.to(device) for adj in adjs]

            optimizer.zero_grad()
            out, log_out = model(data.x[n_id], adjs)

            loss = criterion(out.flatten(), data.y[n_id[:batch_size]].flatten())
            total_loss += loss.item()  # accumulate as a float so we don't hold onto the graph
            total_examples += out.numel()

            ys.append(data.y[n_id[:batch_size]].flatten())
            preds.append(out.flatten())

            loss.backward()
            optimizer.step()

        # Compute Training Metrics
        train_loss = total_loss / total_examples
        preds = torch.cat(preds, dim=0)
        ys = torch.cat(ys, dim=0)
        train_r2, _, _ = calculate_metric(preds, ys)

        # Compute Validation Metrics
        test_out, test_logout = model(data.x, data.edge_index, train=False)
        val_loss = criterion(test_out[data.val_mask].flatten(), data.y[data.val_mask])
        val_r2, _, _ = calculate_metric(test_out[data.val_mask], data.y[data.val_mask])

        # Print metrics every 100 epochs
        if epoch % 100 == 0:
            print(f'Epoch {epoch:>3} | Train Loss: {train_loss:.3f} | Train r^2: '
                  f'{train_r2:>6.2f} | Val Loss: {val_loss:.2f} | '
                  f'Val r^2: {val_r2:.2f}')

        # Save metrics for plotting
        metrics["epoch"].append(epoch)
        metrics["train_r2"].append(train_r2)
        metrics["train_loss"].append(train_loss)
        metrics["val_r2"].append(val_r2)
        metrics["val_loss"].append(val_loss)

        # Save best model (only consider the second half of training)
        if epoch > int(epochs / 2) and val_r2 > best_val_r2:
            best_val_r2 = val_r2
            best_model = copy.deepcopy(model)

    return model, best_model, metrics

gsagens = GraphSAGENeighborSampling(data.num_features, 64, 1, dropout=0.1, weight_decay=5e-3, lr=1e-3).to(device)
gsagens, best_gsagens, metrics_gsagens = train_gsns(gsagens, data, nepochs=1000)
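After training, the best model (selected by validation r²) can be evaluated on the held-out test nodes in the same way as validation. A brief sketch, reusing the calculate_metric helper from the training loop:

# Evaluate the best model on the test split over the full graph.
best_gsagens.eval()
with torch.no_grad():
    test_out, _ = best_gsagens(data.x, data.edge_index, train=False)
test_r2, _, _ = calculate_metric(test_out[data.test_mask], data.y[data.test_mask])
print(f'Test r^2: {test_r2:.2f}')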

Results

We ran several experiments using different hyperparameters and our key findings can be seen in the table below:

For the GCN model, we see that despite having an increasing training r² score, it fails to generalize to unseen nodes in the validation and test set. We also see that prebuilt GraphSAGE layers from Pytorch-Geometric performed better than our self-built implementation due to better optimization schemes. Our best results, using the 2-Layer GraphSAGE with dropout, have an r² of 0.85, indicating a strong relationship between the model’s prediction and the correct labels. Thus, we have effectively demonstrated how the GraphSAGE model can predict airport cancellation rates.

We also see the model immediately over-smooths when a third GraphSAGE layer is added. It even happens when we apply neighbor sampling. This is confirmed by looking at the embeddings for all nodes which are extremely similar to each other. With our graph structure, almost all nodes are 2-hops away from each other due to large airport hubs hosting flights to and from a majority of airports. As such, a 2-layer model works well, but a 3-layer one predicts all the nodes to be the same.

We found that a 2-layer GraphSAGE model with neighbor sampling and regularization (weight decay and dropout) worked best. This is likely because the regularization keeps the model from over-fitting to our small set of training examples, while neighbor sampling helps prevent nodes with large overlapping neighborhood structures (e.g., the two airports in Hawaii) from developing the same embedding.

Here are our training graphs for this model:

Visualization

After training the model, we can compare the 2D TSNE embeddings [7] of our input feature vectors with our trained GraphSAGE embeddings. We will use sklearn’s manifold.TSNE to do this:

# Plot TSNE for Trained Embeddings #

import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Find TSNE Embeddings
tsne = TSNE(2, perplexity=15, verbose=3, n_iter=2000, n_iter_without_progress=300)
tsne_proj = tsne.fit_transform(gsagens_embeddings)
tx, ty = tsne_proj[:, 0], tsne_proj[:, 1]

# Normalize labels to [0.2, 1.0] for visualization
max_y = data.y.max().detach().cpu().numpy()
min_y = data.y.min().detach().cpu().numpy()

data_y = data.y.detach().cpu().numpy()

t_max = 1.0
t_min = 0.2

min_max_y = (data_y - min_y) / (max_y - min_y) * (t_max - t_min) + t_min
min_max_y_train = min_max_y * data.train_mask.detach().cpu().numpy()
min_max_y_val = min_max_y * data.val_mask.detach().cpu().numpy()
min_max_y_test = min_max_y * data.test_mask.detach().cpu().numpy()

color_map = mpl.colormaps['Reds']
node_color = [color_map(label) for label in min_max_y]

node_airport = [id_to_airport[airport_id] for airport_id in range(len(min_max_y))]

# Plot TSNE with label as color; label_nodes is the set of airports we chose
# to annotate with their FAA codes
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(tx, ty, c=node_color, alpha=0.5)
for i, airport in enumerate(node_airport):
    if i in label_nodes:
        ax.annotate(airport, (tx[i], ty[i]))
plt.show()

The TSNE embeddings are colored based on their label, with darker colors indicating a larger percentage of canceled flights. We have also labeled a few airports with their FAA code for clarity and comparison between the two TSNE plots.

TSNE Feature Vector Projections
TSNE Projected GraphSAGE Embeddings

In the TSNE projection of the trained GraphSAGE embeddings, we can see a clear gradient in colors, signaling that the model has learned clear relationships between the node properties and their labels (equivalently, if we defined bins/thresholds for the labels to turn this into a classification problem, we would see clusters form based on class labels). On the other hand, the TSNE projection of the input feature vectors shows much less clear structure.

Flight Route Optimization

To give an example of the usefulness of our predictions, we demonstrate an algorithm that finds the route between two airports with the minimum likelihood of cancellation. The algorithm assumes independence between cancellations at each airport. First, we construct a directed graph with airports as nodes and the flights between them as edges. Our goal is to find the minimum-weight path between our source and destination, where each edge weight is the probability that a flight along that edge is canceled. Normally, this problem would be solved with a standard shortest path algorithm, such as Dijkstra's algorithm. This generally works, but we need a modification: along a path we are not adding the edge weights together but multiplying them, because we are dealing with probabilities. Because the logarithm is a monotone function that turns products into sums,

$$\log(a \cdot b) = \log(a) + \log(b),$$

we can replace each weight $a$ with $\log(a)$ in Dijkstra's algorithm in order to find the optimal path. Here is a toy example:

Suppose we start in Portland (PDX) and need to get to Chicago (ORD) and there aren't any direct flights that day. Our options are first flying to Salt Lake City (SLC) or San Francisco (SFO) and then from either of those to Chicago. Assuming equal probability of incoming and outgoing flights being canceled, suppose the model predicts that there is a 3% chance of cancellation at PDX, 4% at SFO, 4.3% at SLC, and 1% at ORD. Then, using base-10 logarithms, the score of the path through SLC would be log(0.03 · 0.043) + log(0.043 · 0.01) = -6.256 and the score of the path through SFO would be log(0.03 · 0.04) + log(0.04 · 0.01) = -6.319. The SFO path has a lower score and thus a lower probability of cancellation, so we should pick that path. This example may seem rather obvious with such a short path and so few flight options, but the benefits become more pronounced if we were doing, say, worldwide cargo shipping logistics with far more flight options and multiple possible starting and ending locations.

Flying from PDX to SFO is better than to SLC
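The arithmetic in this example is small enough to check directly. A minimal sketch; in the full program the same log-transformed weights would be fed to a shortest-path routine over the whole flight graph:

import math

# Predicted cancellation probability at each airport from the toy example.
p_cancel = {'PDX': 0.03, 'SFO': 0.04, 'SLC': 0.043, 'ORD': 0.01}

def path_score(path):
    # Sum of base-10 log edge weights; each edge weight is the product of the
    # cancellation probabilities at its two endpoint airports.
    return sum(math.log10(p_cancel[a] * p_cancel[b]) for a, b in zip(path, path[1:]))

print(path_score(['PDX', 'SLC', 'ORD']))  # approx. -6.256
print(path_score(['PDX', 'SFO', 'ORD']))  # approx. -6.319, the preferred route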

Conclusion

Our results demonstrate that graph neural networks can be successfully applied to airport and flight data. We have effectively created a GraphSAGE model which is able to predict the percent of flight cancellations at a given airport and time. From this, we were able to show the importance of our results in real world logistics scenarios. We hope that this will serve as a valuable resource for training and visualizing the results of graph neural networks.

In the future, we would like to expand the dataset to cover the full US as well as international airports. This would allow us to experiment with transductive models, create deeper networks, and work with truly random splits. We would also like to see whether our approach works for other labels of interest, such as delays. On the graph structure side, we would like to explore modeling this problem as a heterogeneous graph where flights and airports are both nodes, so that we can use link prediction to predict where flights are coming from or going to. Finally, we would like to explore the impact of architectural changes such as changing the aggregation function.

References

[1] OpenFlights Data Source. Retrieved from https://openflights.org/data.html.

[2] Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.

[3] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 1025–1035.

[4] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929–1958.

[5] Zhang, H., Lu, G., Zhan, M. et al. Semi-Supervised Classification of Graph Convolutional Networks with Laplacian Rank Constraints. Neural Process Lett 54, 2645–2656 (2022). https://doi.org/10.1007/s11063-020-10404-7

[6] Brody, S., Alon, U., & Yahav, E. (2021). How attentive are graph attention networks? arXiv preprint arXiv:2105.14491.

[7] Wattenberg, et al., “How to Use t-SNE Effectively”, Distill, 2016. http://doi.org/10.23915/distill.00002
