Clear for Takeoff: A Naive Bayes Approach to Flight Delay Predictions

Viktoria Aghabekyan
Published in AI Odyssey · Nov 15, 2023

This article draws inspiration from Arthur Hailey’s book ‘Airport,’ using its narrative essence to explore the application of Naive Bayes in predicting flight delays.

Imagine the following scenario: you are Mel Bakersfield, General Manager of the large metropolitan (and fictional) Lincoln International Airport in Chicago, and the snowstorm outside is wreaking havoc on airport operations. Today, fortune seems to have turned away from the airport's employees, and you have run into a multitude of problems: the unexpected closure of an important runway, which in turn caused a midair emergency aboard another aeroplane, and protests by residents of the neighbouring noise-sensitive suburb against planes flying over it. Additionally, your marriage to your wife Cindy is crumbling before your eyes. Needless to say, you are under an overwhelming amount of pressure, and you need to make decisions to keep the airport running.

Now that we have established the setting, we can move on to a short introduction to the statistical framework of this article. To understand the Naive Bayes Classifier, we must first understand Bayes' Theorem.

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. [1] It is mathematically expressed in the following way:

P(A|B) = P(B|A) · P(A) / P(B)

Where:

P(A|B): the posterior probability, i.e. the conditional probability of A occurring given B is true.

P(B|A): the conditional probability of B occurring given A is true, also interpreted as the likelihood of A given a fixed B.

P(A) and P(B): the probabilities of observing A and B independently of each other; P(A) is known as the prior probability and P(B) as the marginal probability (or evidence).

Let’s visualise this by applying Bayes’ Theorem to our scenario for clarification. Let P(A) be the probability of observing three flight delays at Lincoln Airport, and P(B) be the probability of observing a blizzard.

Thus, per Bayes' Theorem, the probability of observing three delays at Lincoln Airport given there is a blizzard is equal to the probability of there being a blizzard given there are three delays, times the prior probability of three delays, divided by the marginal probability of observing a blizzard. Straightforward, right?
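
In symbols, with A standing for "three flight delays" and B for "a blizzard":

P(delays | blizzard) = P(blizzard | delays) · P(delays) / P(blizzard)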

How would you, being a brilliant statistician and an experienced specialist, conduct a risk assessment and evaluate the probability of three flights being delayed in the case of a blizzard occurring?

A snowstorm, to qualify as a blizzard, has to meet the following three criteria:

1. Sustained winds or frequent gusts of 35 mph or greater.

2. Visibility reduced to under a quarter of a mile.

3. These conditions have to last for at least three consecutive hours. [2]

Let these be our independent variables. Suppose that, in our scenario, we have a fictional dataset (Figure 1) describing the weather conditions for flights.

Figure 1. Fictional dataset

Now, we can calculate the marginal and conditional probabilities from the dataset and thus apply Bayes' Theorem.

Once we have the necessary probabilities, we can compute the probability of observing three delayed flights given there is a blizzard.
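
As a minimal sketch of this calculation, here is how the three probabilities could be computed from a toy stand-in for the Figure 1 data (the dataset itself is fictional; the values below are invented so that the arithmetic lands on the 50% result discussed next):

import pandas as pd

# Hypothetical stand-in for Figure 1: one row per observation window,
# with a blizzard flag and a flag for whether three flights were delayed
data = pd.DataFrame({
    'blizzard':  [True, True, False, False, True, False, True, False],
    'delayed_3': [True, False, False, True, True, False, False, True],
})

p_delay = data['delayed_3'].mean()                                       # P(A), prior
p_blizzard = data['blizzard'].mean()                                     # P(B), marginal
p_blizzard_given_delay = data.loc[data['delayed_3'], 'blizzard'].mean()  # P(B|A), likelihood

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_delay_given_blizzard = p_blizzard_given_delay * p_delay / p_blizzard
print(f"P(delays | blizzard) = {p_delay_given_blizzard:.2f}")  # 0.50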

Thus, we can conclude that there is a 50% probability that three flights will be delayed in the case of a blizzard. How is this useful information for you, Mr Bakersfield? For several reasons: you can re-evaluate the airport's blizzard policies, including the number of runways reserved for departing aircraft and the size of the snow-clearing crew, and make a variety of other major decisions now to prevent a slowdown in the case of severe weather, which would otherwise be very costly for both the airport and the airlines.

Now that we have cultivated some basic understanding of Bayes’ theorem, we can move on to the algorithm itself, the Naive Bayes Classifier.

The Naive Bayes Classifier uses Bayes' Theorem in the following form:

P(Y|X) = P(X|Y) · P(Y) / P(X)

Where:

P(Y|X) is the posterior probability of the class given the features.

P(X|Y) is the likelihood of the features given the class.

P(Y) is the prior probability of the class.

P(X) is the marginal probability of the features (the evidence).

In our scenario, the features and the class are:

X = {Wind, Visibility, Consecutive Hours}

Y = {Delayed, Not Delayed}

Naive Bayes assumes that the features (Wind, Visibility, Consecutive Hours) are conditionally independent given the class label (Flights Delayed).
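
Under this assumption, the likelihood factorises into a product of per-feature terms:

P(X|Y) = P(Wind|Y) · P(Visibility|Y) · P(Consecutive Hours|Y)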

Calculation Steps:

- Calculate the prior probability P(Y) and the evidence P(X) from the dataset (in practice P(X) can be skipped, since it is the same for every class and does not affect which class scores highest).

- Calculate the likelihood P(X|Y) for each combination of features given the class label.

- Substitute these values into the Naive Bayes formula to calculate the posterior probability P(Y|X).

Application to a New Instance:

Suppose you have a new set of weather conditions:

X_{new} = {Wind = TRUE, Visibility = TRUE, Consecutive Hours = FALSE}

You can use the Naive Bayes Classifier to predict whether three flights will be delayed (Y = 1) or not delayed (Y = 0) based on these conditions.
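
A minimal sketch of that prediction in plain Python, assuming invented class priors and per-feature likelihoods (these numbers are for illustration only, not estimated from any dataset):

# Assumed priors P(Y) and per-feature likelihoods P(feature = TRUE | Y)
priors = {1: 0.5, 0: 0.5}
likelihoods = {
    1: {'wind': 0.8, 'visibility': 0.7, 'hours': 0.6},
    0: {'wind': 0.3, 'visibility': 0.2, 'hours': 0.4},
}

x_new = {'wind': True, 'visibility': True, 'hours': False}

# Score each class as P(Y) times the product of P(x_i | Y);
# the evidence P(X) cancels, so unnormalised scores suffice
scores = {}
for y, prior in priors.items():
    score = prior
    for feature, value in x_new.items():
        p_true = likelihoods[y][feature]
        score *= p_true if value else (1 - p_true)
    scores[y] = score

prediction = max(scores, key=scores.get)
print('Delayed' if prediction == 1 else 'Not Delayed')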

How can we code this?

For demonstration, we will use this dataset of 2019 airline delays with weather and airport details.

Note: all the code is stored in this repository.

Firstly, we import all the necessary libraries and define the data frame (df). I used the file ‘train.csv’.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, ComplementNB  # ComplementNB is used later
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

df = pd.read_csv('train.csv')

The target variable of this dataset is 'DEP_DEL15', a binary flag indicating whether the departure was delayed by more than 15 minutes.

Upon inspection, I noticed there were some categorical variables, which I encoded using LabelEncoder() from sklearn.preprocessing to ensure that all columns have a numerical data type. Furthermore, I changed 'DEP_TIME_BLK', the departure time block, to the starting hour of the block.

columns_to_encode = ['CARRIER_NAME', 'DEPARTING_AIRPORT', 'PREVIOUS_AIRPORT']

# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data for each column
for column in columns_to_encode:
    df[column] = label_encoder.fit_transform(df[column])

# Alternatively:
# df[columns_to_encode] = df[columns_to_encode].apply(lambda col: label_encoder.fit_transform(col))

# Change DEP_TIME_BLK (e.g. '0800-0859') to the starting hour of the block (8)
df['DEP_TIME_BLK'] = df['DEP_TIME_BLK'].str.split('-').str[0].astype(int) // 100

After checking the percentage of missing values in each column, I found that no column had any missing values.

percent_missing = df.isna().sum() / df.shape[0]
percent_missing.sort_values(ascending=False)

MONTH                            0.0
DAY_OF_WEEK                      0.0
DAY_HISTORICAL                   0.0
DEP_AIRPORT_HIST                 0.0
CARRIER_HISTORICAL               0.0
AWND                             0.0
TMAX                             0.0
SNWD                             0.0
SNOW                             0.0
PRCP                             0.0
PREVIOUS_AIRPORT                 0.0
LONGITUDE                        0.0
LATITUDE                         0.0
DEPARTING_AIRPORT                0.0
PLANE_AGE                        0.0
GROUND_SERV_PER_PASS             0.0
FLT_ATTENDANTS_PER_PASS          0.0
AVG_MONTHLY_PASS_AIRLINE         0.0
AVG_MONTHLY_PASS_AIRPORT         0.0
AIRLINE_AIRPORT_FLIGHTS_MONTH    0.0
AIRLINE_FLIGHTS_MONTH            0.0
AIRPORT_FLIGHTS_MONTH            0.0
CARRIER_NAME                     0.0
NUMBER_OF_SEATS                  0.0
CONCURRENT_FLIGHTS               0.0
...
DISTANCE_GROUP                   0.0
DEP_TIME_BLK                     0.0
DEP_DEL15                        0.0
DEP_BLOCK_HIST                   0.0

EDA (Exploratory Data Analysis)

I created heatmaps to show the correlation between all variables (Figure 3) and between the target variable, 'DEP_DEL15', and the ten features with which it has the highest correlation coefficients (Figure 2).

Figure 2. Ten highest correlation coefficients heatmap between target variable and features
Figure 3. Correlation matrix heatmap
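
A sketch of how these heatmaps could be produced (the figure sizes and colour maps are my own choices):

corr = df.corr()

# Figure 3: full correlation matrix heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation matrix')
plt.show()

# Figure 2: the ten features most correlated with the target variable
top10 = corr['DEP_DEL15'].drop('DEP_DEL15').abs().sort_values(ascending=False).head(10)
plt.figure(figsize=(4, 8))
sns.heatmap(corr.loc[top10.index, ['DEP_DEL15']], annot=True, cmap='coolwarm', center=0)
plt.title('Top 10 correlations with DEP_DEL15')
plt.show()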

No feature had a strong, or even moderate, correlation with the target variable.

Next, we split the dataset into training and testing sets with an 80-20 ratio using train_test_split from sklearn.model_selection.

X = df.drop('DEP_DEL15', axis=1)
y = df['DEP_DEL15']

# Split the data into train-test 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

For this dataset, I decided to implement two models, Gaussian Naive Bayes and Complement Naive Bayes, and then compare their accuracy.

Gaussian Naive Bayes is a probabilistic classification algorithm suited to continuous numerical features. It applies Bayes' Theorem with the strong independence assumption that the presence of one feature does not influence the presence of another. [3] We use Gaussian Naive Bayes when we assume that, within each class, the values of each continuous feature are normally distributed.
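
Concretely, each per-feature likelihood is modelled with a normal density, using the mean μ_Y and variance σ²_Y of that feature within each class:

P(x_i | Y) = (1 / √(2π σ²_Y)) · exp(−(x_i − μ_Y)² / (2σ²_Y))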

# Initialise the model
gnb = GaussianNB()

# Fit the model to the training data
gnb.fit(X_train, y_train)

# Make the prediction
y_pred_gnb = gnb.predict(X_test)

#Use confusion matrix and classification report to check the model's performance
conf_matrix = confusion_matrix(y_test, y_pred_gnb)

# Display the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

# Visualize the confusion matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Not Delayed', 'Delayed'], yticklabels=['Not Delayed', 'Delayed'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_gnb))

Figure 4. Confusion Matrix for Gaussian NB

How do we interpret our results?

Figure 5 is a confusion matrix for a binary classification problem.

Figure 5. Confusion Matrix
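
For reference, its standard layout for a binary problem is:

                      Predicted: Not Delayed    Predicted: Delayed
Actual: Not Delayed   True Negative (TN)        False Positive (FP)
Actual: Delayed       False Negative (FN)       True Positive (TP)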

A good model has high True Positive (TP) and True Negative (TN) counts, with low False Positive (FP) and False Negative (FN) counts. In our case (Figure 4), taking 'Delayed' as the positive class, we have high TN and FN counts: all the instances belonging to class 0 were predicted correctly, and none of the instances belonging to class 1 were.

Figure 6. Classification report for Gaussian NB

The classification report (Figure 6) shows that the model performs well in identifying instances without flight delays (Class 0), achieving precision, recall, and F1-score of 81%, 100%, and 90%, respectively. However, it fails entirely to predict instances with flight delays (Class 1), with all metrics at 0%, indicating no true positive predictions.

While the overall accuracy is 81%, it might be misleading due to the class imbalance. The model’s strength lies in handling non-delay instances but falls short when it comes to predicting delays.

Our second NB model is Complement Naive Bayes, which is well suited to imbalanced datasets. In our case, the dataset is highly imbalanced, which I verified with the value_counts function.

class_counts = df['DEP_DEL15'].value_counts()

# Display the value counts
print(class_counts)

We find that in the 'DEP_DEL15' column there are 3,683,185 instances of 0 and 859,158 instances of 1, a ratio of more than 4:1.

Complement Naive Bayes is designed to address the tendency of standard Naive Bayes to overfit to the majority class: instead of estimating each class's feature statistics from that class's own samples, it estimates them from the complement of the class, i.e. from all samples that do not belong to it.
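
A sketch of the per-class feature estimate, roughly as described in the scikit-learn documentation, where d_{i,j} is the value of feature i in sample j, the sums run over all samples j not belonging to class c, and α is a smoothing parameter:

θ_{c,i} = (α_i + Σ_{j: y_j ≠ c} d_{i,j}) / (α + Σ_{j: y_j ≠ c} Σ_k d_{k,j})

The prediction is then the class whose complement matches the sample least well.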

Since there were negative values in the Longitude and Maximum Temperature columns, which Complement Naive Bayes cannot handle (it requires non-negative feature values), I performed a zero-shift.

# Identify features with negative values
features_with_negatives = ['LONGITUDE', 'TMAX']

# Shift each feature so that its minimum on the training set becomes zero,
# then apply the same training-set shift to the test set for consistency
for feature in features_with_negatives:
    min_value = X_train[feature].min()
    if min_value < 0:
        X_train[feature] = X_train[feature] - min_value
        # Clip in case the test set contains values below the training minimum
        X_test[feature] = (X_test[feature] - min_value).clip(lower=0)

Now, we can implement Complement Naive Bayes.

cnb = ComplementNB()
cnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred_cnb = cnb.predict(X_test)

# Evaluate the model
conf_matrix2 = confusion_matrix(y_test, y_pred_cnb)

# display the confusion matrix
print("Confusion Matrix:")
print(conf_matrix2)

# visualize the confusion matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix2, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Not Delayed', 'Delayed'], yticklabels=['Not Delayed', 'Delayed'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_cnb))

As output, we obtain the following:

Figure 7. Confusion Matrix for Complement Naive Bayes
Figure 8. Classification Report for Complement Naive Bayes

The model yields an overall accuracy of 51%, indicating performance better than random guessing but with considerable room for improvement (Figure 8). Notably, it demonstrates strong precision (82%) in predicting instances without flight delays (Class 0), although its recall (51%) suggests it captures only half of such instances, giving an F1-score of 63% for Class 0. The model struggles more with predicting flight delays (Class 1), with lower precision (20%) and an F1-score of 28%. The macro average suggests moderate performance overall, while the weighted average emphasises the model's proficiency on non-delay instances due to their higher representation in the dataset.

Per the confusion matrix (Figure 7), the substantial number of false positives and false negatives suggests challenges in accurately distinguishing between the two classes. While the model demonstrates proficiency in identifying non-delay instances, it struggles to effectively capture instances with flight delays.

Comparison of models

While GNB outperforms CNB in overall accuracy and precision for non-delay instances, it exhibits challenges in predicting instances with flight delays. CNB, on the other hand, achieves a more balanced performance across both classes but with lower overall accuracy.

Advantages and Disadvantages of Naive Bayes

Naive Bayes is advantageous for its simplicity, efficiency, and scalability, making it suitable for high-dimensional datasets and categorical features. However, its reliance on the assumption of feature independence and sensitivity to input quality may limit its performance, especially in scenarios with correlated features or imbalanced data.

Applications of Naive Bayes and Conclusion

Naive Bayes is employed in a wide variety of real-life applications, including spam filtering, sentiment analysis, and recommendation systems. The model’s efficiency in handling multiple features and quick training makes it suitable for scenarios where rapid predictions are crucial, such as our scenario with the blizzard and the delays.

In conclusion, our exploration of Naive Bayes, inspired by the challenges faced by General Manager Mel Bakersfield at Lincoln International Airport, has shed light on the versatility and practicality of this algorithm. By applying Bayes’ Theorem to assess the impact of blizzards on flight delays, we demonstrated how Naive Bayes can be a valuable tool for decision-making. The implementation of both Gaussian Naive Bayes and Complement Naive Bayes on a real-world dataset further highlighted their strengths and weaknesses. While Gaussian Naive Bayes exhibited higher overall accuracy and precision for non-delay instances, Complement Naive Bayes demonstrated a more balanced performance across both classes. Overall, Naive Bayes proves valuable in scenarios requiring rapid predictions, making it a noteworthy algorithm in the realm of machine learning.

Footnotes

[1]. Joyce, James (2003), "Bayes' Theorem", in Zalta, Edward N. (ed.), The Stanford Encyclopedia of Philosophy (Spring 2019 ed.), Metaphysics Research Lab, Stanford University. Retrieved 2023-11-14.

[2]. Criteria for winter storm watches/warnings and winter weather advisories. (n.d.). National Weather Service. Retrieved 2023-11-14, from https://www.weather.gov/media/meg/WinterStormCriteriaMEG.pdf

[3]. Gaussian Naive Bayes: What You Need to Know? (n.d.). upGrad blog. Retrieved 2023-11-14, from https://www.upgrad.com/blog/gaussian-naive-bayes/

Bibliography


Hailey, A. (2015). Airport. Ishi Press International.

Awan, A. A., & Navlani, A. (2023, March 3). Naive Bayes classifier tutorial: With Python Scikit-Learn. DataCamp. https://www.datacamp.com/tutorial/naive-bayes-scikit-learn

GeeksforGeeks. (2023, September 13). Naive Bayes classifiers. https://www.geeksforgeeks.org/naive-bayes-classifiers/

Wadkins, J. (2022, January 17). 2019 airline delays w/weather and airport detail. Kaggle. https://www.kaggle.com/datasets/threnjen/2019-airline-delays-and-cancellations?rvi=1
