Predicting Flight Delays
Flight delays have become a serious problem for air transportation systems around the world, and the aviation industry continues to suffer the associated economic losses. According to the Bureau of Transportation Statistics (BTS) of the United States, more than 20% of U.S. flights were delayed in 2018, with an economic impact in the U.S. equivalent to 40.7 billion dollars per year. Passengers lose time and miss business or leisure opportunities, while airlines attempting to make up for delays burn extra fuel, adding to the adverse environmental impact. To alleviate the negative economic and environmental effects of unexpected flight delays, and to balance growing flight demand against growing delays, airports need accurate flight delay predictions.
Airport delays may result from airline operations, air traffic congestion, weather, air traffic management initiatives, and other factors. Most of these are stochastic phenomena that are difficult to predict in a timely and accurate manner.
The goal of this project is to develop a computational model for predicting flight delays, based on flight data extracted from Kaggle.
The project has five phases. The first phase is getting the data from Kaggle and storing it in PostgreSQL. The second phase is data cleaning: after loading the data into the database, I cleaned it mainly according to business needs. The third phase is feature engineering, where features for the machine learning model are created from the raw data. The fourth phase is exploratory data analysis, in which I created graphics to understand the data. The fifth phase is model analysis, where I applied machine learning algorithms to the dataset.
Storing Data in PostgreSQL
All U.S. airline flight data for 2018 were obtained from Kaggle. At last count, we have over 7 million rows of on-time performance data stored in a PostgreSQL table that is accessible from a Jupyter notebook.
The dataset has detailed information on the airline, airport, flight number, and so on. Most of the other columns are time-related and measured in minutes. It also has the delays broken out by type: carrier, weather, NAS, security, and late aircraft.
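As a quick illustration, here is a minimal sketch of pulling the table into pandas from the notebook. The table name flights_2018 and the connection credentials are placeholders, not the actual values used in the project:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: postgresql://user:password@host:port/database
engine = create_engine("postgresql://user:password@localhost:5432/flights")

# Read the on-time performance table into a DataFrame
df = pd.read_sql("SELECT * FROM flights_2018", engine)
print(df.shape)  # roughly 7 million rows before cleaning
```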
Data Cleaning
At the beginning my dataset contained over 7 million flight records. I then noticed that many flights did not have all of their data available, and this unavailability of features was the main reason for eliminating flights from the dataset.
Canceled flights are not delayed flights: a canceled flight never happened, so its values do not help the analysis. I therefore filtered out canceled flights.
I also dropped rows with null values in the actual_elapsed_time column, which I intended to use later.
After removing those flights, I was left with a dataset of around 6 million records with all information available.
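The cleaning steps can be sketched in pandas roughly as follows; the column names cancelled and actual_elapsed_time follow the Kaggle dataset and may differ in another schema:

```python
# Drop canceled flights: they never departed, so their delay values are meaningless
df = df[df["cancelled"] == 0]

# Drop rows missing the elapsed-time value needed later
df = df.dropna(subset=["actual_elapsed_time"])

print(len(df))  # around 6 million rows remain
```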
Feature Engineering
Feature engineering means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.
A label column delayed is created and set to 1 if any of the delay-type columns (carrier, weather, NAS, security, or late aircraft) has a value.
The airline code (op_carrier) and airport code (origin/dest) columns are converted from object to int64.
The flight date column fl_date is split into new fl_month and fl_day columns, which are populated from fl_date.
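A rough sketch of these three steps in pandas, assuming the delay columns are named as in the Kaggle dataset:

```python
import pandas as pd

# Label: 1 if any delay-type column carries a positive value
delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]
df["delayed"] = (df[delay_cols].fillna(0).sum(axis=1) > 0).astype("int64")

# Encode carrier and airport codes as integers
for col in ["op_carrier", "origin", "dest"]:
    df[col] = df[col].astype("category").cat.codes.astype("int64")

# Split the flight date into month and day features
df["fl_date"] = pd.to_datetime(df["fl_date"])
df["fl_month"] = df["fl_date"].dt.month
df["fl_day"] = df["fl_date"].dt.day
```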
Exploratory Data Analysis (EDA)
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
While there are an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both distribution of single variables and relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python.
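For example, a pairs plot over a few columns can be drawn with Seaborn. The column names dep_delay, arr_delay, and distance are illustrative, and sampling keeps the plot responsive:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sample = df.sample(n=10_000, random_state=42)  # plotting millions of points is slow
sns.pairplot(sample[["dep_delay", "arr_delay", "distance", "delayed"]],
             hue="delayed")
plt.show()
```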
A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features in a dataset. We can use either Matplotlib or Seaborn to create the heatmap. To get the correlation of the features inside a dataset we can call <dataset>.corr(), which is a Pandas DataFrame method. This will give us the correlation matrix.
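A minimal sketch with Seaborn, assuming df is the cleaned DataFrame from the steps above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric features
corr = df.corr(numeric_only=True)

# Render the matrix as a color-coded heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```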
When we look at the data grouped by the target column delayed, we can see that the classes are not well balanced: 19% of the flights are delayed and 81% are not.
Machine learning techniques such as Decision Tree and Logistic Regression have a bias toward the majority class. To handle the imbalanced class distribution, I used random under-sampling, which reduces the majority class to the size of the minority class. Hereby, both classes have an equal number of entries.
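One way to do this is with RandomUnderSampler from the imbalanced-learn library; the feature selection below is illustrative:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Features and label; the raw date column is dropped since month/day were extracted
X = df.drop(columns=["delayed", "fl_date"])
y = df["delayed"]

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # equal counts for both classes
```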
Model Analysis
In order to run the ML algorithms more easily on my local machine, I limited the dataset to 1 million rows and settled on my feature columns.
I trained 10 algorithms, including ensemble models, such as Decision Tree, Random Forest, Bagging, and KNN. I compared their train and test accuracy scores, precision, recall, and F1 scores.
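The comparison loop looked roughly like the sketch below; only four of the ten models are shown, and the resampled X_res/y_res from the previous step are assumed:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: train_acc={model.score(X_train, y_train):.3f} "
          f"test_acc={accuracy_score(y_test, pred):.3f} "
          f"precision={precision_score(y_test, pred):.3f} "
          f"recall={recall_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f}")
```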
According to the resulting list, the best model was Random Forest. Next, I examined whether the best-performing models could be further optimized using grid search and randomized search.
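A sketch of the tuning step for Random Forest; the parameter grid is illustrative rather than the one actually searched, and RandomizedSearchCV works the same way with parameter distributions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid; the original search ranges are not documented here
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="f1", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```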
After the parameter tuning, I gathered the optimized models alongside the others in a comparison table, and plotted a ROC curve with the AUC scores, as shown below.
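The ROC curve for a single tuned model can be plotted along these lines, assuming grid and the test split from the previous steps:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive (delayed) class from the tuned model
probs = grid.best_estimator_.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```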
Conclusion
Finally, when we look at the ROC curve, we can say that the Random Forest classifier brings us the most accurate results.
The GitHub repository for the data collection and preprocessing is here.
Thank you for your time and reading my article. Please feel free to contact me if you have any questions or would like to share your comments.