Unfriendly Skies: Predicting Flight Cancellations Using Weather Data, Part 4

Tim Bohn and Ricardo Balduino

Introduction

In Part 1 of this series, we wrote about our goal: to explore a single use case across various machine learning platforms and see how we might build classification models with those platforms to predict flight cancellations. Specifically, we hoped to predict the probability of cancellation for flights between the ten U.S. airports most affected by weather, using historical flight data and historical weather data to make predictions for upcoming flights.

In Part 2, we started our exploration with IBM SPSS Modeler, and in Part 3 we looked at using a Python Jupyter notebook in IBM Watson™ Studio to do the same thing.

With SPSS Modeler starting to be integrated into Watson Studio, we thought it was time to check out how easy it would be to migrate what we did in SPSS Modeler into a “Modeler flow” in Watson Studio. For this, we tried out the new IBM Watson Studio Desktop, whose Beta version came out in mid-October 2018.

Tools used in this use case solution

IBM Watson Studio is a cloud data science and machine learning platform that makes it easier for data scientists, data engineers, developers and domain experts to prepare their data for data science and AI, regardless of their expertise. The platform then helps them work together to build, train, deploy and manage custom machine learning and deep learning models, all in one environment. Watson Studio provides a range of integrated frameworks and tooling for different skill levels, from programming environments to visual, drag-and-drop GUIs. It’s built on open source tools like Jupyter Notebooks, Python, R, Scala, Anaconda and RStudio. Also, as mentioned, SPSS Modeler has begun to be integrated into Watson Studio. There is also an on-premises version, Watson Studio Local, for times when your data has to stay behind your firewall; this version adds Decision Optimization tooling.

In addition, IBM has introduced a new product called IBM Watson Studio Desktop to address the need for data preparation and modeling on an individual’s desktop. It’s currently in Beta, and the first release has SPSS Modeler built in. We decided to try that one to see how it compared to the SPSS Modeler we used at the beginning of this series. Figure 1 shows what the project looks like; it is very much the same as the other two Watson Studio flavors.

Figure 1 — IBM Watson Studio Desktop Project

SPSS in Watson Studio Desktop

Flights data — At the beginning of this project we gathered 2016 flight data from the US Bureau of Transportation Statistics website. The website allowed us to export only one month at a time, so we ended up with 12 CSV (comma-separated value) files. We first merged all the CSV files into one set and then selected the ten airports in our scope. We did some data clean-up and formatting to validate dates and hours for each flight, as seen in Figure 2. Here we had to add a couple of nodes because Watson Studio SPSS read a column of the input data as an integer where SPSS Modeler had read it as a string, so a couple of new nodes handle that conversion.

Figure 2 — Gathering and preparing flights data in Watson Studio SPSS
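
For readers who prefer code, here is a rough pandas sketch of what this part of the flow does. It is not the SPSS flow itself, and the file names, the airport list, and column names such as FL_DATE and CRS_DEP_TIME are assumptions based on a typical BTS export.

```python
import glob
import pandas as pd

# Hypothetical placeholder for the ten airports in scope.
AIRPORTS = ["ORD", "DFW", "DEN", "EWR", "LGA", "SFO", "ATL", "BOS", "IAH", "PHL"]

# Merge the 12 monthly CSV exports into a single DataFrame.
monthly_files = sorted(glob.glob("flights_2016_*.csv"))
flights = pd.concat((pd.read_csv(f) for f in monthly_files), ignore_index=True)

# Keep only flights between the ten airports in scope.
flights = flights[flights["ORIGIN"].isin(AIRPORTS) & flights["DEST"].isin(AIRPORTS)]

# The scheduled departure time may be read as an integer (e.g. 905) rather than
# a string ("0905"); force a consistent, zero-padded string, much like the extra
# conversion nodes in the Watson Studio flow.
flights["CRS_DEP_TIME"] = flights["CRS_DEP_TIME"].astype(int).astype(str).str.zfill(4)
flights["CRS_DEP_TIME"] = flights["CRS_DEP_TIME"].replace("2400", "0000")  # 24:00 edge case

# Build a timestamp for each flight, truncated to the hour.
flights["DEP_HOUR"] = pd.to_datetime(
    flights["FL_DATE"] + " " + flights["CRS_DEP_TIME"].str[:2] + ":00"
)
```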

Weather data — SPSS Modeler had an extension node from the SPSS Extensions Hub named TWCHistoricalGridded. That node took as input a csv file listing the 10 airports’ latitude and longitude coordinates and generated the historical hourly weather data for the entire year of 2016 for each airport location. This is one thing Watson Studio’s SPSS doesn’t have yet, so we used the data generated by SPSS Modeler as input. Other than running that extension node, this part was the same in Watson Studio as it was in SPSS Modeler.

Figure 3 — Gathering and preparing weather data in IBM SPSS
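
Since the TWCHistoricalGridded extension node isn’t available in Watson Studio yet, we simply read in the hourly weather file generated by SPSS Modeler. A minimal pandas sketch of that step might look like the following; the file name and column names (AIRPORT, OBS_TIME) are assumptions.

```python
import pandas as pd

# Load the hourly weather data generated by the TWCHistoricalGridded node in
# SPSS Modeler (file and column names here are assumptions).
weather = pd.read_csv("weather_2016_hourly.csv", parse_dates=["OBS_TIME"])

# Key each observation by airport code and the top of the hour, so it can be
# joined to the flight data later.
weather["OBS_HOUR"] = weather["OBS_TIME"].dt.floor("H")
weather = weather.drop_duplicates(subset=["AIRPORT", "OBS_HOUR"])
```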

Combined flights and weather data — To each flight in the first data set, we added two new columns, ORIGIN and DEST, containing the respective airport codes. Next, we merged the flight data and the weather data. Note: the “stars”, or SPSS super nodes, in Figure 4 are placeholders for the flows shown in Figures 2 and 3 above.

Figure 4 — Combining flights and weather data in IBM Watson Studio SPSS
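
Continuing the sketch, the merge itself could be done in pandas roughly as below: the hourly weather is joined to each flight twice, once for the origin airport and once for the destination, keyed on the scheduled departure hour. The column names are still assumptions, and using the departure hour for the destination weather is a simplification.

```python
# Join hourly weather to each flight for both the origin and the destination
# airport; prefixes keep the two sets of weather columns apart.
combined = (
    flights
    .merge(weather.add_prefix("ORIG_"),
           left_on=["ORIGIN", "DEP_HOUR"],
           right_on=["ORIG_AIRPORT", "ORIG_OBS_HOUR"],
           how="left")
    .merge(weather.add_prefix("DEST_"),
           left_on=["DEST", "DEP_HOUR"],
           right_on=["DEST_AIRPORT", "DEST_OBS_HOUR"],
           how="left")
)
```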

Data preparation, modeling, and evaluation

As in the earlier parts of this series, we iteratively performed the following steps until the desired model quality was reached:

  • Prepare data
  • Perform modeling
  • Evaluate the model
  • Repeat

Figure 5 shows the first and second iterations of our process in IBM Watson Studio SPSS.

Figure 5 — Iterations: prepare data, run models, evaluate — and do it again

First iteration

To start preparing the data, we used the combined flights and weather data from the previous step and performed some data cleanup (e.g. took care of null values). In order to better train the model later on, we filtered out rows where flight cancellations were not related to weather conditions (e.g. cancellations due to technical issues, security issues, etc.).
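
In code, this cleanup step might look roughly like the sketch below. The CANCELLATION_CODE values come from the BTS data dictionary (where “B” means weather); the weather column names are assumptions carried over from the earlier sketches.

```python
# Drop rows where the weather join produced no data (example column names).
combined = combined.dropna(subset=["ORIG_TEMP", "DEST_TEMP"])

# Keep flights that were either not cancelled, or cancelled for weather reasons;
# cancellations for carrier, NAS, or security reasons are filtered out so they
# don't confuse the model.
keep = (combined["CANCELLED"] == 0) | (combined["CANCELLATION_CODE"] == "B")
combined = combined[keep]
```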

This is an interesting use case, and often a hard one to solve, due to the imbalanced data it presents, as seen in Figure 6. By “imbalanced” we mean that there were far more non-cancelled flights in the historical data than cancelled ones. We discuss how we dealt with the imbalanced data in the second iteration below.

Figure 6 — Historical data: distribution of cancelled and non-cancelled flights in the input data

Next, we defined which features were required as inputs to the model (such as flight date, hour, day of the week, origin and destination airport codes, and weather conditions), and which feature was the target the model should predict (i.e. the cancellation status). We then partitioned the data into training and testing sets, using an 85/15 ratio.
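
A scikit-learn equivalent of this step might look like the following sketch; the exact feature list differs from the SPSS flow, and the weather column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Derive a simple hour-of-day feature from the departure timestamp.
combined["DEP_HOUR_OF_DAY"] = combined["DEP_HOUR"].dt.hour

# Candidate inputs (assumed names) and the target.
feature_cols = ["MONTH", "DAY_OF_WEEK", "DEP_HOUR_OF_DAY", "ORIGIN", "DEST",
                "ORIG_TEMP", "ORIG_WIND_SPEED", "ORIG_PRECIP",
                "DEST_TEMP", "DEST_WIND_SPEED", "DEST_PRECIP"]
X = pd.get_dummies(combined[feature_cols], columns=["ORIGIN", "DEST"])
y = combined["CANCELLED"].astype(int)

# 85/15 split, stratified so both partitions keep the rare cancelled class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
```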

We fed the partitioned data into an SPSS node called Auto Classifier. This node allowed us to run multiple models at once and compare their outputs, such as accuracy, as seen in Figure 7.

Figure 7 — Models output provided by the “Auto Classifier” node

That was a useful step in making an initial selection of a model for further refinement during subsequent iterations. We decided to use the Random Trees model, since the initial analysis showed it had the best area under the curve (AUC) compared to the other models in the list.
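
There is no Auto Classifier node outside SPSS, but a rough open-source stand-in is to fit a few candidate scikit-learn models on the (still imbalanced) training partition from the sketch above and compare a metric such as AUC:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Fit a handful of candidate models and compare their AUC on the test split,
# loosely mimicking what the Auto Classifier node reports.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```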

Second iteration

During the second iteration, we addressed the skewness of the original data. For that purpose, we chose an SPSS node called SMOTE (Synthetic Minority Over-sampling Technique). This node provides an advanced over-sampling algorithm that deals with imbalanced datasets, which helped our selected model work more effectively. In Figure 8, we can see a more balanced distribution between cancelled and non-cancelled flights after running the data through SMOTE.

Figure 8 — Distribution of cancelled and non-cancelled flights after using SMOTE
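
Outside SPSS, the same over-sampling technique is available in the imbalanced-learn library. A minimal sketch, applied only to the training partition so the test set keeps its natural distribution:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Over-sample the minority (cancelled) class in the training data only.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print(pd.Series(y_train).value_counts())      # heavily skewed toward non-cancelled
print(pd.Series(y_train_bal).value_counts())  # roughly balanced after SMOTE
```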

As mentioned earlier, we picked the Random Trees model for this sample solution. This SPSS node provides a model for tree-based classification and prediction that is built on Classification and Regression Tree (CART) methodology. This model is much less prone to overfitting, which makes it more likely to produce similar results on new data, that is, data that was not part of the original training and testing data sets. Another advantage of this method — in particular for our use case — is its ability to handle imbalanced data.
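
The Random Trees node itself is specific to SPSS, but a rough open-source analogue is a random forest, which is likewise an ensemble of CART-style trees. A sketch, trained on the SMOTE-balanced data from the previous step:

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest as a stand-in for the SPSS Random Trees node.
clf = RandomForestClassifier(
    n_estimators=300,
    class_weight="balanced",   # extra guard against any residual imbalance
    random_state=42,
)
clf.fit(X_train_bal, y_train_bal)
```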

Since this use case is a classification problem, we used two common ways to evaluate the performance of the model: the confusion matrix and the ROC curve. One of the outputs of running the Random Trees model in SPSS is the confusion matrix seen in Figure 9. The table shows the precision achieved by the model during training. In this case, the model’s precision was about 91% for predicting cancelled flights (true positives) and about 92% for predicting non-cancelled flights (true negatives). That means the model was correct most of the time, but also made wrong predictions about 8–9% of the time (false negatives and false positives).

Figure 9 — Confusion Matrix: model’s precision for cancelled vs. non-cancelled flights during training
Figure 10 — ROC curves for the training and testing datasets

This precision is what the model gave us for the training data set, and it is also reflected in the ROC curve on the left side of Figure 10. We can see, however, that the area under the curve for the training data set was better than the area under the curve for the testing data set (right side of Figure 10), which means that during testing the model did not perform as well as during training (that is, it presented a higher rate of false negatives and false positives). Nevertheless, we decided that the results were still good enough for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to further refine this model or even to try other models on this use case.
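
The same two checks are easy to reproduce in scikit-learn. This sketch prints the confusion matrix for the held-out test set and compares training and testing AUC, which is where the overfitting gap described above would show up:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Confusion matrix on the held-out test partition.
y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"true negatives={tn}  false positives={fp}  "
      f"false negatives={fn}  true positives={tp}")

# Training vs. testing area under the ROC curve.
auc_train = roc_auc_score(y_train_bal, clf.predict_proba(X_train_bal)[:, 1])
auc_test = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC (training) = {auc_train:.3f}, AUC (testing) = {auc_test:.3f}")
```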

Deploying the model

Deployment from Watson Studio Desktop hasn’t come to the beta product just yet, but the development team told us that we can export the stream file and import that into Watson Studio Cloud or IBM Cloud Private for Data and deploy from there.

Conclusion

IBM Watson Studio Desktop SPSS is yet another powerful tool that helped us visually create a solution for this use case without writing a single line of code. We were able to follow an iterative process that helped us understand and prepare the data, then model and evaluate the solution. Currently, we have to export the flow and deploy it from one of IBM’s cloud offerings to get an API that application developers can consume, but we’ve been told that desktop deployment is coming.

Resources

The IBM SPSS Modeler stream and data used as the basis for this blog are available here: https://github.com/IBM-DSE/flights-cancellation
