Unfriendly Skies: Predicting Flight Cancellations Using Weather Data, Part 1

Tim Bohn
Inside Machine learning
Aug 7, 2017

Tim Bohn and Ricardo Balduino

Wright Brothers, Creative Commons

This is the first in a series of blog posts in which we’ll explore a use case and a few different machine learning platforms to see how we might build a model that helps predict flight cancellations. In part one, we’ll talk about the use case, how and why we limited the scenario, and the data we gathered to start the data science / machine learning process.

Use Case

For our use case, we chose flight cancellations and weather data for a few different reasons. We wanted a project that…

• Would already have a reasonably large amount of data, but not so much that we’d need more than a laptop to process it.

• Would require federating data from more than one source.

• Would require the various steps of a real data science/machine learning project. CRISP-DM (the Cross-Industry Standard Process for Data Mining) is one such process.

Wikimedia Commons license: Kenneth Jensen

Many people think that “training” a model is all that a machine learning project consists of. It doesn’t take much reading about data science to learn that steps like data collection, data preparation, data exploration and data engineering can take up most of your time on such a project. So, we wanted a use case and data set that required all of them.

And so, we decided to see how well we can predict airline flight cancellations if we include weather data with historical flight data. This required all of the things we were looking for, but also ended up including something we didn’t think about ahead of time: the fact that this data is heavily imbalanced. Specifically, of all the flights in the data set only a small percentage are actually cancelled.

This pushed us to learn how to deal with heavily imbalanced classes in our data. The first lesson is that, for this problem, “accuracy” is a terrible measure: a model that simply predicts no flight will ever be cancelled scores great accuracy, yet it is useless because it never catches a cancellation. We needed to look at measures like the confusion matrix, precision, recall and ROC curves. Next, we wanted to try different algorithms and techniques such as oversampling and undersampling, penalizing misclassification of our rare class, and a few others like the SMOTE algorithm. Heavily imbalanced data makes analysis difficult, but we realized that it’s also pretty common in real-life scenarios.
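To see why accuracy misleads on imbalanced data, here is a minimal Python sketch. The numbers are made up for illustration (1,000 flights with a 2% cancellation rate is our assumption, not the rate in the real data set), and the function names are hypothetical. It scores a "model" that never predicts a cancellation.

```python
# Hypothetical toy data: 1,000 flights, 20 of them cancelled (2%).
# This rate is an assumption for illustration, not the real data set's rate.
actual = [1] * 20 + [0] * 980        # 1 = cancelled, 0 = not cancelled
always_on_time = [0] * len(actual)   # baseline that never predicts a cancellation


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 (cancelled) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision is undefined when nothing is predicted positive;
    # reporting 0.0 in that case is a convention chosen for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


baseline = summarize(actual, always_on_time)
print(baseline)
```

The baseline is right about 98% of the flights, yet its recall is zero: it misses every single cancellation, which is exactly the class we care about.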

Daniel Schwen/Wikimedia

Limiting the Scope

We decided that running our analysis for every airport in the world would be too big in scope. Even limiting it to airports in the USA would be more than we needed for our project. So we limited ourselves to the 10 U.S. airports most affected by weather. That left us with a manageable amount of data, and we suspected the data itself would be less imbalanced, since weather-prone airports would presumably see a higher share of cancellations. A quick search gave us this site, 10 Most Weather-Delayed U.S. Major Airports, and the 10 airports we would use.

Data Collection

To get US flight data, we used this United States Department of Transportation site, whose filters let us isolate the features we wanted. Unfortunately, the site can only deliver data one month at a time, so we had to gather twelve separate files for the twelve months of 2016. That increased the complexity of the data engineering, since we first had to merge the twelve files and then filter out all but the 10 desired airports. Not difficult, but a real-world task. The twelve files held over 5 million records, far more than the 1,048,576 rows an Excel worksheet can hold, so it wasn’t something that could be done in Excel.
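As a rough illustration of that merge-then-filter step, here is a sketch using only Python’s standard library. The column names (`FL_DATE`, `ORIGIN`, `CANCELLED`), the airport codes and the function name are placeholders we made up for the example, and filtering on the origin airport alone is an assumption. The tiny in-memory files stand in for the monthly downloads.

```python
import csv
import io


def merge_and_filter(monthly_files, airports, origin_column="ORIGIN"):
    """Yield rows from each monthly file whose origin airport is in `airports`.

    Rows are streamed one at a time, so the full set of records never has
    to sit in memory at once.
    """
    for handle in monthly_files:
        for row in csv.DictReader(handle):
            if row[origin_column] in airports:
                yield row


# In-memory stand-ins for two of the twelve monthly files (made-up rows).
january = io.StringIO(
    "FL_DATE,ORIGIN,CANCELLED\n"
    "2016-01-05,AAA,0\n"
    "2016-01-06,ZZZ,1\n"
)
february = io.StringIO(
    "FL_DATE,ORIGIN,CANCELLED\n"
    "2016-02-10,BBB,1\n"
)

# Placeholder airport codes, not the actual list of 10 airports.
selected_airports = {"AAA", "BBB"}

flights = list(merge_and_filter([january, february], selected_airports))
print(len(flights), "rows kept")
```

With real files, the same function could be fed handles opened with `open(path, newline="")` for each of the twelve months.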

Next, we used The Weather Company API to get historical weather data for those 10 airport sites, for 2016. Our plan was to combine these two data sources as a part of the data preparation and data engineering.
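One plausible way to combine the two sources is to match each flight to the weather observed at its airport during the same hour. The sketch below assumes that join key, and every field name in it (`origin`, `scheduled_departure`, `airport`, `observed_at`, `wind_speed`) is hypothetical; it does not reflect the actual schema of either data source.

```python
from datetime import datetime


def hour_key(airport, timestamp):
    """Build a join key: airport code plus the timestamp truncated to the hour."""
    return airport, timestamp.replace(minute=0, second=0, microsecond=0)


def attach_weather(flights, weather_records):
    """Return copies of the flights with the matching hour's wind speed added.

    Flights with no matching weather record get wind_speed=None, so missing
    data stays visible for the data preparation step.
    """
    lookup = {hour_key(w["airport"], w["observed_at"]): w for w in weather_records}
    joined = []
    for flight in flights:
        weather = lookup.get(hour_key(flight["origin"], flight["scheduled_departure"]))
        merged = dict(flight)
        merged["wind_speed"] = weather["wind_speed"] if weather else None
        joined.append(merged)
    return joined


# Made-up sample records for illustration only.
flights_sample = [
    {"origin": "AAA", "scheduled_departure": datetime(2016, 1, 5, 14, 35)},
    {"origin": "BBB", "scheduled_departure": datetime(2016, 1, 5, 9, 10)},
]
weather_sample = [
    {"airport": "AAA", "observed_at": datetime(2016, 1, 5, 14, 0), "wind_speed": 22.0},
]

joined = attach_weather(flights_sample, weather_sample)
print(joined)
```

Matching on the hour is only one option; a daily summary per airport would be a coarser but simpler key.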

Chin Tin Tin/Wikipedia/Creative Commons

Goals

Our goal for this use case was to come up with an exercise for creating a machine learning model using a few different platforms.

In the next post in the series, we’ll use IBM’s SPSS Modeler, which is ideal for beginners because of its visual interface, its many machine learning algorithms (including one that finds the best algorithm to use), and its easy ways to explore, prepare and transform data.

In the third post, we’ll try replicating our efforts using IBM’s Watson Studio Cloud platform with Watson Machine Learning (WML). Creating a Jupyter notebook in Python might give us more flexibility in code than the GUI in SPSS. Admittedly, it’s also likely a harder task if you’re not a wizard with Python, so it could take a bit longer. WML is still in beta but we’ll see what it can do.

For the final post, we’ll try converting the SPSS model we built first into a “flow”, a new capability coming soon to IBM’s Watson Studio that provides SPSS Modeler capabilities directly within Watson Studio. Trying to recreate our original SPSS model using a flow in the cloud should prove interesting.

To be clear, we aren’t trying to create a production-quality model. That would require a lot more work and time. Instead, we want to create something that works reasonably well and that can be built using the different platforms described. At the same time, with a little more work and expertise, the project could possibly be tuned to the point of being production quality. If so, we can imagine hotels using it to generate real-time advertising in the airports where it predicts flights will be cancelled. Or Uber might use it to gear up more cars for stranded passengers. Or perhaps the airport itself could use the model to prepare better for cancellations and offer better experiences for flyers. Let us know of any other ideas that come to mind.

Continue to part 2 where we’ll talk about our process for IBM SPSS Modeler.

Tim Bohn
Inside Machine learning

Data Scientist and Sr. Solution Architect, IBM Data Science Elite team. Travel (50+ countries) and Technology. Tweets are personal opinions.