Unfriendly Skies: Predicting Flight Cancellations Using Weather Data, Part 2

Ricardo Balduino
Inside Machine Learning
8 min read · Aug 14, 2017

Ricardo Balduino and Tim Bohn

Early Flight, Creative Commons

Introduction

As we described in Part 1 of this series, our objective is to help predict the probability of the cancellation of a flight between two of the ten U.S. airports most affected by weather conditions. We use historical flights data and historical weather data to make predictions for upcoming flights.

Over the course of this four-part series, we use different platforms to help us with those predictions. Here in Part 2, we use the IBM SPSS Modeler and APIs from The Weather Company.

Tools used in this use case solution

IBM SPSS Modeler is designed to help discover patterns and trends in structured and unstructured data with an intuitive visual interface supported by advanced analytics. It provides a range of advanced algorithms and analysis techniques, including text analytics, entity analytics, decision management and optimization to deliver insights in near real-time. For this use case, we used SPSS Modeler 18.1 to create a visual representation of the solution, or in SPSS terms, a stream. That’s right — not one line of code was written in the making of this blog.

We also used The Weather Company APIs to retrieve historical weather data for the ten airports over the year 2016. IBM SPSS Modeler supports calling the weather APIs from within a stream. That is accomplished by adding extensions to SPSS, available in the IBM SPSS Predictive Analytics resources page, a.k.a. Extensions Hub.

A proposed solution

In this blog, we propose one possible solution for this problem. It's not meant to be the only or the best possible solution, or a production-level solution for that matter, but the discussion presented here covers the typical iterative process (described in the sections below) that helps us accumulate insights and refine the predictive model across iterations. We encourage readers to come up with different solutions, and to provide us with feedback for future blogs.

Business and data understanding

The first step of the iterative process includes understanding and gathering the data needed to train and test our model later.

Flights data — We gathered 2016 flights data from the US Bureau of Transportation Statistics website. The website allows us to export one month at a time, so we ended up with twelve CSV (comma-separated values) files. We used IBM SPSS Modeler to merge all the CSV files into one set and to select the ten airports in our scope. Some data clean-up and formatting was done to validate dates and hours for each flight, as seen in Figure 1.

Figure 1 — gathering and preparing flights data in IBM SPSS Modeler
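Although the SPSS stream itself required no code, the same consolidation step can be sketched in pandas for readers who want a programmatic equivalent. The helper name is hypothetical, and the column names (FL_DATE, ORIGIN, DEST, CANCELLED) follow the usual BTS export but should be verified against your own download:

```python
import io
import pandas as pd

AIRPORTS = {"ORD", "EWR", "SFO"}  # three of the ten in-scope airports, for illustration

def merge_monthly_exports(csv_files, airports):
    """Concatenate monthly BTS exports and keep flights between in-scope airports."""
    frames = [pd.read_csv(f, parse_dates=["FL_DATE"]) for f in csv_files]
    flights = pd.concat(frames, ignore_index=True)
    mask = flights["ORIGIN"].isin(airports) & flights["DEST"].isin(airports)
    return flights[mask].reset_index(drop=True)

# Two tiny in-memory "monthly files" standing in for the twelve real exports
jan = io.StringIO("FL_DATE,ORIGIN,DEST,CANCELLED\n2016-01-02,ORD,EWR,0\n2016-01-03,MIA,ORD,1\n")
feb = io.StringIO("FL_DATE,ORIGIN,DEST,CANCELLED\n2016-02-05,SFO,ORD,0\n")
flights = merge_monthly_exports([jan, feb], AIRPORTS)
print(len(flights))  # the MIA row is filtered out, leaving 2 flights
```

With the real data, the twelve monthly file paths would be passed in place of the in-memory buffers.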

Weather data — From the Extensions Hub, we added the TWCHistoricalGridded extension to SPSS Modeler, which made the extension available as a node in the tool. That node took as input a CSV file listing the latitude and longitude coordinates of the ten airports, and generated the historical hourly data for the entire year of 2016, for each airport location, as seen in Figure 2.

Figure 2 — gathering and preparing weather data in IBM SPSS Modeler

Combined flights and weather data — To each flight in the first data set, we added two new columns: ORIGIN and DEST, containing the respective airport codes. Next, the flight and weather data sets were merged. Note: the "stars" or SPSS super nodes in Figure 3 are placeholders for the diagrams in Figures 1 and 2 above.

Figure 3 — combining flights and weather data in IBM SPSS Modeler
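The join performed inside the super nodes can be approximated in pandas as a left merge on airport code and local date/hour. All column names here are illustrative assumptions, not the exact fields used in the SPSS stream:

```python
import pandas as pd

# Minimal stand-ins for the flights and hourly weather tables
flights = pd.DataFrame({
    "ORIGIN": ["ORD", "EWR"],
    "FL_DATE": ["2016-01-02", "2016-01-02"],
    "DEP_HOUR": [9, 14],
    "CANCELLED": [0, 1],
})
weather = pd.DataFrame({
    "AIRPORT": ["ORD", "EWR"],
    "DATE": ["2016-01-02", "2016-01-02"],
    "HOUR": [9, 14],
    "WIND_SPEED": [12.0, 38.0],
    "SNOW_1HR": [0.0, 2.5],
})

combined = flights.merge(
    weather,
    left_on=["ORIGIN", "FL_DATE", "DEP_HOUR"],
    right_on=["AIRPORT", "DATE", "HOUR"],
    how="left",  # keep every flight even if an hourly observation is missing
)
```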

Data preparation, modeling, and evaluation

We iteratively performed the following steps until the desired model qualities were reached:

· Prepare data

· Perform modeling

· Evaluate the model

· Repeat

Figure 4 shows the first and second iterations of our process in IBM SPSS Modeler.

Figure 4 — iterations: prepare data, run models, evaluate — and do it again

First iteration

To start preparing the data, we used the combined flights and weather data from the previous step and performed some data cleanup (e.g., handling null values). In order to better train the model later on, we filtered out rows where flight cancellations were not related to weather conditions (e.g., cancellations due to technical issues, security issues, etc.).

Figure 5 — imbalanced data found in our input data set

This is an interesting use case, and often a hard one to solve, due to the imbalanced data it presents, as seen in Figure 5. By “imbalanced” we mean that there were far more non-cancelled flights in the historical data than cancelled ones. We will discuss how we dealt with imbalanced data in the following iteration.

Next, we defined which features were required as inputs to the model (such as flight date, hour, day of the week, origin and destination airport codes, and weather conditions), and which one was the target to be generated by the model (i.e. predict the cancellation status). We then partitioned the data into training and testing sets, using an 85/15 ratio.
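An equivalent feature/target definition and 85/15 partition can be sketched with scikit-learn. The tiny synthetic table and its column names are placeholders for the real combined data set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined flights + weather table (100 rows)
data = pd.DataFrame({
    "DAY_OF_WEEK": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3] * 10,
    "DEP_HOUR": list(range(10)) * 10,
    "WIND_SPEED": [5, 40, 10, 8, 35, 6, 7, 9, 42, 11] * 10,
    "CANCELLED": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0] * 10,
})

X = data.drop(columns="CANCELLED")  # model inputs (features)
y = data["CANCELLED"]               # target to predict

# 85/15 split, mirroring the partition used in the SPSS stream
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 85 15
```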

The partitioned data was fed into an SPSS node called Auto Classifier. This node allowed us to run multiple models at once and preview their outputs, such as the area under the ROC curve, as seen in Figure 6.

Figure 6 — models output provided by the Auto Classifier node

That was a useful step in making an initial selection of a model for further refinement during subsequent iterations. We decided to use the Random Trees model since the initial analysis showed it had the best area under the curve compared to the other models in the list.
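The Auto Classifier node has no direct open-source equivalent, but the same idea, fitting several candidate models and comparing their cross-validated area under the ROC curve, can be sketched with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the flight/weather table (~5% positives)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in candidates.items():
    # Mean area under the ROC curve across 5 cross-validation folds
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```

The model with the highest AUC would then be carried into the next iteration, much as the Random Trees model was here.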

Second iteration

During the second iteration, we addressed the imbalance of the original data. For that purpose, we chose one of the SPSS nodes called SMOTE (Synthetic Minority Over-sampling Technique). This node provides an advanced over-sampling algorithm that deals with imbalanced datasets, which helped our selected model work more effectively.

Figure 7 — distribution of cancelled and non-cancelled flights after using SMOTE

In Figure 7, we notice a more balanced distribution between cancelled and non-cancelled flights after running the data through SMOTE.

As mentioned earlier, we picked the Random Trees model for this sample solution. This SPSS node provides a model for tree-based classification and prediction that is built on Classification and Regression Tree methodology. Because it builds an ensemble of trees rather than a single tree, this model is much less prone to overfitting, which gives a higher likelihood of repeating the same test results when you use new data, that is, data that was not part of the original training and testing data sets. Another advantage of this method, in particular for our use case, is its ability to handle imbalanced data.

Since this use case involves classification analysis, we used two common ways to evaluate the performance of the model: the confusion matrix and the ROC curve. One of the outputs of running the Random Trees model in SPSS is the confusion matrix seen in Figure 8. The table shows the precision achieved by the model during training.

Figure 8 — Confusion Matrix for cancelled vs. non-cancelled flights

In this case, the model’s precision was about 95% for predicting cancelled flights (true positives), and about 94% for predicting non-cancelled flights (true negatives). That means the model was correct most of the time, but also made wrong predictions about 4–5% of the time (false negatives and false positives).
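These per-class figures can be recomputed from confusion matrix counts. The counts below are illustrative, chosen to reproduce roughly the 95% and 94% values reported, not the actual numbers behind Figure 8:

```python
import numpy as np

# Confusion matrix laid out as [[TN, FP], [FN, TP]] -- illustrative counts only
cm = np.array([[950, 50],
               [60, 940]])
tn, fp = cm[0]
fn, tp = cm[1]

precision_cancelled = tp / (tp + fp)      # how often a predicted cancellation was real
precision_non_cancelled = tn / (tn + fn)  # how often a predicted non-cancellation was real
print(round(precision_cancelled, 3), round(precision_non_cancelled, 3))  # 0.949 0.941
```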

That was the precision given by the model using the training data set. This is also represented by the ROC curve on the left side of Figure 9. We can see, however, that the area under the curve for the training data set was better than the area under the curve for the testing data set (right side of Figure 9), which means that during testing, the model did not perform as well as during training (i.e. it presented a higher rate of errors, or higher rate of false negatives and false positives).

Figure 9 — ROC curves for the training and testing data sets

Nevertheless, we decided that the results were still good for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to further refine this model or even to use other models that could solve this use case.

Deploying the model

Finally, we deployed the model as a REST API that developers can call from their applications. For that, we created a “deployment branch” in the SPSS stream. Then, we used the IBM Watson Machine Learning service available on IBM Bluemix. We imported the SPSS stream into the Bluemix service, which generated a scoring endpoint (or URL) that application developers can call. Developers can also call The Weather Company APIs directly from their application code to retrieve the forecast data for the next day, week, and so on, in order to pass the required data to the scoring endpoint and make the prediction.

A typical scoring endpoint provided by the Watson Machine Learning service would look like the URL shown below.

https://ibm-watson-ml.mybluemix.net/pm/v1/score/flights-cancellation?accesskey=<provided by WML service>

By passing the expected JSON body that includes the required inputs for scoring (such as the future flight data and forecast weather data), the scoring endpoint above returns whether a given flight is likely to be cancelled or not. This is seen in Figure 10, which shows a call being made to the scoring endpoint — and its response — using an HTTP requester tool available in a web browser.

Figure 10 — actual request URL, JSON body, and response from scoring endpoint

Notice in the JSON response above that the deployed model predicted this particular flight from Newark to Chicago would be 88.8% likely to be cancelled, based on forecast weather conditions.
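The same call can be made programmatically. The JSON field names below (tablename, header, data) follow the general shape of Watson Machine Learning scoring requests of that era, but the exact schema is defined by the deployed stream, so treat the field names, values, and URL as assumptions:

```python
import json

# Hypothetical scoring payload: one row of future flight data plus forecast weather
payload = {
    "tablename": "scoring",
    "header": ["ORIGIN", "DEST", "FL_DATE", "DEP_HOUR", "WIND_SPEED", "SNOW_1HR"],
    "data": [["EWR", "ORD", "2017-08-15", 18, 35.0, 0.0]],
}
body = json.dumps(payload)

# The actual call would POST the body to the scoring endpoint, e.g.:
# import requests
# url = "https://ibm-watson-ml.mybluemix.net/pm/v1/score/flights-cancellation?accesskey=..."
# resp = requests.post(url, data=body, headers={"Content-Type": "application/json"})
# print(resp.json())  # includes the predicted cancellation outcome and probability
```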

Conclusion

IBM SPSS Modeler is a powerful tool that helped us visually create a solution for this use case without writing a single line of code. We were able to follow an iterative process that helped us understand and prepare the data, then model and evaluate the solution, to finally deploy the model as an API for consumption by application developers.

Resources

The IBM SPSS stream and data used as the basis for this blog are available on GitHub. There you can also find instructions on how to download IBM SPSS Modeler, get a key for The Weather Company APIs, and much more.
