Unfriendly Skies: Predicting Flight Cancellations Using Weather Data, Part 3

Tim Bohn and Ricardo Balduino
Inside Machine Learning · Dec 13, 2017

Piarco Airport, Trinidad in the 1950s, Copyright John Hill, Creative Commons Attribution-Share Alike 4.0

In Part 1 of this series, we wrote about our goal: to explore a single use case across various machine learning platforms and see how we might use each to build classification models that predict flight cancellations. Specifically, we hoped to predict the probability of cancellation for flights between the ten U.S. airports most affected by weather, using historical flight data and historical weather data to make predictions for upcoming flights.

In Part 2, we started our exploration with IBM SPSS Modeler and APIs from The Weather Company. With this post, we look at IBM’s Watson Studio.

Tools used in this use case solution

Watson Studio is a collaborative platform for data scientists, built on open-source components plus IBM value adds, and available in the cloud or on-premises. In the simplest terms, Watson Studio is a managed Apache Spark cluster with a notebook front end. By default, it includes integration with data tools like a data catalog and data refinery, Watson Machine Learning services, collaboration capability, model management, and the ability to automatically review a model’s performance and refresh/retrain the model with new data, and IBM is quickly adding more capabilities. Read here to see what IBM is doing lately for data science.

A Python notebook solution

In this case, we followed roughly the same steps we used in the SPSS model from Part 2, only this time we wrote Python code in a Jupyter notebook to get similar results. We encourage readers to come up with their own solutions and let us know; we’d love to feature your approaches in future blog posts.

The first step of the iterative process is gathering and understanding the data needed to train and test our model. Since we had already done this work for Part 2, we reused that analysis here.

Flights data — We gathered data for 2016 flights from the US Bureau of Transportation Statistics website. The website allowed us to export one month at a time, so we ended up with twelve CSV (comma-separated value) files. Importing those as dataframes and merging them into a single dataframe was straightforward, as the sketch below shows.
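A minimal sketch of that step, assuming hypothetical file names for the monthly exports:

```python
import pandas as pd

# Read the twelve monthly exports (file names are assumptions) and stack
# them into a single dataframe covering all 2016 flights.
monthly = [pd.read_csv(f"flights_2016_{m:02d}.csv") for m in range(1, 13)]
flights = pd.concat(monthly, ignore_index=True)
print(flights.shape)
```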

Figure 1 — Gathering and preparing flight data in IBM DSX

Weather data — With the latitude and longitude of the 10 Most Weather-Delayed U.S. Major Airports, we used one of The Weather Company’s APIs to get the historical hourly weather data for all of 2016 for each of the 10 airport locations, and created a CSV file that became our data set in the notebook.
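The exact endpoint and response format belong to The Weather Company’s API, so the sketch below uses a hypothetical URL and field names to show only the general shape of the collection loop:

```python
import requests
import pandas as pd

# Latitude/longitude for the ten airports (two shown here as examples).
airports = {"ORD": (41.97, -87.90), "EWR": (40.69, -74.17)}
API_KEY = "YOUR_API_KEY"  # placeholder

records = []
for code, (lat, lon) in airports.items():
    # Hypothetical endpoint and parameters, standing in for the real API.
    resp = requests.get(
        "https://api.weather.example/v1/history/hourly",
        params={"lat": lat, "lon": lon, "startDate": "20160101",
                "endDate": "20161231", "apiKey": API_KEY},
    )
    for obs in resp.json().get("observations", []):
        obs["AIRPORT"] = code  # tag each hourly record with its airport
        records.append(obs)

weather = pd.DataFrame(records)
weather.to_csv("weather_2016.csv", index=False)
```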

Combined flights and weather data — To each flight in the first data set, we added two new columns: ORIGIN and DEST, containing the respective airport codes. Next, we merged flight data and the weather data so that the resulting dataframe contained the flight data along with the weather for the corresponding Origin and Destination airports.
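A sketch of that two-stage merge; the timestamp columns (DEP_HOUR, ARR_HOUR, OBS_TIME) and the exact join keys are assumptions:

```python
# Build origin- and destination-flavored copies of the weather frame so
# each flight row picks up weather for both endpoints of the route.
origin_wx = weather.add_prefix("ORIGIN_").rename(
    columns={"ORIGIN_AIRPORT": "ORIGIN", "ORIGIN_OBS_TIME": "DEP_HOUR"})
dest_wx = weather.add_prefix("DEST_").rename(
    columns={"DEST_AIRPORT": "DEST", "DEST_OBS_TIME": "ARR_HOUR"})

combined = (flights
            .merge(origin_wx, on=["ORIGIN", "DEP_HOUR"], how="left")
            .merge(dest_wx, on=["DEST", "ARR_HOUR"], how="left"))
```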

Data preparation, modeling, and evaluation

To start preparing the data, we used the combined flights and weather data from the previous step and performed some cleanup. We deleted columns of features that we didn’t need, and replaced null values in rows where flight cancellations were not related to weather conditions.

Next, we took the features we discovered when we created a model using SPSS (such as flight date, hour, day of the week, origin and destination airport codes, and weather conditions) and used them as inputs to our Python model. We also chose the target feature for the model to predict: the cancellation status. We deleted the remaining features. A sketch of this step follows.
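Here is a minimal sketch of the cleanup and feature selection; the column names are assumptions rather than the exact ones in our notebook:

```python
import pandas as pd

df = combined.copy()

# Derive a simple date-based feature from the flight date (assumed column).
df["MONTH"] = pd.to_datetime(df["FL_DATE"]).dt.month

# Keep only the features identified during the SPSS iteration, plus the target.
features = ["MONTH", "DAY_OF_WEEK", "HOUR", "ORIGIN", "DEST",
            "ORIGIN_TEMP", "ORIGIN_WIND_SPEED", "ORIGIN_PRECIP",
            "DEST_TEMP", "DEST_WIND_SPEED", "DEST_PRECIP"]
target = "CANCELLED"

# Treat cancellations unrelated to weather as not cancelled for our target.
df[target] = df[target].fillna(0).astype(int)

X = df[features]
y = df[target]
```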

Next, we ran OneHotEncoder on the four categorical features. One-hot encoding is a process by which categorical features get converted into a format that works better with certain algorithms, like classification and regression. Figure 2 shows the number of feature columns, expanded significantly by one-hot encoding.

Figure 2 — One-hot encoding expands 4 feature columns into many more
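A sketch of the encoding step; we show pandas’ get_dummies, which produces the same expansion as scikit-learn’s OneHotEncoder (the four column names are assumptions):

```python
import pandas as pd

# Each distinct airport code, weekday, or hour becomes its own 0/1 column,
# which is why the feature count grows so sharply.
categorical = ["DAY_OF_WEEK", "HOUR", "ORIGIN", "DEST"]
X_encoded = pd.get_dummies(X, columns=categorical)
print(X.shape[1], "->", X_encoded.shape[1])
```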

Interestingly, the flight data is heavily imbalanced. Specifically, as seen in Figure 3, of all the flights in the data set only a small percentage are actually cancelled.

Figure 3 — Historical data: distribution of cancelled (1) and non-cancelled (0) flights
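That imbalance is easy to quantify from the target series built above:

```python
# Fraction of non-cancelled (0) vs. cancelled (1) flights.
print(y.value_counts(normalize=True))
```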

To address that imbalance in the original data, we tried oversampling the minority class, undersampling the majority class, and a combination of both, but none of these approaches worked well. We then tried SMOTE (Synthetic Minority Over-sampling Technique), an advanced over-sampling algorithm for imbalanced data sets. Because it generates synthetic examples rather than simply replicating existing ones, it helped our selected model work more effectively by mitigating the overfitting that random oversampling can cause. SMOTE isn’t considered effective for high-dimensional data, but that wasn’t an issue here.

In Figure 4, we notice a balanced distribution between cancelled and non-cancelled flights after running the data through SMOTE.

Figure 4 — Distribution of cancelled and non-cancelled flights after using SMOTE

It’s important to mention that we applied SMOTE only to the training data set, not the test data set, as the sketch below shows. A detailed blog post by Nick Becker guided our choices in the notebook.
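A minimal sketch of that split-then-oversample order, using imbalanced-learn’s SMOTE (the split fraction and seeds are assumptions):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first, then oversample only the training portion so the test set
# keeps its real-world class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.3, stratify=y, random_state=42)

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
```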

At this point, we used the Random Forest Classifier for our model. It performed best when we used SPSS, so we used it again in our notebook. We have several ideas for a second iteration of the model in order to tune it, one of which is to try multiple algorithms and see how they compare.
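Training is then just a couple of lines; the hyperparameters below are defaults plus a seed, not a tuned configuration:

```python
from sklearn.ensemble import RandomForestClassifier

# Fit the forest on the SMOTE-balanced training data.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_res, y_train_res)
```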

Since this use case deals with classification, we used some of the common ways to evaluate the performance of such a model: the confusion matrix, F1 score, and ROC curve, among others. Figures 5 and 6 show the results.

Figure 5 — Test/Validation Results
Figure 6 — ROC curve for training data set

Figure 6 is the ROC curve for the training data set. Figure 5 shows that the results from the training and test data sets are quite close, which is a good indication of consistency, though we realize that with some tuning they could get better. Nevertheless, we decided that the results were good enough for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to refine the model further or even to use other models to solve this use case.
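For reference, a sketch of how those evaluation numbers might be computed on the held-out test set with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]  # probability of cancellation

print(confusion_matrix(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```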

Conclusion

This project compared building the same model in IBM SPSS Modeler and in IBM Watson Studio. SPSS offers a no-code experience, while Watson Studio offers the best of open-source coding capability with many IBM value adds. SPSS is an amazing product and gets better with every release, adding many new capabilities.

IBM’s Watson Studio is a great platform for both the beginning and the experienced data scientist. Anyone can log in and have immediate access to a managed Spark cluster, with a choice of a Jupyter notebook front end using Scala, Python, or R; SPSS; or a visual data modeler (no coding). It offers easy collaboration with other users, including adding other data scientists who could then look over our shoulders and make suggestions. The community is active and has already contributed dozens of tutorials, data sets, and notebooks. If we had added Watson Machine Learning, we could very easily have deployed and managed our model with an instant REST endpoint to call from any application. If our data were changing, we could have WML review our model periodically and retrain it with new data if our metric (ROC curve) value fell below a given threshold. That, along with the data cataloging and data refinery tooling added recently, makes this a platform worth checking out for any data science project.

SPSS has a lot, but not everything. Writing the Python code in a notebook was a bit more time-consuming than what we did in SPSS, but it also gave us quite a bit more flexibility and freedom. We had access to everything in the Python libraries, and of course, one of the benefits of Python as an open-source language is the trove of helpful examples.

I would say both platforms have their place, and neither can claim to be better for everything. Those doing data science for the first time will probably find SPSS an easier place to start given its drag-and-drop user interface. Those who have come out of school as programming wizards will want to write code, and DSX will give them a great way to do that without worrying about installing, configuring, and correctly integrating various product versions.

Resources

The IBM notebook and data that form the basis for this blog are available on GitHub.
