Which features impact a travel package cancellation?

Feb 22, 2023

The travel sector is a crucial part of the global economy. Before the Covid-19 pandemic, tourism accounted for 1 in 4 of all new jobs created globally and 10% of global Gross Domestic Product (GDP), according to the World Travel & Tourism Council (WTTC)’s Economic Impact Reports. The pandemic substantially impacted tourism, leading to high unemployment rates and a massive drop in its contribution to global GDP.

Right after the pandemic, the need for flexible travel products increased, not only in financial terms but also in terms of personalization. Companies that already had this kind of product in their catalogs got ahead.

This article shares an exploratory data analysis (EDA) of a product called travel package, sold by a Brazilian Online Travel Agency (OTA). The analysis focuses on cancellations and tries to answer business questions such as:

Question 1: Are international travels more canceled than national ones?
Question 2: Is the number of status changes related to the cancellation process?
Question 3: Which are the features that have the most impact on cancellation?

The travel packages product

Travel packages are a type of product that combines hotel reservations and flight tickets. When the package already has a specific date for the client to travel, it’s called a ‘fixed date package.’ When it’s not attached to a particular date, it’s called a ‘flexible date package.’

Flexible date packages let the client fill out a form indicating three desired travel dates. With competitive prices and flexible payment options, they are a top-selling product. This product usually has a long validity period, reaching two years in many cases, and flexible cancellation policies.

Hotel reservation and flight ticket booking commonly take place within the 45 days before the first desired date. This process is called ‘operation.’

Due to the vast time range between the order and the travel itself, financial planning is complex. Additionally, the uncertainty of the cancellation volume adds to this complexity.

Diving into the analysis

The data covers purchases from January 2019 onwards and includes only flexible date packages. All string features were anonymized for privacy.

After data cleaning and preprocessing, we have these features:

We won’t cover every feature; instead, we’ll focus on the main findings and insights.

We can look at the cancellation feature below to start understanding our data. The canceled orders represent almost 30% of the cases in the analyzed dataset.
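As a minimal sketch of how such a share can be computed, the snippet below counts labels and divides by the total. It uses only the Python standard library, and the sample labels are made up for illustration; they are not the article's data.

```python
from collections import Counter

# Hypothetical sample: one label per order (not the real dataset).
order_status = [
    "canceled", "not_canceled", "not_canceled", "canceled", "not_canceled",
    "not_canceled", "not_canceled", "not_canceled", "canceled", "not_canceled",
]

def share_by_label(labels):
    """Return the fraction of rows that carries each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

shares = share_by_label(order_status)
print(shares)
```

With a pandas Series, `value_counts(normalize=True)` gives the same kind of result.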

According to the figure below, two states are the main origins of the purchases, and one of them accounts for more than 50% of all orders.

Proportionally, however, the top 5 best sellers have very similar cancellation rates.

Cancellation proportion by origin state.
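A proportion like the one in this plot can be sketched as a group-wise rate: count the orders per category, count the canceled ones, and divide. The function name, the `top_n` parameter and the sample rows below are assumptions made for illustration; the real state names were anonymized.

```python
from collections import defaultdict

# Hypothetical (origin_state, is_canceled) pairs, not the real dataset.
orders = [
    ("state_a", True), ("state_a", False), ("state_a", False), ("state_a", False),
    ("state_b", True), ("state_b", False),
    ("state_c", False),
]

def cancellation_rate_by_group(rows, top_n=None):
    """Cancellation rate per group, ordered by order volume (largest first)."""
    totals = defaultdict(int)
    canceled = defaultdict(int)
    for group, is_canceled in rows:
        totals[group] += 1
        if is_canceled:
            canceled[group] += 1
    # Rank groups by volume so that top_n keeps the best sellers.
    ranked = sorted(totals, key=lambda g: totals[g], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return {g: canceled[g] / totals[g] for g in ranked}

rates = cancellation_rate_by_group(orders, top_n=2)
print(rates)
```

Ranking by volume before computing the rate is what lets a plot be limited to the best sellers while still comparing them on a proportional basis.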

Question 1: Are international travels more canceled than national ones?

Most orders have a national destination. Proportionally, international destinations have slightly more cancellations than national ones.

Percentage of canceled orders by destination type.

Proportion of each destination type that was canceled or not.

We limited the plot to the top 35 countries to keep the graph interpretable. The distribution of orders by destination country behaves similarly to that of the origin states: a few countries account for the great majority of the cases.

Purchases and cancellations by destination country.

Cancellation proportion by destination country.

The plot is also limited to the top 35 destination states. The 5th top seller has the highest cancellation rate among the top 5. This may be due to problems related to this specific destination.

Purchases and cancellations by destination state.

Cancellation proportion by destination state.

Now consider the operation status. As mentioned before, the operation is the flight and hotel reservation process. More than half of the cases had their operation started before the cancellation. So, the operation status does not appear to have an impact on the cancellation decision.

Cancellations according to operation status.

We computed the difference between order_date and last_update_status_date to measure the time between each purchase and its cancellation.
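A minimal sketch of that computation, using the standard library's `datetime.date`. The function name and the example dates are assumptions for illustration; only the two column names come from the dataset.

```python
from datetime import date

def time_to_cancel_days(order_date, last_update_status_date):
    """Whole days between the purchase and the last status update."""
    return (last_update_status_date - order_date).days

# Hypothetical order: bought on Jan 10, 2019 and canceled on Mar 1, 2019.
days = time_to_cancel_days(date(2019, 1, 10), date(2019, 3, 1))

# An order canceled on the day it was bought yields 0 days.
same_day = time_to_cancel_days(date(2019, 1, 10), date(2019, 1, 10))
print(days, same_day)
```

This only measures time to cancel when the last status update is the cancellation itself, so it is meaningful for canceled orders.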

No single time_to_cancel value accounts for more than 5% of the sample. The most common value is 0 days to cancel, but the data has high cardinality, so a monthly granularity gives a better visualization. Even so, there is no clear pattern in the time taken to cancel an order after the purchase.
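One simple way to move from days to a coarser granularity is to bin the values. The sketch below assumes fixed 30-day bins as an approximation of a month, which may differ from how the original plot was built; the sample values are made up.

```python
from collections import Counter

DAYS_PER_MONTH = 30  # assumption: fixed-width bins, not calendar months

def month_bucket(days):
    """Map a number of days to an approximate month index (0 = first 30 days)."""
    return days // DAYS_PER_MONTH

# Hypothetical time-to-cancel values in days.
time_to_cancel = [0, 0, 12, 29, 30, 45, 95, 400]

bucket_counts = Counter(month_bucket(d) for d in time_to_cancel)
print(sorted(bucket_counts.items()))
```

Binning reduces the number of distinct values, which makes a bar plot readable when the raw feature has high cardinality.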

Most of the canceled orders had 7 nights (daily rates), and the majority (around 80%) of cases had 7 nights or fewer.

Almost 60% of the cases had 2 travelers and around 20% had just one traveler.

Question 2: Is the number of status changes related to the cancellation process?

Around 25% of the cases had just 2 status changes; in these cases, the order may have been canceled right away. Less than 25% of the cases had more than 3 status changes.

From 3 status changes onwards, there are large volumes of cancellations, which answers our second business question: more status changes lead to higher cancellation rates.

Proportionally, for cases with 3 or more status changes, most of the orders had their operation started before the cancellation. The case of 4 status changes shows the biggest difference.
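The comparison above is a two-way breakdown of canceled orders by number of status changes and by whether the operation had started. A minimal sketch, with invented rows and a hypothetical helper name:

```python
from collections import Counter

# Hypothetical canceled orders: (qty_status_changes, operation_started).
canceled_orders = [
    (2, False), (2, False),
    (3, True), (3, True), (3, False),
    (4, True), (4, True), (4, True), (4, False),
]

def started_share_by_status_changes(rows):
    """Share of orders whose operation had started, per number of status changes."""
    totals = Counter(qty for qty, _ in rows)
    started = Counter(qty for qty, was_started in rows if was_started)
    # Counter returns 0 for missing keys, so groups with no started orders work too.
    return {qty: started[qty] / totals[qty] for qty in sorted(totals)}

started_share = started_share_by_status_changes(canceled_orders)
print(started_share)
```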

Feature Importance

Now we’re going to use a CatBoost classifier to understand which features have the most impact on package cancellation. For this, we’ll consider categorical and numerical features. We’ll disregard the date features but keep the numerical ones derived from the relationships between them.

We transformed the features with LabelEncoder, converted the target to ones and zeros, split the data into train and test datasets, and fitted the model. Then, we used the SHAP library to check which features were most important to the model’s predictions.
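To make the preprocessing steps concrete, here is a dependency-free sketch of what they do. It does not call scikit-learn or CatBoost: `label_encode` mimics the sorted-class integer mapping that LabelEncoder produces, and `train_test_split_rows` is a hypothetical stand-in for a shuffled split. The category values, the 20% test ratio and the seed are assumptions for illustration.

```python
import random

def label_encode(values):
    """Map each distinct category to an integer, in sorted class order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

def train_test_split_rows(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical categorical feature and target labels.
accommodation_type = ["hotel", "resort", "hotel", "hostel"]
status = ["canceled", "not_canceled", "not_canceled", "canceled"]

encoded, mapping = label_encode(accommodation_type)
target = [1 if s == "canceled" else 0 for s in status]  # 1 = canceled

rows = list(range(10))  # stand-in for row indices
train, test = train_test_split_rows(rows)
print(encoded, target, len(train), len(test))
```

Integer codes assigned this way carry no real order, which tree-based models generally tolerate better than linear ones.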

Below we have an example case where:

The blue represents the features pushing the result towards zero, i.e. towards a ‘no cancellation’ label;
On the other hand, red represents the features pushing the result towards one, i.e. towards a ‘cancellation’ label.

This is a ‘not canceled’ example:

Waterfall graph for a not canceled order.
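A waterfall graph relies on the additive nature of SHAP values: the model output for one order is a base value plus the sum of the per-feature contributions. The sketch below assumes the explanation is in log-odds space, as is common for tree-based classifiers, and converts the total to a probability with a sigmoid. The base value and contributions are invented numbers, not taken from the model.

```python
import math

def predicted_probability(base_value, contributions):
    """Add SHAP contributions to the base value and map log-odds to a probability."""
    raw = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical explanation for a single order (log-odds units, assumed).
base_value = -0.9
contributions = {
    "qty_status_changes": -1.2,   # negative: pushes towards 'not canceled'
    "operation_started": -0.4,    # negative: pushes towards 'not canceled'
    "time_to_cancel_days": 0.3,   # positive: pushes towards 'canceled'
}

prob = predicted_probability(base_value, contributions)
print(round(prob, 3))
```

Negative contributions outweigh the positive one here, so the probability stays below 0.5, which corresponds to a ‘not canceled’ prediction.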

Below is a ‘canceled’ example. Once again, the feature qty_status_changes is the most important. The feature operation_started also has a high influence on the output.

Since qty_status_changes was the most impactful feature in the examples above, let’s try to understand the effect of this feature on the model output.

The scatter plot picked time_to_cancel_days as the feature to color by when explaining the impact of qty_status_changes. This is not the best visualization we could have, but it appears that, for most qty_status_changes values, a range of SHAP values combined with high time_to_cancel_days leads to a cancellation. There is, however, no well-defined threshold for the qty_status_changes feature.

Scatter plot for quantity of status changes.

Question 3: Which are the features that have the most impact on cancellation?

Below we have the feature importance for the model. The graph shows the distribution of each feature’s impact on the output. Based on the plot:

High values of qty_status_changes increase the predicted value (towards ‘canceled’), and low values reduce the predicted value;
High values of operation_started reduce the predicted value (towards ‘not canceled’), and low values increase the predicted value.

Other features, such as time_to_cancel_days, accommodation_type, destination_country and qty_people, also show a satisfactory separation between high and low SHAP values.

Finally, below is the mean absolute SHAP value for each feature, which confirms qty_status_changes as the most impactful feature for cancellation, followed by operation_started and time_to_cancel_days.
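The ranking in this kind of plot can be sketched as follows: take the absolute SHAP value of each feature for every order, average it over all orders, and sort. The two rows of SHAP values are invented for illustration, and `mean_abs_shap` is a hypothetical helper, not a function from the SHAP library.

```python
def mean_abs_shap(shap_rows):
    """Average the absolute SHAP value of each feature over all rows."""
    features = list(shap_rows[0])
    n = len(shap_rows)
    return {f: sum(abs(row[f]) for row in shap_rows) / n for f in features}

# Hypothetical SHAP values for two orders.
shap_rows = [
    {"qty_status_changes": 1.5, "operation_started": -0.5, "time_to_cancel_days": 0.25},
    {"qty_status_changes": -1.0, "operation_started": 0.75, "time_to_cancel_days": -0.25},
]

importance = mean_abs_shap(shap_rows)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Taking the absolute value matters because positive and negative contributions would otherwise cancel out, hiding features that push strongly in both directions.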

Conclusion

This exploratory data analysis aimed to understand more about the characteristics of package cancellations and to answer a few business questions.

Throughout the analysis, it was possible to identify that international travels are, proportionally, canceled more often than national travels. National travels still make up the majority of the purchases.

We also verified that orders with more status changes have more cancellations, proportionally.

And, through the feature importance analysis, we were able to assert that the most impactful features for cancellation are qty_status_changes, operation_started and time_to_cancel_days.

This article is part of the Data Science Nanodegree from Udacity.