Recommender systems ft. SHAP — Making business travel approval easier

Cristian Hedeș
Amex GBT technology
10 min read · Sep 6, 2023

As a travel management company (TMC), Egencia offers its B2B customers travel management tools, among which approval is an important one. More than 50% of Egencia’s customers use approval for at least one type of business trip, i.e., they have at least one configuration requiring approval for a trip. Approval allows customers to verify that their employees follow the company travel policy.

Basic trip approval functionality

Let us assume a travel manager decides to activate approval for all the employees who book an international flight more expensive than $500. In this case:

Every time a user books an international flight above this amount, say $550, the booking is finalized only once their approver approves it. The approver might ask the user to change the flight, in which case the booking is rejected.

If an employee of this company books a domestic flight of any amount or an international flight that costs less than the configured threshold, say $400, Egencia will confirm the booking at once without the need for approval.

Figure 1 — A simplified view of the Approval flow

What do we want to improve?

All the approval criteria, i.e., booking type (flight tickets, hotel bookings, rail tickets, car rental); policy status (any or only out-of-policy); geography type (domestic, international, intercontinental); trip price (lower or greater than a threshold); and other policy-specific criteria, can be combined in multiple ways. There are tens of such combinations that a customer can choose from to activate automated approval with Egencia.

Consider the flight approval configurations available:

Figure 2: Flights approval configurations on Egencia

Let us have a look at the 10 most popular combinations of flights approval criteria observed among Egencia customers:

Figure 3: Top 10 most popular flights approval criteria combinations

We observe that:

  • (purple) 38% of the criteria that send bookings for approval are policy-related
  • (yellow) 50% of the criteria involve geography
  • (blue) 15% of the criteria are related to price, and they are usually combined with geography

These are only the top 10 of the tens of unique approval configurations observed among our customers across all lines of business.

Every new customer that joins Egencia can configure several trip approval settings for their company; the approval process is then immediately in place for online, mobile app, and offline bookings made with Egencia’s customer services team. A frequent question we get from new customers is, “What do your other customers do?”

“Would be good if benchmarking data compared to like industry is available. My company really only cares about what other software companies are doing.”

Can we do better than this manual approach?

As of January 2023, more than 50% of Egencia customers require approval in at least one of their employee groups. After an in-depth exploratory data analysis and a proof of concept, we were able to show that there are patterns in the data that allow us to predict whether approval is beneficial to a customer.

Figure 4: Does approval setup vary by point of sale?

In the figure above, we provide an example of the exploratory analysis, where we observed differences in approval usage by region. Approval usage in Country (POS) 21 and Country (POS) 22 is slightly below the global average, and one specific geographical region uses approval the least. This could reflect cultural nuances.

In one of the first proofs of concept, we developed a classification model that predicted whether a Small and Medium Enterprise (SME) customer requires approval. We split the SME customers 80/20 into training and test datasets. Using the attributes describing each customer as features and their approval usage as the label, we trained a basic Random Forest classifier with 200 trees on the training dataset and evaluated it on the test set of customers whose approval usage we already knew. We obtained an f1-score of 0.69, which was sufficient to demonstrate the existence of an identifiable pattern.
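The proof-of-concept setup can be sketched as follows. This is a minimal illustration, not the production code: the data here is synthetic, and the real model was trained on Egencia customer attributes.

```python
# Minimal sketch of the proof-of-concept classifier (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in for the customer-attributes dataset (label = approval usage flag).
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)

# 80/20 train/test split, as in the proof of concept.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Basic Random Forest with 200 trees, evaluated on the held-out customers.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(f"f1-score: {f1_score(y_test, clf.predict(X_test)):.2f}")
```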

Looking at the feature importance of the proof-of-concept classifier, we observed that the attributes capturing the size of the company and the geographical regions played a significant role in predicting the approval usage flag for SME customers.

Remark: The method above is one way to compute how much each feature contributes to decreasing the weighted impurity in the decision trees. For a Random Forest, this means averaging the decrease in impurity over the trees. This approach can be biased toward inflating the importance of continuous features and high-cardinality categorical variables.
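Because of that bias, a common cross-check is permutation importance computed on held-out data. The sketch below (on synthetic data, feature count chosen arbitrarily) contrasts the two:

```python
# Impurity-based importances vs. permutation importance, which is less
# biased toward continuous / high-cardinality features. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity, averaged over trees (the method used above).
mdi = clf.feature_importances_

# Permutation importance, computed on held-out data.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print("impurity:   ", np.round(mdi, 3))
print("permutation:", np.round(perm.importances_mean, 3))
```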

What was the first thing we could improve?

Together with the approval Product Managers, we decided to automatically recommend approval configurations to new customers, based on the settings of Egencia’s existing customers with approvals activated.

How were we going to do it?

We treated the problem as a recommender systems problem, where we defined the “users” in the traditional recommender systems terminology to be the Egencia customers (B2B) and the “items” to be the “Travel approval configurations”.

Figure 6: Brief on Recommender Systems
Source: https://towardsdatascience.com/brief-on-recommender-systems-b86a1068a4dd

Recommender systems are software tools and techniques that provide suggestions for items that are most likely of interest to a particular user.

What recommender system approach better suits our use-case?

Given the main objective of recommending approval configurations to new customers, content-based methods were our choice, as they let us rely exclusively on the customer profile data. A content-based recommendation system can be implemented using very simple classification or regression models per item to be recommended, but also using more complex multi-label classification models, including deep neural networks.

This approach is mostly used in a hybrid manner, in combination with collaborative filtering, but it is still present in Pandora (item profiles), Stitch Fix’s fashion box (user profiles), LinkedIn (user profiles), and Amazon (item profiles plus user demographics).

We identify the following advantages of using content-based methods:

  • They suffer far less from the cold-start problem than collaborative approaches: new customers can be described by their characteristics (content), so relevant suggestions can be made for these new entities.
  • They are computationally fast and easily interpretable.

However, they also have a drawback related to class imbalance, which can pose a performance problem: not all approval configurations are equally distributed among Egencia customers.

Building the recommender system

We solved the recommender system problem using a multi-task classifier: for a given Egencia customer, represented by its profile, we predicted for each class (the classes are not mutually exclusive) whether the configuration should be used by all users, by a part of the users, or by none of the users.

You’re probably wondering what we mean by a “class.” To be able to train a multi-task classifier, we had to transform the dataset of approval configurations so that a label could be associated with every approval configuration. For this reason, we encoded the approval configurations, defined by the line of business involved and the criteria, as classes from 0 to n (focusing on the most popular settings observed among Egencia customers). Please find below a snapshot capturing this data transformation:

Figure 7: Transforming the approval configurations dataset into a multi-task classifier dataset

Remark: Apart from the auxiliary class_0, which represents the overall approval-usage flag, the dataset is highly imbalanced.
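The transformation in Figure 7 can be sketched roughly as a pivot from one row per (customer, configuration) pair to one row per customer with a label column per class. All column names, class names, and the 24-class count here are illustrative, not the production schema:

```python
# Hypothetical sketch of the Figure 7 transformation: pivot approval
# configurations into one multi-task label row per customer.
import pandas as pd

configs = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "class_id":    ["class_1", "class_16", "class_3"],
    "label":       ["ALL_USERS", "MAJORITY_USERS", "ALL_USERS"],
})

labels = (
    configs.pivot(index="customer_id", columns="class_id", values="label")
    .reindex(columns=[f"class_{i}" for i in range(1, 25)])
    .fillna("NOT_USED")  # classes a customer never configured
)
print(labels.loc["A", ["class_1", "class_16", "class_2"]].tolist())
# ['ALL_USERS', 'MAJORITY_USERS', 'NOT_USED']
```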

Once the data were in a suitable shape, we could experiment with different modelling approaches for the multi-task classification problem: for a given customer we wanted to predict the labels that we recommend for each approval setting possible. For example:

  • Customer A:
    TRUE for class_0,
    ALL_USERS for class_1, MAJORITY_USERS for class_16 and NOT_USED for all the other 22 classes.
  • Customer B:
    FALSE for class_0,
    and NOT_USED for all the other n-1 classes.
Figure 8: Visual representations of the recommender systems experiments

When evaluating the classifiers, we first focused on recall and precision. Together with the Product team, we decided to optimize the models more for precision: we wanted to be sure that the classes (approval configurations) we recommended for at least a part of the users were relevant, i.e., actually used by those customers, even if some classes that we did not recommend were in fact used.

This, together with an analysis of the dominant models in terms of these metrics as well as custom metrics discussed with the Product team, allowed us to pick the best modelling approach. As depicted in the graph above, it was the LightGBM classifier used in a MultiOutput wrapper with class weighting option to handle the highly imbalanced dataset.

On the test set we observed a micro-average precision of 0.29 (+363% relative to the baseline “always required for all users”) and a micro-average recall of 0.49 (+159% relative to the baseline “never required”). Given the highly imbalanced classes, these results were not surprising, especially in the cold-start context, where only a limited number of attributes describing the customer profile are available at implementation time. The results can definitely be improved with a hybrid approach that considers customers’ interactions with other settings, as well as more behavioral attributes, as soon as they start using Egencia.
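For readers unfamiliar with micro-averaging: true and false positives are pooled across all classes before dividing, so frequent classes dominate the score. A toy multilabel example:

```python
# Micro-averaged precision/recall: pool TP/FP/FN across all classes.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

# Pooled over classes: TP = 3, predicted positives = 4, actual positives = 5
print(precision_score(y_true, y_pred, average="micro"))  # 0.75
print(recall_score(y_true, y_pred, average="micro"))     # 0.6
```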

Figure 10: Detailed performance by approval configuration

Looking at the model’s performance by class (approval configuration) as presented in Figure 10, we decided to limit this version of the classifier to the 14 classes for which the recall was above 0.1. This decision was also informed by additional discussions with the Product team.

How did we build trust?

Especially from a business perspective, we can often hear that, “Having an accurate model is good, but explanations lead to better products” or from a technical perspective that, “What makes machine learning algorithms difficult to understand is also what makes them excellent predictors.” I would say that both are right.

Figure 11: Interpretability vs performance seen through machine learning techniques (non exhaustive) Source: https://www.researchgate.net/publication/341509975_Agent-Based_Explanations_in_AI_Towards_an_Abstract_Framework

Luckily, Explainable Artificial Intelligence (xAI) is an emerging field that aims to overcome this problem, and we were able to apply it in this project: for every recommendation we can provide transparency into the model’s reasoning. Below is an example message we displayed for one of the recommendations:

If you are curious about what’s behind this dynamic message: we used SHAP (SHapley Additive exPlanations), a method for studying the local interpretability of a predictive model.
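To make the idea concrete, here is an exact brute-force Shapley-value computation for a tiny, made-up scoring function. This is purely illustrative of the additive-explanation math; in practice one would use the shap library (e.g., its tree explainer) on the trained model rather than enumerate coalitions:

```python
# Exact Shapley values by coalition enumeration (illustrative only).
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Features absent from a coalition take their baseline value;
    present features take their actual value from x."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if j in subset or j == i else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy linear scoring function over two hypothetical customer attributes.
f = lambda v: 2 * v[0] + 3 * v[1]
phi = shapley_values(f, x=[1.0, 1.0], baseline=[0.0, 0.0])
print(phi)  # [2.0, 3.0]; attributions sum to f(x) - f(baseline)
```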

What now?

This first iteration of the approval recommender system has been deployed to all new customers joining Egencia since January 2022. The machine learning model, based on LightGBM, gives travel managers tailored recommendations on how to set up approval, based on the settings of similar customers. The tool uses internal information such as customer size, geographical region, industry, and other attributes available at implementation time.

So far, 74 new customers have explicitly or implicitly accepted the approval tool’s recommendations. It is interesting to observe that some customers tried the recommendation tool multiple times and seem to be looking for information.

We’ve developed a detailed process view of what happens on the page, to help discuss future improvement steps. The Directly-Follows graph above represents the process discovered from the event logs of the approval recommendation tool page. It is a graph whose nodes represent the activities in the log, e.g., RECOMMENDATION START, with a directed edge between two nodes if there is at least one trace in the log where the source activity is followed by the target activity. On top of these edges, it is easy to represent metrics such as frequency, counting the number of times the source activity is directly followed by the target activity.
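Mining a Directly-Follows graph from event logs can be sketched in a few lines: count, per trace, how often each activity is immediately followed by another. The trace and activity names below are illustrative (RECOMMENDATION START and RECOMMENDATION PENDING appear in the text; the rest are hypothetical):

```python
# Minimal Directly-Follows graph mining: edge frequencies from traces.
from collections import Counter

traces = [
    ["RECOMMENDATION START", "RECOMMENDATION SHOWN", "ACCEPTED"],
    ["RECOMMENDATION START", "RECOMMENDATION SHOWN", "PAGE LEFT"],
    ["RECOMMENDATION START", "RECOMMENDATION PENDING"],
]

dfg = Counter()
for trace in traces:
    for src, dst in zip(trace, trace[1:]):
        dfg[(src, dst)] += 1  # times src is directly followed by dst

print(dfg[("RECOMMENDATION START", "RECOMMENDATION SHOWN")])  # 2
```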

On top of that, feedback is being collected by the Product team to better understand whether travel managers are looking for information when coming back to the page and prefer to do the final action manually or whether they had in mind other settings. As a remark, the RECOMMENDATION PENDING events are related to cases where the customer profile is not yet finalized.

With this launch we’re able to gather implicit and explicit feedback from our customers that will become key to improving and developing further capabilities of the tool. We will iterate and use classifiers with more features and more complete and accurate data, as well as advanced hybrid recommender systems techniques.

Kudos to Kriti Agarwal & Palak Khantal for turning this Machine Learning project into reality, to Dr. Begoña Ascaso for reviewing my work and to Chloe Angel & Natasha Samuel for helping me make this article sound and look better.
