Unsupervised learning for e-commerce improvement: the Trivago hotel clustering case

Felipe Oliveira
Published in hurb.engineering
Aug 9, 2021 · 9 min read
Image from: https://www.mygreatlearning.com/blog/clustering-algorithms-in-machine-learning/

Machine Learning is widely used to optimize many industry and marketing use cases. Here at Hurb, for example, as a platform that aims to optimize travel through technology, we apply many such solutions, from Deep Learning for hotel image selection to product recommendations for our travelers.

However, sometimes we face a less standard Machine Learning problem with no ready-made “template” solution. That was the case with the Trivago hotel clustering project, which can be generalized to any product grouping problem inside an e-commerce case.

Ok, now that I hope I’ve caught your attention, let’s slow down a little and cover some concepts (1–4) before explaining the problem itself and the results (5–7):

  1. The e-commerce Funnel.
  2. How metasearchers work.
  3. Why Hotel’s grouping?
  4. The buckets and the need for an optimal and productionized grouping.
  5. The Machine Learning pipeline.
  6. The results.
  7. Conclusion and references.

1- The e-commerce Funnel

First, we want to highlight a characteristic of most e-commerce cases: the e-commerce funnel.

Image from https://www.robbierichards.com/seo/seo-metrics/

It is simply a representation of the customer’s actions: the customer sees the product (impression), it catches their interest (click), then they engage with it (engagement) and buy it (conversion).

For now, let’s just stick a pin in this concept and come back to it when we’re talking about feature engineering in topic 5.

2- How metasearchers work

Now, moving to our travel business use case: this funnel illustrates the purchase flow of many metasearch platforms where we display our hotels, such as Trivago and TripAdvisor.

Metasearchers are platforms where travelers search for hotels among different OTAs (Online Travel Agencies) for a specific destination during a specific period. A very simplified way to explain how metasearchers show the OTAs’ hotels is the example of a shop window or showcase: the store (the metasearcher) displays the products (hotels) based on how much the product suppliers (the OTAs) pay for each click.

So if we (an OTA) want to display our hotels in a top position of a metasearch result, we need to “win an auction” with our bids. And for each click we get, we pay the metasearcher the value of our bid.

Image from flaticon

3- Why group hotels?

Since the metasearchers know this flow, they created the possibility for each OTA to segment its hotels and adapt its bid values based on two search parameters:

  • The Length Of Stay (LOS)
  • The Time To Travel (TTT) — Check-In date minus the date of search.

This enables us to better choose and compete for a specific segment. That segmentation is called Multipliers (or Multiplicators).

Let’s say, for example, that we are one month before the Carnival festival in Brazil. If we want to boost our sales for local hotels in Rio de Janeiro during this period, we would raise our multipliers to match the Carnival holiday. Our final bid value would then be higher, increasing our chances of appearing in the top search results for that city in that period.
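To make the multiplier mechanics concrete, here is a minimal sketch of how a final bid could be composed: the base bid times the multiplier of the searcher’s LOS/TTT segment. The segment bands and values are hypothetical, not Trivago’s actual scheme.

```python
# Hypothetical sketch: final bid = base bid * multiplier of the LOS/TTT
# segment. Band cut-offs and multiplier values are illustrative only.

def final_bid(base_bid: float, multipliers: dict, los: int, ttt: int) -> float:
    """Pick the multiplier matching the LOS/TTT segment and scale the bid."""
    los_band = "short" if los <= 3 else "long"
    ttt_band = "near" if ttt <= 30 else "far"
    return base_bid * multipliers[(los_band, ttt_band)]

# Raising the ("short", "near") multiplier before Carnival boosts local
# short-stay bids without touching the rest of the inventory.
carnival_multipliers = {
    ("short", "near"): 1.5,
    ("short", "far"): 1.0,
    ("long", "near"): 1.1,
    ("long", "far"): 0.9,
}

print(final_bid(0.40, carnival_multipliers, los=2, ttt=10))  # roughly 0.60
```

The same base bid would be multiplied by 0.9 for a long stay booked far in advance, so the two segments compete with different effective bids.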

Image from trivago.com

4- The buckets and the need for an optimal and productionized grouping

Okay, now the last step before heading into the Machine Learning part.

At this point, it’s clear that multiplicators can enhance our e-commerce strategies. However, our hotel inventory is extremely large (more than 800K hotels), making it impossible for any team to choose specific multipliers for each hotel.

That’s why the metasearchers created a bucket strategy. It comprises a division into 10 buckets, where each bucket has a standard multiplier value for LOS and TTT. This enables us to better drive our business strategy by grouping our hotels according to their LOS and TTT characteristics.

For example, a bucket designated for executive hotels will have higher multipliers for low LOS and TTT, since we expect business trips to be short and booked close to the stay, whereas a bucket of resorts would have higher multipliers for high LOS and high TTT, since vacation trips are longer and scheduled in advance.

With all this background, we can see a clustering problem take shape: our goal is to group the hotels to which we want to apply the same multipliers.

5- The Machine Learning pipeline

Since we are dealing with an unsupervised learning problem, it is extremely important to define two points well:

  1. Which train of thought we are trying to reproduce.
  2. How we will assess if the clusters are good or not.

The answer to both questions lies in reproducing the stakeholders’ manual grouping.

Our marketing stakeholders usually use conversions, impressions and revenue for the most significant LOS and TTT segments, creating groups of hotels in which we invest more. So, our decision was to reproduce this task.

Our final assessment will then consist of 3 evaluations:

  • How close they are to the stakeholder’s manual grouping.
  • Whether the clusters reproduce characteristics not provided as features (similar cities, for example).
  • Unsupervised learning evaluation metrics like inter/intra cluster distances and silhouette curves.

Feature Engineering

Bearing in mind that we want to reproduce and facilitate the stakeholders’ job of applying multipliers to each class of TTT and LOS, we built our features considering the KPI performance for each respective segment.

For example, instead of considering all the impressions of a specific hotel, we split those impressions across each group of LOS and TTT.

The last part of the feature engineering was choosing which KPIs to use inside that segmentation. And here is where the e-commerce funnel enters! We considered each step of the customer’s path until checkout (order/conversion), and we divided their values by category of LOS and TTT.

Note: during our tests we tried other travel business KPIs, like average ticket, cities and countries; however, the e-commerce funnel KPIs were by far the most efficient ones, considering both the unsupervised learning metrics and the stakeholders’ evaluations.
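This feature engineering step can be sketched as follows, assuming a simplified schema: funnel KPIs logged per hotel and per LOS/TTT band are pivoted into one feature row per hotel. All column and band names here are illustrative, not our actual schema.

```python
import pandas as pd

# Hedged sketch: raw funnel KPIs (impressions, clicks, conversions) recorded
# per hotel and per LOS/TTT band become one row per hotel, with one column
# per (KPI, LOS band, TTT band) combination. Missing combinations get 0.
raw = pd.DataFrame({
    "hotel_id":    [1, 1, 2, 2],
    "los_band":    ["short", "long", "short", "long"],
    "ttt_band":    ["near",  "far",  "near",  "near"],
    "impressions": [1000, 200, 50, 900],
    "clicks":      [80, 10, 2, 60],
    "conversions": [8, 1, 0, 9],
})

features = raw.pivot_table(
    index="hotel_id",
    columns=["los_band", "ttt_band"],
    values=["impressions", "clicks", "conversions"],
    fill_value=0,
)
# Flatten the MultiIndex columns into names like "impressions_short_near".
features.columns = ["_".join(col) for col in features.columns]
print(features)
```

Each hotel then carries its whole funnel, broken down by segment, into the clustering step.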

Pipeline and the pre-processing leap

Ok, now let’s do a quick recap. So far, we’ve presented a segmentation of our hotels (by Length Of Stay, LOS, and Time To Travel, TTT) that will create our features, categorizing each product’s KPIs inside that segmentation. The KPIs chosen were the steps of the e-commerce funnel. And finally, the aim of our clustering is to group products with similar behavior inside each segment, to better drive our business strategies for them.

Now illustrating our pipeline:

After that feature engineering process, we ended up with a DataFrame with a huge number of features; so, in order to reduce its dimensionality, we applied Principal Component Analysis (PCA).

We also considered the linearity of our features (since it is a funnel, the KPI values always decrease as the funnel reaches its end) when choosing K-Means as the clustering model. That intrinsic linearity was also a point in favor of PCA as the dimensionality reduction method.

Another point in favor of K-Means was that we already had a range of cluster counts in mind (Trivago offers 10 buckets). We could then apply the elbow curve method to find the optimal K inside that range of values.
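The elbow search can be sketched like this with scikit-learn, using toy data in place of our real (proprietary) hotel features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the elbow method: fit K-Means for each candidate K (at most 10,
# matching the bucket limit) and record the inertia; the "elbow" where the
# inertia stops dropping sharply suggests the K to use.
rng = np.random.default_rng(42)
# Toy stand-in for the PCA-reduced hotel features: three visible groups.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))  # inertia drops sharply until K=3, then flattens
```

Plotting `inertias` against K gives the elbow curve shown in the figure below.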

Elbow curves on some different data set arrangements.

Finally, before feeding our principal components into the K-Means algorithm, we needed to scale them, and that turned out to be the most crucial part. In the beginning we tried Robust and MinMax scalers; however, the first results weren’t that exciting until we changed the scaling method, as we show in the next section.

6- Results

This evaluation report consists of 4 analyses for 3 different values of K: the K chosen by the elbow method and the two following values.

The 4 analyses were:

  • The number of hotels in each cluster, in absolute and percentage values.
  • The inter-cluster distance map, which represents the intra-cluster distance (by the size of each bubble) and the inter-cluster distance (by the distance between the cluster numbers) on an embedded dimension.
  • The silhouette curve.
  • The scatter-plot for the 2 most significant principal components.
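The first and third analyses can be sketched as follows, with toy data again standing in for the real features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sketch of the per-K evaluation: cluster sizes (absolute and percentage)
# plus the silhouette score. Toy, well-separated data replaces real features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(150, 2)) for c in (0, 4, 8)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

counts = np.bincount(labels)
for cluster, n in enumerate(counts):
    print(f"cluster {cluster}: {n} hotels ({100 * n / len(labels):.1f}%)")
print("silhouette:", round(silhouette_score(X, labels), 3))
```

A single cluster holding most of the points, or a silhouette near zero, is exactly the failure mode described next.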

As we can see, the model massively agglomerated the hotels into the biggest cluster (around 70%), distorting the silhouette and distance map metrics.

Besides, the stakeholders’ feedback was quite negative: this huge bucket contained most of the “real” buckets, and the others, apart from being too small, represented other characteristics that are not that important for the multiplier bidding process.

To reach better results, we ran a grid search varying the number of principal components and the scalers’ hyper-parameters. However, it didn’t improve much; we still ended up with a big 70% cluster, while the stakeholders’ optimal division was far more evenly distributed.

This problem persisted until we made a great leap, and that leap was the Quantile Transformer.
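A minimal sketch of the resulting chain, with illustrative hyper-parameters: quantile-transform the heavy-tailed funnel KPIs, reduce them with PCA, then cluster with K-Means.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Toy heavy-tailed KPI matrix (log-normal, as impression counts often are).
X = rng.lognormal(mean=0.0, sigma=2.0, size=(500, 12))

# QuantileTransformer maps each feature to a uniform distribution by rank,
# so extreme KPI values no longer dominate the distances K-Means sees.
pipeline = make_pipeline(
    QuantileTransformer(n_quantiles=100, output_distribution="uniform"),
    PCA(n_components=4),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))  # how the hotels spread across the 4 buckets
```

Because the transformer works on ranks rather than raw magnitudes, a few hotels with enormous impression counts stop pulling everything into one giant cluster.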

Simply by applying the Quantile Transformer we achieved the following results:

Even without reaching a good silhouette score, we found this attempt way better when analyzing the scatter plot and the bucket sizes, observing that it was more evenly distributed than the previous one. Furthermore, when we sent the results for the stakeholders’ analysis, we got impressive feedback: the model “guessed” right the 4 main buckets they usually build manually.

Finally, we could see a really representative pattern of other hotels’ characteristics, like cities for vacations and cities for quick business trips.

We even reached a better division point: the clustering method now separates some hotels from the same cities (a feature not provided to our model), distinguishing them by different prices and categories, represented by the bidding metrics in the picture, which are important characteristics when analyzing multipliers during the bidding process. We can see this in buckets 0 and 3.

7- Conclusion and references

First, some conclusions about the business perspective.

Impressively, this clustering technique proved to work really well on other similar e-commerce grouping cases. By generalizing it to other search engines, we have internally optimized the business decisions of each sector. The idea was born from a benchmarking study of TripAdvisor hotel segmentation, initially using travel business KPIs and later optimized to use e-commerce funnel KPIs.

In addition, we must highlight the effectiveness of the outcomes: stakeholders save time and can focus their efforts on other problems. Besides, the other “unknown” buckets might give new insights into their business understanding.

Second, from the Machine Learning perspective, there is of course always room to optimize our results, by varying hyper-parameters or even applying other ML techniques and frameworks, as we briefly tried with some implementations using H2O AutoML.

The AutoML approach was a brief attempt to optimize the silhouette score, and it actually improved; however, the gain was less than 5%, which didn’t justify bringing this framework into our production pipeline, since its productization is more complicated and cumbersome than simply using the Scikit-learn library.

We did our productization using Apache Airflow, which makes it possible to orchestrate the running tasks following a graph format. Airflow also enables scheduling the runs based on dates and on how often the model should run with different parameters.

Finally, after all the processing and clustering tasks, Airflow creates the answer tables directly in our data warehouse and sends them to our stakeholders by email.
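As a rough configuration sketch (task names, schedule and market split are hypothetical, not our actual pipeline), such a DAG could look like:

```python
# Hypothetical Airflow DAG sketch; every name and the schedule are
# illustrative placeholders, not Hurb's production code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_kpis():     ...  # pull funnel KPIs per LOS/TTT segment
def build_features():   ...  # pivot KPIs into one row per hotel
def cluster_hotels():   ...  # QuantileTransformer -> PCA -> K-Means
def publish_results():  ...  # write answer tables and e-mail stakeholders

with DAG(
    dag_id="hotel_clustering_us",   # one DAG per market (US, BR)
    start_date=datetime(2021, 8, 1),
    schedule_interval="@weekly",    # hypothetical run frequency
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=f.__name__, python_callable=f)
        for f in (extract_kpis, build_features, cluster_hotels, publish_results)
    ]
    # Chain the tasks into a linear graph, matching the pipeline order.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```

Each market configuration would get its own DAG instance with its own parameters.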

Below, we show our pipeline graph for 2 market configurations (US and BR).

Airflow graph pipeline.

Last, it’s worth emphasizing that we are dealing with linear features, which justifies the use of more standard and “simple” techniques like PCA and K-Means (in place of Deep Learning techniques), meeting the needs of our problem and resulting in a product that is easy to deploy.
