The Collective Supply: How we use Apache Spark to optimize Agoda’s Inventory

Desmond Chay
Agoda Engineering & Design
Feb 6, 2023

Introduction

As Agoda aspires to become the largest travel company in Asia Pacific, we seek to scale our inventory and availability for rooms internally and through multi-channel approaches, allowing us to expand our offerings efficiently.

Apart from our dedicated sourcing with individual properties or hotel managers, we leverage multiple channels to broaden our supply by integrating with sister companies like Booking.com and third-party hotel suppliers.

Having complementary offerings and multiple rate channels for the same properties and rooms allows us to give customers the widest variety of hotels and rooms to book from and the best prices for each. This means that even if Agoda’s internal inventory system cannot obtain allotment for a particular hotel A from supply partner X, we can still sell rooms on Agoda from other available partners.

Having more quotes for rooms, not just from Agoda but also from third-party suppliers, creates a competitive marketplace where we can source the lowest price and extend availability for the specific room a customer is looking for.

The Problem

While external sourcing comes as a huge opportunity for Agoda, this inevitably creates challenges with standardizing data input in property and room content from different supply channels.

The big question is, “Why do these third-party suppliers not offer room content data consistent with Agoda’s internal standards?” The answer is that suppliers are free to name the same physical room differently, and the name they assign can differ from the hotel’s intended room name.

For Agoda, it is more efficient to develop a mechanism that groups similar rooms than to manually convert the data to suit our systems. To give an idea of the manual effort such a conversion would require, a single data transformation currently handles around nine terabytes of data, which amounts to 300 trillion rows that would need to be verified.

Room names can often be mismatched on different travel platforms, especially for boutique hotels that use themed room type names, due to differing input standards on various platforms.

Let’s dive deeper into a real-world example to illustrate the problem clearly, using hotel A, a popular boutique hotel in Singapore (name redacted for legal reasons). If we look on the property page of hotel A on Booking.com, we can see that there are very generic room type names for sale, with room types being Standard, Superior, or Deluxe with different bedding options.

Example of hotel A rooms on Booking.com

On the other hand, on Agoda, we see a mixture of generic and exotic-sounding room types, like ‘The Reading’ and ‘The Business.’

Example of hotel A rooms on Agoda

Going back to the source of truth, hotel A’s official website, we see that the room types are named in an exotic-sounding fashion.

List of hotel A’s rooms on its website

As the snippet of rooms shown on Agoda illustrates, sourcing supply from multiple channels is a double-edged sword. While we get a more comprehensive list of rooms for a particular hotel, the same physical room can appear under several names, which clutters the page and hurts the customer experience while browsing for rooms to book.

In the above example, matching the room photos and size tells us that a ‘Superior Double or Twin Room’ is the same as ‘The Reading.’ If we knew this automatically, we would be able to reduce clutter on the page, streamline offers to obtain the cheapest prices, and provide a pleasant experience for customers searching for rooms.

The process of matching similar rooms together, however, is not always straightforward. We do not always fetch image content from other suppliers, and the size of rooms can be recorded differently. If we were to match rooms together wrongly, this would lead to undesirable behavior that makes rooms unsellable.

Our Solution

Looking back at the title of this article, you would already know that we have decided to use Apache Spark as a data processing engine to solve the problem by matching similar rooms in an automated fashion.

Apache Spark makes it simple for us to overcome the technical challenges we face in solving these business problems by providing us with built-in APIs and processing capabilities, namely:

  1. Scalability: Spark is a fault-tolerant distributed processing engine, so we can process a huge amount of data in a single run and accommodate growth simply by increasing the resources (executors, memory) we use (thank you, implicit data parallelism). Spark’s built-in web UI also exposes performance metrics and bottlenecks, allowing us to tune jobs easily when needed.
  2. Lambda architectures: The Structured Streaming component allows us to take batch jobs previously written with the DataFrame API and easily convert them into a micro-batch process. Any code written previously can be wrapped within a foreachBatch call, provided there are no event-time window operations to be concerned with.
  3. Graph-parallel Computation: GraphX, a component of Apache Spark, allows for in-memory processing using abstracted graph data structures and provides an optimized variant of Pregel that supports iterative graph parallel computation.

Spills & Skews in Apache Spark

Apache Spark is widely seen as a big data processing engine. If we estimate the number of physical hotels and distinct room types worldwide, it would be around 2 million hotels and 20 million room types. This seems like a trivial number compared to, say, running computations on Google’s ad impressions, which can amount to trillions a day.

However, let’s think about the number of suppliers we can connect to and the business problem we are trying to solve. The real scale comes from having to match listings of the same physical hotel or room from different suppliers into a single entity that represents that hotel or room to Agoda’s customers.

To do that, we would have to consider on the order of n² possible hotel-hotel or room-room pairs if we ignored simple discriminators, such as geographical location, that reduce the number of pairs to be compared. Any increase in the dimensionality (the number of features) we use as input to a machine-learning model also increases the data size accordingly.
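To make the effect of such a discriminator concrete, here is a minimal sketch in plain Python (not our Spark job). The listings, the `blocked_pairs` helper, and the choice of city as the blocking key are all made up for illustration, and the sketch does not exclude pairs from the same supplier.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical listings: (supplier, listing_id, city). All values are made up.
listings = [
    ("agoda", "a1", "Singapore"),
    ("agoda", "a2", "Bangkok"),
    ("supplier_x", "x1", "Singapore"),
    ("supplier_x", "x2", "Bangkok"),
    ("supplier_y", "y1", "Singapore"),
    ("supplier_y", "y2", "Tokyo"),
]


def all_pairs(items):
    """Every unordered pair: n * (n - 1) / 2 candidates, which grows as n^2."""
    return list(combinations(items, 2))


def blocked_pairs(items, key):
    """Only pair items that share the same blocking key (here, the city)."""
    blocks = defaultdict(list)
    for item in items:
        blocks[key(item)].append(item)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


naive = all_pairs(listings)
blocked = blocked_pairs(listings, key=lambda listing: listing[2])
print(len(naive), len(blocked))  # 15 4
```

Even on six listings, grouping by location cuts the candidate pairs from 15 to 4, and the gap widens quickly as n grows.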

Possible room names to match a room displayed to customers.

Other factors that would necessitate re-processing of hotel/room matching would be if there are supply issues with rooms that render them unsellable, or if the content provided by suppliers changed, potentially rendering them invalid matches.

Given the above-described methodology of matching rooms (and hotels that hold these rooms), we currently process more than 35 trillion rows of data daily for hotel matching as input to our machine-learning model on a typical configuration of 64 executors with 32 gigabytes of memory each.

Using Apache Spark, we can scale easily with any increase in data, as we only need to increase the number of executors and the allocated memory if we wish to process more data in memory. Given enough computing resources, there is no hard upper limit on the amount of data that can be processed; what we risk instead is inefficient processing when data no longer fits in memory and spills to disk.
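As a rough back-of-the-envelope check on when spills become likely, the snippet below estimates how much memory the configuration above gives Spark for execution and storage. It assumes the 32 gigabytes is the executor heap and that Spark's documented defaults apply (300 MB reserved per executor and `spark.memory.fraction` of 0.6); a real job may override both, so treat the result as an estimate only.

```python
# Figures quoted in the article.
executors = 64
executor_memory_gb = 32

# Assumptions: Spark's documented defaults, not necessarily our job's settings.
reserved_mb = 300        # fixed reserved memory per executor
memory_fraction = 0.6    # default value of spark.memory.fraction

heap_mb = executor_memory_gb * 1024
# Unified (execution + storage) memory available to Spark per executor.
unified_mb_per_executor = (heap_mb - reserved_mb) * memory_fraction

total_heap_gb = executors * executor_memory_gb
total_unified_gb = executors * unified_mb_per_executor / 1024

print(f"total heap: {total_heap_gb} GB")
print(f"approx. unified memory: {total_unified_gb:.0f} GB")
```

Under these assumptions, the 2,048 GB of total heap translates to roughly 1.2 TB that Spark can use for shuffles, joins, and caching before it has to spill.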

The Methodology

To briefly explain how we match hotels or rooms, we use a machine learning model built and trained in Spark to predict pairs with a high chance of matching. We also use a handmade regular expression database to filter pairs that violate our business requirements and standards.
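The regular-expression filter can be pictured with the following minimal sketch. The two rules, the `violated_rules` and `filter_pairs` helpers, and the room names are hypothetical examples, not entries from our actual rule database; the idea shown is simply that a candidate pair is rejected when only one side carries an attribute the rule looks for.

```python
import re

# Hypothetical rules: a pair is rejected when exactly one of the two
# room names matches a rule's pattern.
RULES = {
    "suite": re.compile(r"\bsuite\b", re.IGNORECASE),
    "dormitory": re.compile(r"\b(dorm|dormitory)\b", re.IGNORECASE),
}


def violated_rules(name_a, name_b):
    """Return the names of the rules on which the two room names disagree."""
    return [
        rule
        for rule, pattern in RULES.items()
        if bool(pattern.search(name_a)) != bool(pattern.search(name_b))
    ]


def filter_pairs(pairs):
    """Keep only the candidate pairs that violate no rule."""
    return [(a, b) for a, b in pairs if not violated_rules(a, b)]


candidates = [
    ("Deluxe Double Room", "Deluxe Room, 1 Double Bed"),
    ("Deluxe Double Room", "Deluxe Suite"),
    ("Bed in 6-Bed Dormitory", "Standard Double Room"),
]
kept = filter_pairs(candidates)
print(kept)  # [('Deluxe Double Room', 'Deluxe Room, 1 Double Bed')]
```

Rules like these act as a guard on top of the model's predictions: a pair is dropped regardless of how confident the model is.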

We decided to use decision trees, a supervised learning algorithm, because they can benefit from the internal manual verification of our matching results. Manually verified data can be fed back to continuously retrain the machine learning model as model drift and data drift (where new data is added) occur.

Because our algorithm generates multiple candidate rooms for a specific room to be matched with, it creates another problem: chain matches, which prevent rooms from being displayed correctly on the UI (user interface).

For example, room A can be matched to room B, which has previously been matched to room C. Given such a chain (room A -> room B -> room C), both the frontend and the backend services would struggle to resolve it so that only room C is displayed on the UI, with the rates for rooms A and B correctly folded into room C.

There is also the dilemma of deciding which content to display on the UI. Going back to the previous example of hotel A, how should we decide whether to display `Superior Double or Twin Room` or `The Reading` as the room type name that customers should see? (Trivia: The answer to this is that we decide that the content of our inventory should be preferred)

Internally, we refer to this problem and its solution as ‘flatten,’ akin to folding a chain of vertices until each vertex either only receives edges (possibly many) or has exactly one outgoing edge.
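The idea can be sketched in a few lines of plain Python. This is only an illustration of the concept on an in-memory dictionary, not our Spark implementation; the `flatten` function and the room labels are assumptions made for this example.

```python
def flatten(matches):
    """Map every matched room to the room at the end of its chain.

    `matches` maps a room to the room it was matched to,
    for example {"A": "B", "B": "C"}.
    """

    def resolve(room):
        seen = {room}
        # Follow outgoing edges until we reach a room with none.
        while room in matches:
            room = matches[room]
            if room in seen:
                raise ValueError(f"cycle detected at {room!r}")
            seen.add(room)
        return room

    return {room: resolve(room) for room in matches}


chain = {"A": "B", "B": "C", "D": "C"}
flat = flatten(chain)
print(flat)  # {'A': 'C', 'B': 'C', 'D': 'C'}
```

After flattening, rooms A, B, and D each point directly at room C, which only receives edges, so the services downstream never have to walk a chain.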

Measuring Value

So far, we have discussed how this data processing gives users a better experience through less clutter on the UI, better rates, and better inventory breadth.

However, it is difficult to measure the incremental value in the data processing work we do directly through analysis since the matching done for all hotels, rooms, and bookings in a day can vary widely.

To measure the value of our room matching process, we once ran an A/B test in which the A variant served rooms as they were (with numerous duplicate listings from many suppliers).

The B variant would have our data processing changes integrated into it. Because of performance concerns on request time (imagine the number of rooms displayed to customers on the A variant), we ran the experiment only for a small fraction of our daily traffic. The B variant proved to be a big win as measured on our experimentation platform.

If we were to serve all rooms instead, assuming our team does not exist to match rooms together, the frontend client would receive many orders of magnitude more data in the response, resulting in increased infrastructure costs and the risk of high latencies and slow response in webpage loading.

What’s Next

Data processing to resolve the content of properties and rooms from many suppliers is only part of the problem. In Agoda’s bid to have the best prices, inventory, and availability amongst our competitors, we face other problems, such as scaling content ingestion from suppliers, receiving up-to-date prices and availability from suppliers, and managing booking confirmations across systems.

In the next part of this article, we will discuss the initial DataFrame approach and how we rewrote the code using Apache Spark’s GraphX to save processing time for the same amount of data.
