The Hotels Network goes real-time
It all started with a bold idea, as it so often does at THN. The benchmark charts were lying there, static, waiting until midnight to be updated. The product was already powerful and reliable, praised by small hotels and big chains alike… but it could be better.
We wondered: what if data were ingested and analyzed in real-time? This would definitely be a plus for hoteliers, who could follow their results live. How are we performing this Black Friday? How is the network performing? Are the new rates having an effect on bookings? Or the new advert on searches? Not to mention the new possibilities this could open up.
Ok, let’s do it, we said. And we did it.
The event-driven approach
The first step towards unlocking real-time capability is adopting a company-wide policy of recording events as they happen. A new visit has been detected, a new account created… it all must be logged into a topic, to be consumed later by whichever party may be interested. This is a much better idea than relying on cross-platform ETLs, which tend to be built on a per-request basis and usually end up with redundant flows, tangled schemas and data disparities.
Once the relevant data is available in a topic, we can begin our work.
Building the flow
In our case, data starts in different topics: one for visits to the website, one for price searches, one for confirmed bookings… The initial idea was to ingest each of them separately into TinyBird, our go-to analytical database, but this triggered multiple inserts (one per topic) and did not take full advantage of the micro-batch dynamics at work underneath. The alternative consists of aggregating them into a single topic, keeping a label to indicate the object type (visit, search, booking…) and a free field to receive all attributes. The advantage of dealing with semi-structured data at this point is that every record carries only its relevant fields and does not have to abide by a rigid columnar structure. We will worry about filling in columns once the data gets materialized and precomputed.
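The single-topic envelope described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `to_event`, the field names `type` and `payload`, and the sample attributes are all hypothetical, not the actual THN schema.

```python
import json
from typing import Any


def to_event(object_type: str, attributes: dict[str, Any]) -> dict[str, str]:
    """Wrap a source record in a common envelope: a type label plus a free-form payload."""
    return {
        "type": object_type,
        # The free field holds whatever attributes the record has, serialized as JSON.
        "payload": json.dumps(attributes, sort_keys=True),
    }


# Hypothetical records from three source topics, each with its own fields.
visit = {"visit_id": "v-1", "country": "ES"}
search = {"visit_id": "v-1", "check_in": "2024-11-29", "nights": 2}
booking = {"booking_id": "b-1", "visit_id": "v-1", "amount": 240.0}

# All three end up in the same stream and can be inserted in a single micro-batch.
events = [
    to_event("visit", visit),
    to_event("search", search),
    to_event("booking", booking),
]

for event in events:
    print(event["type"], event["payload"])
```

Because the payload is opaque at this stage, adding a new attribute to one object type does not require touching the others.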
In addition to this, to obtain the best performance in an analytical database, data needs to be denormalized. This means that it is preferable to have redundant data in all tables rather than relationships that would need to be computed at execution time. For example, with a booking ID you can infer which visit led to the booking, and with the visit you can reach information about the visitor and the country they were visiting from. However, to obtain fast queries we will need country information in both the visits and bookings tables.
In order to denormalize data, these join operations need to happen, obviously, in real-time. ksqlDB has proven to be a simple yet tremendously effective tool in this regard. Copartitioned streams of deduplicated data run in parallel, matching records and joining data before sending it to the big Events topic that will be ingested into TinyBird.
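To make the deduplicate-then-join step concrete, here is a minimal in-memory Python sketch. It is not ksqlDB and not the production flow: it works on finite lists rather than continuous streams, and the function and field names (`deduplicate`, `enrich_bookings`, `visit_id`, `country`) are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator


def deduplicate(records: Iterable[dict], key: str) -> Iterator[dict]:
    """Yield only the first record seen for each value of `key`."""
    seen = set()
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            yield record


def enrich_bookings(visits: Iterable[dict], bookings: Iterable[dict]) -> list[dict]:
    """Copy the visit's country onto each booking so no join is needed at query time."""
    visits_by_id = {v["visit_id"]: v for v in deduplicate(visits, "visit_id")}
    enriched = []
    for booking in deduplicate(bookings, "booking_id"):
        visit = visits_by_id.get(booking["visit_id"])
        if visit is None:
            continue  # no matching visit: dropped, as an inner join would do
        enriched.append({**booking, "country": visit["country"]})
    return enriched


visits = [
    {"visit_id": "v-1", "country": "ES"},
    {"visit_id": "v-1", "country": "ES"},  # duplicate delivery
    {"visit_id": "v-2", "country": "FR"},
]
bookings = [
    {"booking_id": "b-1", "visit_id": "v-2", "amount": 180.0},
    {"booking_id": "b-1", "visit_id": "v-2", "amount": 180.0},  # duplicate delivery
]

denormalized = enrich_bookings(visits, bookings)
print(denormalized)
```

In a real streaming setup both inputs would be keyed and partitioned by the same field so that matching records land on the same worker, which is what copartitioning guarantees.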
Nice, we’ve unlocked real-time. What now?
The Lambda architecture
Now that we’ve reached this point, what do we do with the reliable system in place that computes information at midnight? Throw it away and put all of our eggs in the real-time basket?
While this could be an option, we decided to go with the so-called Lambda architecture. This approach tries to keep the best of both worlds: it combines the reliability and precision of a batch process (take as long as you need, but give me exact results) with the immediacy of real-time (I can accept a bit of approximation, but it has to be fast).
So, the idea is that the old system will be kept in place and the real-time layer will be added on top to cover the present day’s data.
This even brings a new perk to the table: in the event of an error in the midnight computation, the streaming window can be extended to cover the last two or three days while the batch data is rebuilt. And vice versa: if the topics were to go down for whatever reason, the batch process will ensure data is up to date at least until the previous midnight.
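The way the two layers complement each other can be sketched as a small merge function. This is a simplified illustration with made-up numbers: `merge_views`, the `streaming_days` parameter and the per-day counts are hypothetical and do not describe the actual serving logic.

```python
from datetime import date, timedelta


def merge_views(
    batch: dict[date, int],
    realtime: dict[date, int],
    today: date,
    streaming_days: int = 1,
) -> dict[date, int]:
    """Serve batch results for older days and real-time results for the streaming window."""
    # First day that is served from the real-time layer.
    cutoff = today - timedelta(days=streaming_days - 1)
    merged = {day: count for day, count in batch.items() if day < cutoff}
    merged.update({day: count for day, count in realtime.items() if day >= cutoff})
    return merged


today = date(2024, 11, 29)
# Exact counts computed at midnight (illustrative values).
batch = {date(2024, 11, 27): 120, date(2024, 11, 28): 95}
# Approximate counts from the streaming layer (illustrative values).
realtime = {date(2024, 11, 27): 118, date(2024, 11, 28): 97, date(2024, 11, 29): 40}

# Normal operation: the stream only covers the present day.
normal = merge_views(batch, realtime, today)

# Last night's batch run failed: widen the streaming window to two days.
fallback = merge_views(batch, realtime, today, streaming_days=2)

print(normal)
print(fallback)
```

Widening `streaming_days` trades a little precision for availability while the batch data is rebuilt; once the batch is healthy again, the window goes back to one day.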
Conclusion
In the fast-paced world of big data, staying ahead is not optional, it’s mandatory. The reasons why information shouldn’t be up to date in real-time don’t hold the way they might have a few years ago. It is not a trivial thing to do, but once you understand the fundamentals behind it, it comes as a natural way of doing things. After some good hard work, BenchDirect’s charts now evolve in real-time: goal accomplished.
Moreover, as one could have anticipated, it soon became clear that this development enabled other improvements in related tools: personalization of the hotel website can now rely on a wider range of variables, available faster, to better tailor each visitor’s experience.
Let’s go for it!