Scaling our customer review system for peak traffic
Abstract: Customer reviews are served by a high-traffic system that requires scaling to meet peak usage. Our scaling solution? A consistent hashing algorithm that let us scale without removing any of our availability zones from traffic, while making optimal use of our hardware in the process, all with no noticeable impact on users.
About the article:
Reviews on Booking.com are essential for our guests to make the best possible decision when selecting a property to book a stay, and any disruption affecting reviews results in a poor user experience. To help ensure continuity, we increased the hardware that runs our review backend system by 50%, and nobody noticed those changes were happening in the background. Here is how we managed those changes without affecting our customers.
Booking.com’s review system is a core part of our platform. It is one of the most important decision-making criteria for users searching for accommodations, and one of the motivators for guests to use our platform. Reviews are also highly authentic: guests who did not actually book and stay at one of Booking’s properties are not invited to complete a review. Reviews are based on scoring, multiple-choice questions, photo uploads and textual feedback, giving our guests extensive tools to evaluate their stays.
The backend system that provides review data across all parts of the Booking.com platform is one of our critical systems. High availability and low latency are essential, as reviews are widely used across the booking funnel to engage our guests with the platform. This means the backend has to handle tens of thousands of requests per second. To achieve this scale with p99 response times under 50 ms, we serve all traffic using prematerialized and cached data. The system serving those reviews enables several use cases, from basic lookups to more complex filtering, sorting and aggregation across a multitude of accommodations.
Reviews contain both scoring numbers across multiple criteria and information like textual feedback. A single review consists of data we need to aggregate across multiple databases and services. This process is called materialization.
We are talking about more than 250 million reviews
Given the sheer volume of data, storing it in memory on a single machine was never a viable option. Instead, data is partitioned across several shards running on bare metal machines. To protect against extreme cases like hardware failures or network problems, we also run replicas of the shards. This entire system is replicated across several availability zones. In the event an availability zone is lost, another zone takes over and users do not notice the disruption.
The challenges of scaling
At Booking.com we have tooling that automatically runs capacity tests against our services using live traffic. Those tests generate forecasts based on historical traffic and provisioned capacity. To predict upcoming traffic, we routinely check and consult both these forecasts and relevant business insights.
Last year’s forecast projected that review system traffic would push towards the limits of our current capacity. A traffic boom was expected during summer, which was also the first peak season post-pandemic, so we anticipated some of the highest traffic ever. This expectation was the call to action for our team to look into those predictions and come up with a scaling strategy. A break in the continuity of our service would have a significant impact on our platform, so it was extremely important for us to be well prepared for the impending high traffic.
Irrespective of the capacity forecasts, we also wanted to have enough capacity for future innovations (e.g. new product launches that could add further load on the existing system).
As previously mentioned, the system relies on sharding review data across several nodes.
Sharding is typically based on a field (the partitioning key) that decides which node a record belongs to. In our case, we chose the internal ID of an accommodation (i.e. properties, hotels, homes, etc.). This means that whenever we load a review into our system, its accommodation ID decides which shard it will be loaded into. The most basic way to do this is to apply the modulo operator to the ID. In our case this would be:
ACCOMMODATION_ID % NUMBER_OF_SHARDS
Here are some real examples:
| Accommodation ID | Number of shards | Shard ID |
| ---------------- | ---------------- | -------- |
| 12               | 3                | 0        |
| 13               | 3                | 1        |
| 10124            | 3                | 2        |
| 10124            | 4                | 0        |
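The values in the table can be reproduced with a short sketch (Python used here purely for illustration; the function name is ours):

```python
def shard_for(accommodation_id: int, num_shards: int) -> int:
    """Resolve the shard a record belongs to using plain modulo sharding."""
    return accommodation_id % num_shards

# The examples from the table above:
print(shard_for(12, 3))     # -> 0
print(shard_for(13, 3))     # -> 1
print(shard_for(10124, 3))  # -> 2
print(shard_for(10124, 4))  # -> 0
```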
How do users know where they can find the reviews of interest to them? A routing layer, or independent coordinator service, takes requests from users and fetches the appropriate data from the responsible shards. This means whenever a user requests reviews for a particular property ID, the routing layer will apply a hashing algorithm (which yields better distribution than a modulo operator) to resolve the shards from which to fetch the actual data.
Looking at the problem at hand, a naive approach would be to simply add the new hardware and let the sharding do its job. (This is commonly referred to as resharding).
Resharding would work, but has a fundamental problem: it requires a shutdown of one availability zone while the data is redistributed.
This is not a great scenario to be in.
The reason resharding is not trivially executed on a running system is that changing the number of shards redistributes the keys across the shards. If the routing layer tries to resolve the location (the shard ID) of a given accommodation during an ongoing resharding, it cannot do so reliably. For example, given 3 shards, an accommodation with ID 10124 resolves to shard ID 2. If a shard is added, the location of the key changes to shard ID 0. During the process, it would be ambiguous whether accommodation ID 10124 lives in shard ID 2 or shard ID 0.
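A quick sketch illustrates how severe the redistribution is under plain modulo sharding: going from 3 to 4 shards relocates the large majority of keys, not just the share that the new shard will own (this is an illustrative calculation, not our production code):

```python
def shard_for(accommodation_id: int, num_shards: int) -> int:
    # Plain modulo sharding, as in the earlier example.
    return accommodation_id % num_shards

# Count how many of one million IDs change shards when going from 3 to 4 shards.
moved = sum(1 for i in range(1_000_000) if shard_for(i, 3) != shard_for(i, 4))
print(f"{moved / 1_000_000:.0%} of keys relocate")  # -> 75% of keys relocate
```

Ideally, adding one shard to three should move only about a quarter of the keys (those the new shard takes over); modulo sharding moves three quarters of them.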
One potential alternative strategy is to simply spin up an independent cluster with the new number of shards and later replace the existing cluster with it. Since this approach requires far more hardware than the consistent hashing approach, it was deemed too expensive.
We knew then that we needed a solution that would satisfy the following criteria:
- No need for more than the bare minimum of hardware required to scale
- No noticeable negative impact on users during the upscaling
- Possibility to incrementally move to the new setup, with the ability to fall back at any given time during the process
In order to easily scale the system, we built the service to rely on a particular hashing algorithm that not only ensures evenly distributed data (i.e., balancing), but also allows for scaling in both directions in such a way that only the bare minimum of data has to be moved between shards.
This means we can add the necessary hardware to our datacenter and start populating those machines with their respective reviews. The property that makes this possible is referred to as monotonicity. Monotonicity means that during an ongoing resharding, data can only be transferred from old shards to new shards. This implies we can safely assume that any redistributed key will either stay in its current shard or move to the newly provisioned hardware.
In addition, the algorithm is fast (nanoseconds) and has no additional storage requirements, which makes it a good fit for our requirements.
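The algorithm in question is Jump consistent hash, described in the Lamping and Veach paper linked at the end of this article. A straightforward Python port of their reference implementation looks like this (our production code differs in language and integration details):

```python
def jump_hash(key: int, num_shards: int) -> int:
    """Jump consistent hash (Lamping & Veach): maps a 64-bit key to a
    shard in [0, num_shards) with minimal key movement on resharding."""
    b, j = -1, 0
    while j < num_shards:
        b = j
        # 64-bit linear congruential step on the key
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

# Monotonicity: when shards are added, a key either stays where it is
# or moves to one of the new shards -- never between old shards.
for accommodation_id in range(10_000):
    old = jump_hash(accommodation_id, 4)
    new = jump_hash(accommodation_id, 6)
    assert new == old or new >= 4
```

Note that the algorithm needs no lookup table: the shard assignment is recomputed from the key and the shard count alone, which is why it adds no storage requirements.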
Setting up new hardware
The first step after choosing the solution was to provision the new hardware. We opted to increase hardware capacity by 50% based on our prediction. The newly provisioned hardware was set up to use the new sharding strategy (meaning the new shards assumed they were running within a cluster with 50% more shards). The existing shards were not aware of the new shards and kept working as before, while the newly provisioned shards began populating their data.
Running both sharding mechanisms in parallel
One of the properties of consistent hashing is monotonicity, which says that when the number of shards is increased, keys move only from old shards to new shards (no unnecessary rearrangement). This allows us to consider shards that are part of an N-shard scheme (i.e., old scheme) to be also part of an (N+M)-shard scheme (i.e., new scheme). Until the new shards have fully loaded the reassigned keys, coordinators (nodes of our routing layer) are only aware of the old scheme and route traffic only to the old shards. After this, we can selectively make some coordinators aware of the new scheme and monitor the traffic.
After new shards have loaded reassigned keys, some of the coordinators were made aware of the new sharding scheme. Gradually, more coordinators became aware too. At any time we still had the option to simply fall back to the old sharding scheme and completely discard the newly added hardware. Luckily we did not have any issues during the process, so we could proceed until all coordinators were aware of the new scheme.
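The rollout can be sketched as a per-coordinator flag (the `Coordinator` class and its names here are hypothetical simplifications of our routing layer): a coordinator routes by whichever scheme it is aware of, so moving forward, or falling back, is a matter of flipping that flag on each coordinator.

```python
OLD_SHARDS = 4  # illustrative counts, not our real cluster sizes
NEW_SHARDS = 6  # old cluster plus 50% more shards

def jump_hash(key: int, num_shards: int) -> int:
    """Jump consistent hash (Lamping & Veach)."""
    b, j = -1, 0
    while j < num_shards:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

class Coordinator:
    def __init__(self, aware_of_new_scheme: bool = False):
        self.aware_of_new_scheme = aware_of_new_scheme

    def resolve_shard(self, accommodation_id: int) -> int:
        shards = NEW_SHARDS if self.aware_of_new_scheme else OLD_SHARDS
        return jump_hash(accommodation_id, shards)

# Rollout: flip coordinators one at a time; flipping back is the fallback.
c = Coordinator()
before = c.resolve_shard(10124)
c.aware_of_new_scheme = True
after = c.resolve_shard(10124)
# Monotonicity guarantees `after` is either `before` or a new shard.
assert after == before or after >= OLD_SHARDS
```

Because both schemes route every old key to a shard that still holds it during the transition, coordinators on the old and new schemes can safely serve traffic side by side.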
Going “full on”
After we started routing all traffic to the new scheme, we updated old shards to stop loading keys that were reassigned to new shards.
Observability is key
Observability of our system is one of the most crucial parts of our work. We spend a lot of time making sure our monitoring is up to date, reflects the current state of the system, and that we have meaningful dashboards in place. We have alerts configured on all critical metrics. Building and integrating instrumentation into our applications is essential for us, and is particularly relevant for any performance improvement or scaling effort. Without clear insights into our review system, we would not have been able to properly predict hardware requirements, nor would we have been able to safely execute the scaling of our system.
Plan in advance
When designing systems that will most likely need to scale in the future, making sure that horizontal scaling will be easy to execute can save a lot of headaches. In our system’s case, it was great that years ago the team decided to go with the Jump hash sharding algorithm. At the time it added no implementation overhead, but it made our recent upscaling much easier, saved a significant amount on hardware costs and made the overall process safer.
Have fallbacks ready
Things can, and will, go wrong. Assuming and planning for failure is necessary for any critical system. It was critical to have the ability to revert our changes at any point in the process. So whenever we introduce bigger changes, we try to run the old and the changed systems in parallel until we can safely drop the old one. This approach has already saved us several times from causing unpleasant experiences for our users.
Reference: “A Fast, Minimal Memory, Consistent Hash Algorithm” by John Lamping and Eric Veach: https://arxiv.org/pdf/1406.2294.pdf