Load Testing a Recommendation Service

Tiago Alves
Bluecore Engineering
9 min read · Jan 24, 2023
“Cargo ship unloading colorful containers in port”, Kelly via Pexels

It’s that time of the year again. Great deals everywhere, everybody wants to spend their savings, and all retailers can think about is letting shoppers know that they have the best prices of the year. That also means a huge volume of emails being sent every hour, which for our team means a huge volume of requests to the recommendation service. We see a 77% increase in volume during the Black Friday and Cyber Monday (BFCM) period compared to a regular week, and this year wasn’t any different.

Every personalized email that is sent generates multiple requests to the recommendation service in order to get the recommended products for each shopper. To handle these requests, the service is written in Go and, most of the time, fetches precomputed recommendations from either Bigtable or ElasticSearch, depending on the request parameters, with an LRU cache in front of ElasticSearch requests.

Figure 1 — The Recommendation Service will fetch recommendations either from Bigtable or ElasticSearch, according to the requested type.
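
To make the routing concrete, here is a minimal sketch of that flow. Everything in it is a placeholder for illustration (the Fetcher, Request, store and cache names, and which recommendation types live in which data source are all assumptions), not the actual service code.

```go
package recs

import "context"

// Placeholder types for illustration.
type Request struct {
	Type      string // e.g. "personalized", "co-recs", "trending" (hypothetical names)
	ShopperID string
	CacheKey  string
}

type Recommendation struct{ ProductIDs []string }

// store abstracts a data source that serves precomputed recommendations.
type store interface {
	Fetch(ctx context.Context, req Request) ([]Recommendation, error)
}

// cache abstracts the LRU sitting in front of ElasticSearch.
type cache interface {
	Get(key string) ([]Recommendation, bool)
	Add(key string, recs []Recommendation)
}

type Fetcher struct {
	bigtable store
	elastic  store
	esCache  cache
}

// Recommend routes the request to the data source that owns the requested
// recommendation type; ElasticSearch results go through the LRU cache.
func (f *Fetcher) Recommend(ctx context.Context, req Request) ([]Recommendation, error) {
	if usesBigtable(req.Type) {
		return f.bigtable.Fetch(ctx, req)
	}
	if recs, ok := f.esCache.Get(req.CacheKey); ok {
		return recs, nil // cache hit: no ElasticSearch query
	}
	recs, err := f.elastic.Fetch(ctx, req)
	if err != nil {
		return nil, err
	}
	f.esCache.Add(req.CacheKey, recs)
	return recs, nil
}

// usesBigtable is a stand-in: the post doesn't spell out which types map to
// which data source.
func usesBigtable(recType string) bool {
	return recType == "personalized" || recType == "co-recs"
}
```

The detail that matters for what follows is that, at this point, only the ElasticSearch path had a cache in front of it.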

With that in mind, our main concern was: “How many requests can we make to Bigtable concurrently? What about ElasticSearch?” We started preparing for BFCM around August, running rigorous tests to identify the biggest bottlenecks and pain points in our system.

The Experiments

Recommendation Service Load Test Tool

For this year’s load testing we used the same tool as last year, with a few minor updates. The tool is composed of two parts: a server running in the same cluster as the recommendation service, and a client running on our local machines.

The client functions as a driver, giving the server instructions and settings: which requests to make, at what rate, how many concurrent requests to run, and so on. Given those instructions, the server downloads a file with the requests and sends them to the recommendation service, following the settings as closely as possible. Once the test is over, the server returns the stats and the client compiles them, generating user-friendly logs and histograms. We also use a stat_fetcher tool to monitor the service’s Prometheus metrics during the test period.

Figure 2 — Multiple tools are combined around the recommendation service to run the tests.
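
The tool’s code isn’t included in this post, but a rough sketch of what each load-tester server does with the driver’s instructions might look like the following. The struct fields, the URL-per-line file format, and the naive ticker-based pacing are all assumptions for illustration.

```go
package loadtest

import (
	"net/http"
	"sync"
	"time"
)

// Instructions is a hypothetical shape for what the client driver sends.
type Instructions struct {
	URLs     []string      // request URLs parsed from the downloaded test file
	Duration time.Duration // how long the experiment should run
	RPS      int           // target requests per second for this server
	Workers  int           // concurrent threads
}

// Stats is what the server reports back for the client to compile.
type Stats struct {
	mu        sync.Mutex
	Sent      int
	Errors    int
	Latencies []time.Duration // raw samples the client turns into histograms and p99
}

func run(ins Instructions) *Stats {
	stats := &Stats{}
	jobs := make(chan string)

	// Workers send requests as the pacer hands them out and record latency.
	var wg sync.WaitGroup
	for i := 0; i < ins.Workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				start := time.Now()
				resp, err := http.Get(url)
				stats.mu.Lock()
				stats.Sent++
				if err != nil {
					stats.Errors++
				} else {
					resp.Body.Close()
					stats.Latencies = append(stats.Latencies, time.Since(start))
				}
				stats.mu.Unlock()
			}
		}()
	}

	// Naive pacer: hand out one URL per tick until the duration elapses.
	ticker := time.NewTicker(time.Second / time.Duration(ins.RPS))
	defer ticker.Stop()
	deadline := time.Now().Add(ins.Duration)
	for i := 0; time.Now().Before(deadline); i++ {
		<-ticker.C
		jobs <- ins.URLs[i%len(ins.URLs)]
	}
	close(jobs)
	wg.Wait()
	return stats
}
```

At the rates we needed, the real tool obviously requires more careful pacing, connection reuse, and error handling; the sketch only shows how the driver’s settings turn into traffic and stats.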

With the load test tool ready, the next step was to define which experiments we wanted to run, and what the input and expected results would look like for them. An experiment can be defined by the following attributes:

  1. A set of requests (or test file)
  2. How long it should take
  3. Number of requests per second
  4. Number of requesting instances (load test tool servers)
  5. Number of threads per server
  6. Expected number of requests (#2 x #3 x #4; see the sketch after this list)
  7. Actual number of requests
  8. p99 of the latency
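
As a quick illustration of attribute 6, with hypothetical numbers and the assumption that the per-second rate is per load-tester server (since it is multiplied by the server count):

```go
package loadtest

import "time"

// Expected requests (attribute 6) = duration in seconds × rate per server ×
// number of servers. Hypothetical example: a 5-minute run at 500 req/s from
// 40 servers should produce 300 * 500 * 40 = 6,000,000 requests. Comparing
// this with the actual count (attribute 7) shows whether the tool kept up.
func expectedRequests(duration time.Duration, rpsPerServer, servers int) int {
	return int(duration.Seconds()) * rpsPerServer * servers
}
```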

We can divide the experiments into two groups: Bigtable and ElasticSearch. There’s also one extra experiment with mixed requests, hitting both data sources. For Bigtable, we defined the following test scenarios:

  1. 64 distinct requests to use as a baseline in a simple scenario.
  2. Over 200,000 distinct requests to break internal caches and simulate a close-to-reality scenario.
  3. Complex requests only (requests with many filters or that require many products) to see how the system responds under heavy stress.
  4. A replication of a real campaign that had been underperforming, to better understand why.

For ElasticSearch, the following tests were defined:

  1. A single unique request to use as a baseline.
  2. Many simple requests (requests that don’t have a lot of filters or only have common filters), as they are the most common type.
  3. Over 200,000 distinct requests to see how the system responds with low cache hit rates (see the sketch after this list).
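
Both groups include a file with over 200,000 distinct requests. As a hypothetical sketch, such a file can be generated simply by varying the identifier in each request so that no cache in the path gets a second hit; the endpoint and query parameters below are placeholders, not the real API.

```go
package loadtest

import (
	"bufio"
	"fmt"
	"os"
)

// writeDistinctRequests writes n request URLs, one per line, each with a
// unique shopper ID so internal caches can't help during the test.
func writeDistinctRequests(path string, n int) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	for i := 0; i < n; i++ {
		// Placeholder endpoint and parameters for illustration only.
		fmt.Fprintf(w, "https://recs.example.internal/recommendations?shopper_id=shopper-%06d&type=personalized\n", i)
	}
	return w.Flush()
}
```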

With all the files created and the experiments defined, we were ready to start hitting the recommendation service with everything we had. Our goal was to reach at least 50,000 requests per second with a p99 latency under 200ms.

Running Gamedays

Before starting the load tests, we upscaled the systems to handle traffic levels close to what we expected during BFCM: between 80 and 120 instances of the recommendation service running and around 40 load-tester instances. We scheduled the gamedays for times when the regular load was low to avoid interfering with production traffic.

We used a separate shadow deployment of the recommendation service that was identical to production but with a different endpoint. This ensured that we wouldn’t affect regular requests by spamming the service and breaking the cache. However, we still needed to be careful because we were hitting the same Bigtable and ElasticSearch instances.

The first attempts were really frustrating. There were too many knobs to turn: sometimes the driver file would break the test with unexpected errors, and even when we managed to run a complete experiment, it would not reach the expected number of requests. A few retries, pull requests, and resource adjustments later, we started to see real traffic going into the recommendation service shadow instances.

We reserved one hour for each gameday session, to make sure we had enough time not just to run the experiment but also to downscale the systems afterward. Every experiment ran for a short period of time, usually around five minutes. In some special cases, we let it run for 15 minutes to get enough traffic to all the pods and to warm up all the local caches. During each session, at least two engineers stayed connected in a virtual room: one responsible for running the client load tester and collecting the results, the others monitoring the system metrics to ensure everything was going as expected.

What We Found

Once we got over those initial hurdles, the experiments were really successful. Our target was more than 50,000 requests/second with less than 200ms p99 latency. The simple Bigtable experiments showed very fast results, reaching up to 120,835 requests/second with a p99 latency of 120ms. The simple ElasticSearch requests were also very successful, going up to 52,399 requests/second with a p99 latency of 100ms, while the test with many requests pushed even more throughput, reaching 112,149 requests per second at 200ms p99 latency!

However, the heavier Bigtable experiments didn’t go well against the same targets. The experiment with over 200,000 distinct requests got at most 14,010 requests per second with 1400ms p99 latency. The complex requests test did a little better, with 23,938 requests/second and 510ms p99 latency, but that wasn’t good enough. Finally, the real campaign recreation didn’t give great results either, reaching 16,831 requests/second and 410ms p99 latency. During these tests we also noticed errors reported by Bigtable, a sign that something was going wrong with these requests.

The results are compiled in Table 1.

Table 1 — Experiment results in requests per second and p99 latency.

From the first rounds of experiments, we concluded that we were ready for BFCM on the ElasticSearch side, mainly thanks to our caching system, but not quite ready when it came to Bigtable requests. So we shifted our focus to figuring out what went wrong in these experiments.

Fixing the Bottleneck

To understand what was going wrong with our requests, we analyzed the queries being made. We quickly noticed a correlation between one type of recommendation, co-recommendations (co-recs), and the errors in Bigtable. Co-recs are recommendations based on the interaction between shoppers and products, such as products viewed together or products bought together. Even though this is a personalized recommendation type, it takes products as input instead of shoppers’ identifiers, making it slightly different from the other personalized recommendation types. We then separated the requests of this recommendation type and reran the experiments, confirming that the problem was indeed in these queries.

After more investigation, we discovered that the difference between serving this type of recommendation and the rest is the number of products we need to process during the request. Since co-recs are generated from input products, the service has to fetch recommendations for each of these products, mix them, and return a single set of recommendations, whereas with other types it usually only needs to handle a single input identifier. This is why these queries take longer to process and need to fetch more data from Bigtable. Once a lot of these requests came in, the pipeline started to get overwhelmed, causing the latency spikes and the Bigtable errors we saw.
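
In rough pseudocode terms, the difference looks like this; the types, the coRecStore interface, and mergeAndRank are placeholders for illustration, not the service’s actual implementation.

```go
package recs

import "context"

// Placeholder types for illustration.
type Product struct{ ID string }

type coRecStore interface {
	// RecsForProduct reads the precomputed co-recommendation row for one product.
	RecsForProduct(ctx context.Context, productID string) ([]Product, error)
}

// coRecs does one Bigtable read per input product and then merges the results
// into a single set, whereas other personalized types need only one read keyed
// by the shopper. More input products means more Bigtable traffic per request.
func coRecs(ctx context.Context, store coRecStore, inputProducts []string, limit int) ([]Product, error) {
	var candidates []Product
	for _, id := range inputProducts {
		recs, err := store.RecsForProduct(ctx, id)
		if err != nil {
			return nil, err
		}
		candidates = append(candidates, recs...)
	}
	return mergeAndRank(candidates, limit), nil
}

// mergeAndRank stands in for however the service deduplicates and trims the
// combined candidates down to one recommendation set.
func mergeAndRank(candidates []Product, limit int) []Product {
	seen := make(map[string]bool)
	out := make([]Product, 0, limit)
	for _, p := range candidates {
		if len(out) == limit {
			break
		}
		if seen[p.ID] {
			continue
		}
		seen[p.ID] = true
		out = append(out, p)
	}
	return out
}
```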

Recommendations are unique, which means Bluecore doesn’t serve personalized recommendations for the same shopper more than once within a few hours. Because of this, we had never thought about caching them (unlike ElasticSearch-based recommendations, which are less personalized and thus much more cacheable). After a brainstorming session, though, we concluded that since co-recs are generated from products that shoppers interacted with (browsed, bought, etc.), the majority of requests were most likely asking for recommendations for the most popular products, which means that in this particular case we were fetching the same keys from Bigtable over and over again.

Figure 3 — Co-recs Bigtable recommendations are also cached now.

We decided to test that hypothesis by analyzing the query logs and checking the ratio of distinct input product IDs to the total number of requests. We were surprised to discover that, over a short period of time, this ratio was considerably low, meaning that many requests were asking for the same input product IDs.

The solution was right in front of us: cache Bigtable co-recs recommendations, as shown in Figure 3. It’s important to note that this assumption does not hold for the other personalized recommendation types, so caching those would be pointless and would only consume a lot of memory.
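
A minimal sketch of that change, reusing the placeholder types from the earlier co-recs sketch and using hashicorp/golang-lru as an example LRU implementation (the post doesn’t name the library actually used): wrap the Bigtable-backed store with an in-memory LRU keyed by input product ID.

```go
package recs

import (
	"context"

	lru "github.com/hashicorp/golang-lru" // example LRU package, not necessarily the one used
)

// cachedCoRecStore wraps the Bigtable-backed co-recs store with an in-memory
// LRU keyed by input product ID, exploiting the fact that most requests ask
// about the same popular products.
type cachedCoRecStore struct {
	inner coRecStore // the Bigtable-backed store from the previous sketch
	cache *lru.Cache
}

func newCachedCoRecStore(inner coRecStore, size int) (*cachedCoRecStore, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &cachedCoRecStore{inner: inner, cache: c}, nil
}

func (s *cachedCoRecStore) RecsForProduct(ctx context.Context, productID string) ([]Product, error) {
	if v, ok := s.cache.Get(productID); ok {
		return v.([]Product), nil // cache hit: no Bigtable read at all
	}
	recs, err := s.inner.RecsForProduct(ctx, productID)
	if err != nil {
		return nil, err
	}
	s.cache.Add(productID, recs)
	return recs, nil
}
```

Because the cache is in-memory and per-instance, it only pays off if popular products keep recurring on each pod, which is exactly what the low ratio of distinct product IDs suggested.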

Since we were adding another in-memory cache, we also decided to fine-tune the computational resources for the service, allowing pods to request a higher CPU limit and increasing the available memory from 8GB to 12GB per instance.

Table 2 — Experiment results in requests per second and p99 latency after deploying the new LRU cache and increasing computational resources.

Once everything was implemented, we reran the experiments and this time got amazing results. The new cache sped up the slow requests by reducing the number of calls to the original data source. This caching, along with the increased resources, allowed us to respond faster, ultimately increasing the requests per second of all the experiments, as shown in Table 2.

Experiments that benefit from caching, such as the Bigtable 64 unique requests and the ElasticSearch ones, became extremely fast, reaching up to 317,474 requests per second with only 60ms p99 latency, way more than what we were aiming for. Our biggest surprise was the replication of the real campaign: it would barely complete before, and now it got up to 258,640 requests per second with 75ms p99 latency. That’s because this campaign is mainly composed of co-recs, so after a few requests almost all recommendations were in the in-memory cache, avoiding trips to Bigtable and responding much faster. Needless to say, we no longer experienced Bigtable errors.

The only experiment that didn’t reach the expected performance was the Bigtable one with over 200,000 distinct requests, even though it improved significantly, going from 14,010 requests per second at 1400ms p99 latency to 40,269 requests per second at 240ms: roughly 2.9 times the throughput at less than a fifth of the latency. This is mainly because this experiment is composed of many different recommendation types, so the speedup came from caching co-recs while we were still making a lot of concurrent requests to Bigtable for everything else. Even though we didn’t reach the expected performance, we decided we would still be fine, since we were aiming much higher than what we expected to see during the peak load, and the cost of improving this experiment’s performance was too high at the moment.

Conclusion

We ran experiments to understand the performance of our system, identified the bottlenecks, and analyzed the usage patterns to reach a working solution. After validating and implementing the solution, we reran the experiments and saw a considerable improvement in performance.

November came to an end, and with it BFCM arrived. Load increased as expected, peaking at 35.29 million requests in a single hour (an average of roughly 9,802 requests per second), and the recommendation service handled all of it with 91ms p99 latency, so it’s safe to say we achieved our goals.

Tiago Alves is a Senior Software Engineer who has been at Bluecore for a year, mostly focused on infrastructure for data science.

Interested in helping scale our platform as we continue growing? We’re hiring at Bluecore!
