This blog covers our journey from conventional programming to reactive programming in order to build a high-throughput, low-latency software system. Before jumping into the engineering details, it helps to understand the product requirements that mandated these software optimizations.
The problem: How do users find out what discount promotions are available to them, specifically for the product they are looking to buy?
The answer is usually: “I received a promotion code (promo code) through a push notification or email”, “I saw a promotion banner somewhere”, “my browser extension software gave me a code”, or “my friend shared a code”. After getting excited and entering the code, they often see a pop-up saying: “This promo code is not valid”, “code has expired”, “you are not invited”, or “this product is not eligible for the promotion.”
We wanted to introduce click to apply (“CTA”) at Groupon in the hope of making the customer experience much better, thereby reducing cart abandonment and increasing the overall purchase rate.
The CTA feature was run through A/B testing in both North American and international markets. Our customers loved it in all markets, as they could see the best discounted promotions they were eligible for during the product browsing phase and apply them with a single click.
In this blog, we would like to detail how we carried out a series of architectural changes and optimizations to support a feature for which performance was critical to being production-ready.
The Engineering Challenge
Essentially, we wanted to show the best promo code available to a user on the product details page.
However, working out what promotions to highlight on the product details page is not straightforward.
Firstly, our current system was designed to handle consumer traffic at the checkout stage, which is less than 1/600 of the traffic at the product browsing phase. Secondly, most of the promotions that we run at Groupon are highly personalized, as they are based on each person’s previous interests and purchase behavior, as well as on the product itself. Therefore promo-code eligibility has to be worked out on the fly. Meeting the required Service Level Agreement (SLA) of 1/5 the current response time, at 600 times the consumer traffic, would require a major reworking of the architecture.
Horizontal scaling was an option, but it would add significant cost, so we wanted to avoid it. Instead, we decided to push the existing compute capacity to its maximum potential by using reactive programming.
The Journey for Improved Performance
Before we jump into our optimization journey we would like to introduce you to the architecture of the promotions platform:
- The Promotions platform is built as a microservice using the REST architectural style. It is deployed on a fleet of VMs with no state sharing.
- It is built on Java with Play Framework 2.6.0, which is built on Akka and inherently supports the Actor Model. Actors are extremely lightweight entities that utilize underlying thread pools to achieve their objectives. This framework supports reactive programming, which helped us in our optimization journey.
- The promotions platform persists the eligibility constraints for customers and products (aka campaign metadata) in a Postgres database.
- Each host had a Guava in-memory cache, which keeps crucial campaign metadata close to the evaluation engine and is refreshed periodically. We replaced it with a Caffeine-based in-memory cache, chosen because it provides many useful APIs, e.g. key-based TTL and callbacks on cache refresh.
- Some of the product-related data necessary for campaign evaluation comes from other services. This data is stored in the in-memory cache of each host as well as in distributed Redis instances accessible to all hosts, to avoid frequent downstream API calls to other services.
- We chose Cassandra to store personalized promotional data for millions of customers as Cassandra scales well for this use case at this volume.
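To make the key-based TTL idea concrete, here is a minimal hand-rolled sketch in plain Java. Caffeine provides this behavior (plus refresh callbacks, eviction policies, and statistics) out of the box, so this is purely illustrative and not the library's implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal per-key-TTL cache sketch. Caffeine's Expiry API gives this
// out of the box; this version only illustrates the semantics.
public class TtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtNanos) {}

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value, long ttlNanos) {
        map.put(key, new Entry<>(value, System.nanoTime() + ttlNanos));
    }

    // Returns the cached value, reloading via `loader` on a miss or
    // when the entry's own TTL has elapsed.
    public V get(K key, Function<K, V> loader, long ttlNanos) {
        Entry<V> e = map.get(key);
        if (e == null || System.nanoTime() >= e.expiresAtNanos()) {
            V v = loader.apply(key);   // refresh on miss/expiry
            put(key, v, ttlNanos);
            return v;
        }
        return e.value();
    }
}
```

Each entry carries its own deadline, which is what "key-based TTL" buys over a single cache-wide expiry.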
There are three major tasks that need to be executed to check the eligibility of a promotional campaign for a given customer.
- All the active campaigns are fetched from the in-memory cache; we then find, from Cassandra, the campaigns this customer is qualified for.
- We get additional details for the qualified campaigns from Redis.
- We evaluate the business rules aka constraints of the campaigns sequentially, e.g. for c matching campaigns each having r business rules,
total evaluation steps = c * r
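Instead of walking all c * r rules on one thread, each campaign's rules can be evaluated as its own task and the results gathered at the end. A minimal sketch with a plain thread pool (the types are illustrative, not our actual evaluation engine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class ParallelEval {
    // Submit one task per campaign; a campaign is eligible only when
    // every one of its rules passes. Returns the indices of eligible
    // campaigns.
    static List<Integer> eligibleCampaigns(
            List<List<Predicate<String>>> campaigns, String customer,
            ExecutorService pool) {
        List<Future<Boolean>> futures = new ArrayList<>();
        for (List<Predicate<String>> rules : campaigns) {
            futures.add(pool.submit(
                    () -> rules.stream().allMatch(r -> r.test(customer))));
        }
        List<Integer> eligible = new ArrayList<>();
        try {
            for (int i = 0; i < futures.size(); i++) {
                if (futures.get(i).get()) eligible.add(i);
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        return eligible;
    }
}
```

With p worker threads the wall-clock cost drops from c * r toward (c * r) / p, which is the gain the actor-based parallelization below aims for.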
The steps to performance optimization
- Fine-tuned the application threads to support higher throughput.
- Added more VMs with more powerful configurations.
These optimizations only slightly improved the performance; our tests using JMeter revealed that the sequential evaluation in the third task above was the major bottleneck.
We decided to parallelize constraint evaluation using actor pools, via their Scala implementation in the Akka Play framework. A single actor processes only one message at a time, so to make it scalable we leveraged load balancing and used Scatter-Gather First Completed routing for the actor pool.
We were able to attain major performance improvements, but we still did not meet the target at this step.
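Conceptually, Scatter-Gather First Completed routing fans the same request out to several workers and returns whichever reply arrives first. A minimal sketch of those semantics with plain CompletableFutures (not Akka's actual router API):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

public class ScatterGatherFirst {
    // Scatter: start every worker on the pool.
    // Gather-first: complete as soon as any one of them answers.
    static <T> T firstCompleted(List<Supplier<T>> workers, ExecutorService pool) {
        CompletableFuture<?>[] futures = workers.stream()
                .map(w -> CompletableFuture.supplyAsync(w, pool))
                .toArray(CompletableFuture[]::new);
        @SuppressWarnings("unchecked")
        T first = (T) CompletableFuture.anyOf(futures).join();
        return first;
    }
}
```

The calling thread never waits for the slow workers, which is exactly why this routing suits a "first eligible campaign wins" requirement.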
- Using JMX tools (JConsole, jvisualvm) we found that I/O threads were blocked in the Redis client library. We implemented Java non-blocking socket decorators over the standard blocking library used for Redis communication.
- We varied the parallelism factors (the size of the thread pools shared by the actors).
We observed that once the system is designed to behave fully reactively, further increasing parallelism becomes counterproductive. After multiple rounds of performance testing, we arrived at the optimal thread-pool and actor-pool sizes. A scale of 600K requests/min (rpm) with a 99th-percentile (TP99) latency < 20 ms could be achieved with significantly fewer threads, and increasing them beyond the optimal limit only contributed to more context switching.
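For reference, actor thread pools in Akka/Play are tuned through dispatcher configuration in `application.conf`; a fragment like the following (numbers purely illustrative, not our production values) is where the parallelism factors mentioned above live:

```hocon
# application.conf -- illustrative numbers only.
# Akka sizes the pool to parallelism-factor * available cores,
# clamped to [parallelism-min, parallelism-max].
akka.actor.default-dispatcher {
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min    = 8
    parallelism-factor = 2.0
    parallelism-max    = 24
  }
  # number of messages an actor processes before yielding its thread
  throughput = 5
}
```

Keeping these numbers small and measured, rather than large and hopeful, is what avoided the context-switching penalty we saw past the optimum.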
This feature was initially rolled out as an experiment in North America, and after its success we optimized it further, as we knew there would be 1,000 times more campaigns to evaluate once the feature rolled out to all international markets. Additionally, the requirement changed from finding the first eligible discount to finding the best available discount for a customer from all the active campaigns. We took the following steps:
- We moved away from Scatter-Gather First Completed routing and instead used Smallest Mailbox routing, since finding the best discount requires every campaign evaluation to complete rather than just the first.
- We used the One-for-One supervision strategy for actor pools. It restarts an actor that dies while processing a message, making the system highly fault-tolerant.
- We observed that Cassandra lookup latencies were proportional to the number of campaigns. So instead of a single query for all the active campaigns, we broke it up into smaller batches of campaigns, ran them in parallel, and then combined the results of the individual queries.
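The batch-and-merge pattern in the last bullet can be sketched without any Cassandra dependency; `queryBatch` below is a hypothetical stand-in for the driver's asynchronous query call:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

public class BatchedLookup {
    // Split one large key lookup into fixed-size batches queried in
    // parallel, then merge the partial results -- the same pattern we
    // applied to the single large Cassandra query.
    static <K, V> Map<K, V> lookupInBatches(
            List<K> keys, int batchSize,
            Function<List<K>, Map<K, V>> queryBatch, ExecutorService pool) {
        List<CompletableFuture<Map<K, V>>> parts = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            List<K> batch = keys.subList(i, Math.min(i + batchSize, keys.size()));
            parts.add(CompletableFuture.supplyAsync(() -> queryBatch.apply(batch), pool));
        }
        Map<K, V> merged = new HashMap<>();
        for (CompletableFuture<Map<K, V>> part : parts) {
            merged.putAll(part.join());   // gather and combine
        }
        return merged;
    }
}
```

End-to-end latency then tracks the slowest batch instead of the sum of all lookups, which is why it stayed flat as campaign counts grew.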
We were able to scale our system to provide the same performance for 1,000 times more campaigns at the same 600K rpm and TP99 < 20 ms.
- All performance tests were done using JMeter, a highly popular open-source performance-testing tool that can create and execute customized test plans for HTTP requests through both a GUI and a CLI.
- To monitor JVM performance, we used the jvisualvm tool, which can make a JMX connection to a remote server and has a user-friendly GUI. Critical JVM metrics such as CPU usage, memory utilization, and thread states are easy to monitor with it.
- All tests were performed with JMeter on 12-core VMs (Intel Xeon, 3 GHz clock frequency, 6000 BogoMips).
- The Cassandra key lookup mentioned earlier, along with some other sequential operations, forms close to 5% of the operations and cannot be functionally parallelized.
- It turns out that it is sufficient to return eligibility from the first successful campaign evaluation; the response need not wait for all actors to finish their evaluations. This is where Scatter-Gather routing helps, as it allows the calling thread to return a response as soon as the first successful response comes back.
- Earlier, most of the latency came from blocking I/O calls. When these were replaced with non-blocking I/O, we saw the empirical cost come down to 40%.
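The difference between the two styles can be seen in miniature with Java NIO, the machinery such non-blocking decorators are typically built on. The sketch below round-trips a message through a local echo server with a client that never parks on I/O; a Selector signals readiness instead, so the thread could interleave other work between events. This is illustrative only, not our Redis decorator:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class NonBlockingDemo {
    static String roundTrip(String msg) throws IOException {
        // Trivial blocking echo peer on an ephemeral loopback port.
        ServerSocketChannel server =
                ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0));
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        new Thread(() -> {
            try (SocketChannel c = server.accept()) {
                ByteBuffer buf = ByteBuffer.allocate(256);
                c.read(buf);
                buf.flip();
                c.write(buf);
            } catch (IOException ignored) { }
        }).start();

        // Non-blocking client: the thread waits on readiness events,
        // never on the socket itself.
        SocketChannel client = SocketChannel.open();
        client.configureBlocking(false);
        Selector selector = Selector.open();
        client.register(selector, SelectionKey.OP_CONNECT);
        client.connect(new InetSocketAddress("127.0.0.1", port));

        ByteBuffer out = ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8));
        ByteBuffer in = ByteBuffer.allocate(256);
        String reply = null;
        while (reply == null) {
            selector.select();                         // block on readiness, not data
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable() && client.finishConnect()) {
                    key.interestOps(SelectionKey.OP_WRITE);
                } else if (key.isWritable()) {
                    client.write(out);
                    if (!out.hasRemaining()) key.interestOps(SelectionKey.OP_READ);
                } else if (key.isReadable() && client.read(in) > 0) {
                    in.flip();
                    reply = StandardCharsets.UTF_8.decode(in).toString();
                }
            }
            selector.selectedKeys().clear();
        }
        client.close();
        selector.close();
        server.close();
        return reply;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("campaign:42"));
    }
}
```

In the production path, one such selector loop multiplexes many Redis connections, which is why CPU utilization rises while thread counts fall.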
Table 1 shows that with non-blocking I/O, CPU usage goes up because threads are not blocked on I/O but are handling more requests, improving concurrency through higher CPU utilization.
P.S.: The true extent of parallelism is hard to quantify as the number of requests increases, hence the sequential numbers are used as the benchmark for speedup.
Thread optimization with non-blocking I/O
- Making multiple parallel asynchronous Cassandra queries based on primary keys, instead of a single query for all campaigns, significantly improved Cassandra lookup latencies. Note: the Cassandra version used is 2.2.
Table 2 shows that when a large number of keys is provided in a single Cassandra select query, the system is not able to scale. Making multiple parallel queries reduces latencies and improves performance. Additionally, CPU usage goes up because threads are handling more requests, improving concurrency through higher CPU utilization.
After introducing the CTA feature, we saw a double-digit percentage lift in sales. In our experience, building features for a user-facing, low-latency purchase funnel mandates hard performance requirements. Effectively using paradigms like reactive programming helped us deliver the required performance without additional hardware.
Credits: This blog would not have been possible without the support of my engineering leaders Gargi Dasgupta and Mohammad Akhtar Ali, and also my peers Sudipta Chatterjee, Mahendra Singh, and Ramya J.