The Epic Preparation of EPIC SALE

Puspacinantya
Traveloka Engineering Blog
6 min read · Jul 8, 2019

Let’s be real. No matter how much money you have, you are always going to love a good deal. Purchasing things at a cheaper price always feels like a victory, and Traveloka knows it. As the staycation and leisure trends prosper, the company decided to show its commitment to facilitating customers’ travel experiences through the EPIC SALE.

You might be wondering what EPIC SALE is. Here’s a brief explanation: it was a four-day program of great Traveloka promos across various products, such as flights and accommodations. Customers could also hunt for even better prices during the epic hour, from 6 to 7 PM each day from April 24th to 27th, 2019. This epic hour was a real deal, as it offered discounts of up to 80% on numerous hotels across Indonesia and the Southeast Asia region.

Can you imagine how many visitors were drawn to purchase, or at least access the site? The numbers were even higher than we expected. Now the question is: how did we manage to handle the extreme peak of traffic without failing? How did we ensure that this program did not turn into one big, historic mess?

Stress testing

When intense loads come in, we want to make sure that our services are able to accommodate the traffic. This is why stress testing is needed.

Traveloka relies on estimation to predict the number of visitors. Once we receive the number from the business unit, we have to check whether our system can handle the sudden increase in traffic. Before we even started EPIC SALE, we expected the hotel search API alone to receive 7x our baseline traffic. We did a scalability analysis: a methodology to assess the system, identify bottlenecks, and improve its scalability. Assessing, in this context, means establishing the performance baseline of the system. Once the assessment is done, we move on to identifying bottlenecks. One way to identify them is by using a stress test runner.

There are actually many stress test runners available on the market, but since our test server is located in a VPC (Virtual Private Cloud) and is not accessible from the public internet, we had to build our own stress testing framework. To simulate the real traffic, we need a distributed stress test runner, since a single server will not be adequate to simulate our load. The next question is: “How many servers do we actually need to simulate the load?” That depends on the maximum number of requests each stress test server can generate; from it, we can derive the number of servers needed to generate the targeted traffic, as sketched below.
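As a back-of-the-envelope illustration (the figures here are hypothetical, not our actual numbers), the fleet size is just the target load divided by the per-server capacity, rounded up:

```python
import math

# Hypothetical figures; plug in your own measurements.
target_rps = 70_000      # traffic we want to simulate (e.g., 7x baseline)
per_server_rps = 5_000   # max requests one stress test server can generate

servers_needed = math.ceil(target_rps / per_server_rps)
print(servers_needed)    # -> 14
```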

Our platform is built on AWS, so you will see many AWS services in use at Traveloka, such as S3 and Lambda. These two services are also used in this testing process, together with JMeter.

The complete flow of our stress testing mechanism.

We only input two things into S3 to do a stress test: the input data (in JSON) and the test script (in .jmx format). In this flow, Lambda helps us in creating multiple JMeter machines to run the stress test. With Lambda, we are able to create a function that determines the number of machines needed. While the stress test is still ongoing, we monitor our Datadog dashboards to get insights on the system metrics (such as memory and CPU utilization, latency, error rate, and JVM memory pattern of our services). Once the stress test is over, the machines will give back a report to S3 and the machines will be destroyed by a Lambda function. There will also be another Lambda function that sends Slack notification after the testing is done.
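To make the flow concrete, here is a minimal sketch of what the provisioning Lambda could look like. The bucket name, key, AMI ID, and instance type are all hypothetical placeholders, not our production configuration:

```python
import json
import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

def handler(event, context):
    # Triggered when the test input lands in S3. The JSON input is
    # assumed to carry the number of JMeter runners to launch.
    obj = s3.get_object(Bucket="stress-test-bucket", Key="input.json")
    config = json.loads(obj["Body"].read())
    count = config["runner_count"]

    # Each runner is assumed to pull the .jmx script from S3 on boot,
    # run JMeter, and upload its report back to S3 when finished.
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical JMeter AMI
        InstanceType="c5.xlarge",
        MinCount=count,
        MaxCount=count,
    )
    return {"runners_launched": count}
```

A companion Lambda, triggered by the report uploads, would then terminate the instances and post the Slack notification.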

By monitoring our system performance, we could easily spot that some engineering practices inside the team were not well optimized and thus needed improvement. This testing opened up opportunities for enhancement and showed us where the system falls short.

Then, what should we do to enhance the system?

Looking back at the code and optimizing it

Sometimes it is good to look back on how far we have come and to see if there is any room for improvement. We monitored and evaluated our code again to see where to optimize. The evaluation works by finding which parts of the code might be problematic, verifying right away whether they actually are, and confirming that, once fixed, they stay stable without any disruption.

The evaluation resulted in multiple revamps of the application. One of the revamps was associated with fetching the rate structure for hotels. As we looked closely into the code, we figured out that we could improve it by implementing an LRU caching scheme. The idea of an LRU cache is to evict the least recently used data to make room for new entries that are used more frequently. This resulted in much better performance when fetching a hotel’s rate structure.
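Here is a minimal sketch of the idea; the class, capacity, and helper function are illustrative, not our actual implementation (in Python, functools.lru_cache offers the same behavior out of the box):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # drop least recently used

def fetch_rates_from_backend(hotel_id):
    # Stand-in for the real downstream call.
    return {"hotel_id": hotel_id, "rates": []}

rate_cache = LRUCache(capacity=10_000)       # capacity is an assumption

def get_rate_structure(hotel_id):
    rates = rate_cache.get(hotel_id)
    if rates is None:                        # cache miss: hit the backend once
        rates = fetch_rates_from_backend(hotel_id)
        rate_cache.put(hotel_id, rates)
    return rates
```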

We also used to have a single big cache for all kinds of data. This resulted in busy cache operations, since all data requests came to this one cache. We dismantled it and created separate caches for different types of data.
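Reusing the LRUCache sketch above, the change looks roughly like this (the data types and sizes are assumptions for illustration):

```python
# Before: one shared cache that every data type competes for.
shared_cache = LRUCache(capacity=50_000)

# After: one cache per data type, sized independently, so a hot spot in
# one type no longer evicts entries or contends with the others.
rate_cache = LRUCache(capacity=10_000)
availability_cache = LRUCache(capacity=20_000)
hotel_detail_cache = LRUCache(capacity=5_000)
```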

Another problem that we faced was the chatty I/O antipattern. It basically means that one API request can generate a large number of I/O requests to multiple backend services. While each I/O request carries some overhead by itself, a large number of I/O requests accumulates that overhead and slows down the overall performance. We solved it by batching the I/O requests, so one API request generates at most one I/O request to each of the other backend services.
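A minimal sketch of the difference, with a hypothetical rate service standing in for a real backend:

```python
class RateService:
    """Stand-in for a downstream backend service."""

    def get_rate(self, hotel_id):
        # One network round trip per call.
        return {"hotel_id": hotel_id, "price": 100}

    def get_rates(self, hotel_ids):
        # One network round trip for the whole batch.
        return {h: {"hotel_id": h, "price": 100} for h in hotel_ids}

rate_service = RateService()

# Chatty: len(hotel_ids) round trips, each paying its own overhead.
def get_rates_chatty(hotel_ids):
    return {h: rate_service.get_rate(h) for h in hotel_ids}

# Batched: a single round trip carrying all the IDs at once.
def get_rates_batched(hotel_ids):
    return rate_service.get_rates(hotel_ids)
```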

Can you see how influential stress testing and monitoring have been up to this point? As Imam Habibi, one of the lead engineers at Traveloka, said, monitoring was essential in preparing for EPIC SALE. If we had not looked at the results of stress testing and monitoring, we would not have known where to optimize, resulting in poor performance when faced with high traffic.

The execution

On each day of the EPIC SALE, we gathered together to prepare for the EPIC HOUR: the most intense one-hour moment, not only for engineers but for other teams as well. Since customers’ enthusiasm for this program grew significantly higher every single day, on the last EPIC SALE day we received almost 12x our baseline traffic. It was way higher than we thought it would be (we had only stress tested the system up to 10x traffic), but we were ready, thanks to our ASG (Auto Scaling Group) implementation that worked wonders and the teams’ instant response in mitigating the problem. The hiccup lasted only for the first five minutes. Our engineers even managed to take a group picture in the war room with many other EPIC SALE heroes, such as product managers, analysts, operations, and marketing teams. A victorious moment.

The brilliant minds behind EPIC SALE.

If solving difficult problems at a fast-growing tech company sounds a lot like you, join us and help millions of users discover their next big trip and experience with Traveloka! Check out our careers page to see the available opportunities. A thrilling adventure awaits!
