How Event-Driven Architecture and Microservices Help Us to Scale to the Next Level

Published in Traveloka Engineering Blog, Feb 28, 2020

Editor’s Note: This blog post is written by Felix Perdana, the Engineering Manager for the Issuance and Post-Issuance (IPI) Team. He covers a number of topics in the realm of Issuance Services: why they matter, the struggles that he and his team encountered and addressed, and the improvements that they (continue to) bring about to strengthen this vital business service for Traveloka.

The IPI team is a branch of Traveloka’s platform group, catering to the needs of all of the company’s product verticals, especially in Issuance and Post-Issuance areas such as refund and reschedule.

Introduction

In this blog post, we look at one of the oldest services in Traveloka, which handles one of the most vital flows in our business: the Issuance service. We will cover why this service is so important, the problems of the legacy system along with its (simplified) architecture, the new (also simplified) architecture that we chose and the reasoning behind it, and finally how the migration to this new system helps the company reach the next level of scale. This initiative not only solved the technical problems; it also improved the customer experience, which we cover in the Bonus point section.

Importance

A user’s transaction journey is not complete until they receive the product they purchased. Compared to other services in Traveloka, Issuance brings another level of chaos and madness when it fails. A failure in other flows (search, booking, payment, etc.) would indeed cost the company some transactions, but in normal cases the customers would probably come back later.

If the issuance service breaks down, on the other hand, a whole set of additional problems comes along. Some of them are:

Losing credibility: Customers have already booked and paid. A delay or failure in delivering what they expect is the last thing we want as a customer-centric company. In certain (usually time-sensitive) cases, the customer might not be able to get that specific inventory anymore because of high demand (e.g. last-minute bus tickets, concerts, etc.).
Financial loss: Price changes, service upgrades, reimbursements, etc. once the booking time limit expires.
Spikes of calls to customer service demanding explanation.
And many more.

The Struggle

Depicted above is the old architecture of the issuance system before the migration. Here’s how it works:

Payment service will verify whether a booking has been fully paid. After the payment is confirmed, the service will update the booking_status to “PAYMENT_VERIFIED”.
A scheduler living in the old-issuance-service will periodically check the database with a query similar to:
SELECT booking_id, booking_type FROM booking_db
WHERE booking_status = 'PAYMENT_VERIFIED'
LIMIT 100
The old-issuance-service will run some common issuance logic before dispatching a more product-specific issuance job to the respective product domain aggregator (flight, accommodation, etc.) based on the booking_type.
The product aggregator service will then connect to the third party (the inventory owner) and submit the inventory-issuance command.
Product aggregator service will then update the booking_status to “ISSUED” so that it won’t be picked up again on the next batch.

Now here’s where the problems lie:

There is only 1 instance of the old-issuance-service. Running multiple instances would cause a race condition between the schedulers and thus result in duplicate issuance. Adding a flag so that the same booking won’t be picked up simultaneously is also not preferable, as it requires modifying the database.
Having only 1 instance also means that every time we push changes to production, there is a downtime of around 5–10 min.
The service is not scalable for bigger traffic.
Lots of unknown logic from all products is dumped there. The code is chaotic, and you can imagine the trouble when someone needs to modify something there.
And much more madness that causes sleepless nights.
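
To make the race condition in the first point concrete, here is a minimal sketch of what happens when two scheduler instances poll the same table before either has written the ISSUED status back. It is an illustration under assumptions, not the actual old-issuance-service code: it uses an in-memory SQLite database, the column types are guessed, and the helper names (make_db, poll, issue) are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory table with three paid bookings (schema is assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE booking_db ("
        "booking_id INTEGER PRIMARY KEY, booking_type TEXT, booking_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO booking_db VALUES (?, ?, ?)",
        [(i, "flight", "PAYMENT_VERIFIED") for i in range(1, 4)],
    )
    conn.commit()
    return conn


def poll(conn, limit=100):
    """The scheduler's read step: fetch bookings that are waiting for issuance."""
    return conn.execute(
        "SELECT booking_id, booking_type FROM booking_db "
        "WHERE booking_status = 'PAYMENT_VERIFIED' "
        "ORDER BY booking_id LIMIT ?",
        (limit,),
    ).fetchall()


def issue(conn, batch, issued_log):
    """Issue every booking in the batch, then mark it as ISSUED."""
    for booking_id, _booking_type in batch:
        issued_log.append(booking_id)  # stands in for the third-party call
        conn.execute(
            "UPDATE booking_db SET booking_status = 'ISSUED' WHERE booking_id = ?",
            (booking_id,),
        )
    conn.commit()


conn = make_db()
issued_log = []

# Two instances poll before either one has updated the status.
batch_a = poll(conn)
batch_b = poll(conn)
issue(conn, batch_a, issued_log)
issue(conn, batch_b, issued_log)

duplicates = sorted(b for b in set(issued_log) if issued_log.count(b) > 1)
print("issued:", issued_log)
print("issued more than once:", duplicates)
```

Both instances read the same three rows, so every booking is issued twice. Nothing in the read-then-update sequence stops the second reader, which is why the old service had to stay at a single instance.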

The Enlightenment

Based on these problems, the requirements of the new system can be simplified to three things, with performance as a bonus after the migration:

Highly available
Reliable
Scalable
Faster (bonus)

Below is the new architecture for the Issuance Service that could achieve the requirements above.

It doesn’t look very different at a glance, but if we look closely, a few things have changed:

We use a queue to start a job instead of relying on changes in the database. Payment notifying issuance (as in the old architecture) is basically an event-driven mechanism, so we just need to use the correct tool for it, without incurring any major changes.
By going with this method, we achieve a couple of things. First, we take some load off the database. Second, we avoid the need to modify the existing database structure. Together these improve availability, scalability, and reliability.
Since the queue that we use delivers at least once, we need to add a simple duplicate checker to ensure that the same job is never executed twice. In this case, we use DynamoDB for simplicity (easy to deploy, managed service, low cost).
This DynamoDB table also serves a second purpose as a throttling mechanism. If one or more products behave unexpectedly, we can throttle issuance for that particular product so that it does not impact the others. This new mechanism also helps improve availability.
Also notice the direction of the arrow: from the “New Issuance Service” to the “Message Queue”, not the other way around. We chose to have the jobs polled from the queue instead of having them pushed by a Notification Service. The reason is that we wanted to avoid the fire-and-forget characteristics of the Notification Service and have better control of the message flow (retrying, ensuring that a message never gets lost). The cons? The process is not executed in real time. However, being able to bring the polling interval down to as low as twice per second was already more than enough for us, so we are fine with sacrificing a bit of speed for much better reliability.
In addition, we refactored the issuance service to be stateless (it was stateful before) so that we can scale the number of instances as needed. This improves scalability.
Part of the refactor was also removing unused logic and moving code to the appropriate services. We made this service as thin as possible, adhering as much as possible to the microservice and single-responsibility concepts. This improves reliability.
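
The duplicate checker and the per-product throttle described above can be sketched as follows. This is a simplified illustration under stated assumptions, not our production code: the queue is a plain in-memory deque, DedupStore is an in-memory stand-in for the DynamoDB table, and the names Job, DedupStore, IssuanceWorker, claim, and poll_once are all hypothetical. With a real DynamoDB table, the claim step would typically be a conditional write that only succeeds when the key does not exist yet, which keeps the check atomic across instances.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    booking_id: str
    booking_type: str


class DedupStore:
    """In-memory stand-in for the duplicate-check table (assumption)."""

    def __init__(self):
        self._seen = set()

    def claim(self, booking_id):
        """Return True only for the first caller that claims this booking."""
        if booking_id in self._seen:
            return False
        self._seen.add(booking_id)
        return True


class IssuanceWorker:
    """Polls jobs from the queue, skipping duplicates and throttled products."""

    def __init__(self, queue, store, throttled_types=()):
        self.queue = queue
        self.store = store
        self.throttled_types = set(throttled_types)
        self.issued = []  # stands in for dispatching to the product aggregator

    def poll_once(self, batch_size=10):
        """Pull up to batch_size jobs; return how many were issued."""
        issued_now = 0
        deferred = []
        for _ in range(min(batch_size, len(self.queue))):
            job = self.queue.popleft()
            if job.booking_type in self.throttled_types:
                deferred.append(job)  # keep it for a later poll
                continue
            if not self.store.claim(job.booking_id):
                continue  # duplicate delivery from the at-least-once queue
            self.issued.append(job.booking_id)
            issued_now += 1
        self.queue.extend(deferred)  # throttled jobs go back on the queue
        return issued_now


# "B1" is delivered twice, and the "hotel" product is currently throttled.
queue = deque([
    Job("B1", "flight"),
    Job("B1", "flight"),
    Job("B2", "hotel"),
    Job("B3", "flight"),
])
worker = IssuanceWorker(queue, DedupStore(), throttled_types={"hotel"})

first = worker.poll_once()       # issues B1 and B3, defers B2
worker.throttled_types.clear()   # the throttle on "hotel" is lifted
second = worker.poll_once()      # issues B2
print(first, second, worker.issued)
```

Because the worker keeps no state of its own beyond the shared store and the queue, any number of instances can run this loop side by side. Note that this sketch claims a booking before issuing it; a real implementation also has to decide what happens if the instance fails between those two steps.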

Bonus point

Since we now have more instances and leaner processes, we thought we could also do something to improve the speed of issuance. We set the number of workers based on the number of instances and shortened the delay before a new issuance batch starts. The job-picking interval went from once per minute (a limit imposed by the load the database could handle) to once every 0.5 seconds (a 120X increase!).

The impact was felt immediately by customers, especially for inventories that usually need fast issuance: attraction/movie tickets, eats vouchers, last-minute hotel check-ins, etc. Before, they needed to wait a considerable amount of time before receiving their ticket; now it is almost instantaneous. This bonus impact makes the team proud of the achievement.
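
For readers who want to see where the 120X figure comes from, here is the arithmetic as a short sketch. The two intervals are taken from the paragraph above; the average-wait estimate is our own illustration and assumes that payments arrive at a uniformly random moment within a polling interval.

```python
old_interval_s = 60.0  # one job-picking run per minute
new_interval_s = 0.5   # one job-picking run every 0.5 seconds

# How many times more often a batch is picked up.
speedup = old_interval_s / new_interval_s

# Average time a paid booking waits before the next poll picks it up,
# assuming arrivals are uniformly distributed within an interval.
old_avg_wait_s = old_interval_s / 2
new_avg_wait_s = new_interval_s / 2

print(f"polling frequency increase: {speedup:.0f}X")
print(f"average wait before pickup: {old_avg_wait_s}s -> {new_avg_wait_s}s")
```

Under that assumption, the worst-case wait before pickup drops from 60 seconds to 0.5 seconds, and the average from 30 seconds to 0.25 seconds.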

The Impact

Not long after the refactor and the migration were completed, Traveloka ran flash sales again, for the first time after several years in which this kind of activity was discouraged. The scalability of the issuance system was an important factor in supporting this initiative. Even when traffic spiked to 10X at one point, the system held up.

Moving forward, Traveloka is becoming more flexible in determining its marketing strategy, without having to worry about the technical aspects of one of its most critical flows. The system is expected to bear 20X or even 50X spikes with just a little configuration.

If you are into this kind of challenge (unleashing the company’s potential, being part of the team that can bring the company to the next level) or can see things that could be improved, feel free to reach out to us and we will be more than happy to get back to you. Or if you would like to join our team, visit Traveloka’s Career page to see the open opportunities.