Utilizing Amazon DynamoDB for Live Bidding System: A Case Study of Blibli.com
Warning! Please read this first: Blibli.com Bidding System
This article was created because many readers requested it and were curious about how Blibli.com's backend bidding system works.
Over the last few years, there's been a rising trend among Indonesia's e-commerce platforms to attract new customers and keep loyal ones. We know it as the "Flash Sale", where we put selected products on sale in limited stock, for a limited time, with attractive prices attached. Here at Blibli.com, we have ample experience in managing flash-sale campaigns. Last year, our CFO challenged us to create a flash sale with a different approach: a combination of an auction and a flash sale.
The concept is pretty simple: we give our users a selection of items that they can buy in the flash sale, but with a meager starting price. The main course (a Mini Cooper) starts at 1 USD, though other items could have different starting prices. Users can then place a bid by clicking on the items they want. Every click increases the item's price by a fixed amount, and when the timer ends, the last user who successfully placed a bid is the winner. The winner has to pay the amount s/he bid.
Moreover, the auction would be broadcast LIVE on a national TV channel.
Uh-Oh!
We found one problem we had to tackle first: A HELLUVA LOT OF BIDDERS in a single auction session. This is pretty much how most developers on our team reacted to the news.
Since everyone was shocked and frightened by the scale, we began working hard to find a solution.
What We Had
Back then, our system was built entirely on-premise. We had both MongoDB and PostgreSQL, but we didn't have any experience in setting up scale-out writes for them. We also didn't have any auto-scaling in our environments. While all of our deployments were already automated, we couldn't easily build and scale an environment quickly because of the traditional VM infrastructure we had.
All of that, in addition to our previous struggles in handling extreme sudden traffic spikes, left us wary of this next challenge. Yes, we said that we had ample experience in managing a flash sale, but not like this. Damn, we're screwed!
DynamoDB to the rescue
After a lot of thought, brainstorming sessions, explorations, and prototypes with varying degrees of success, we remembered that we had some unused old apps on AWS (even the EC2 instances were not running). The team that built those apps was then summoned to share their insights. Why did they choose AWS? What kind of architecture did they use for the apps? How was the performance?
After a lengthy discussion, we decided to use the AWS stack, specifically DynamoDB, to handle the data side of this flash sale. We wanted to focus on the data layer first, since we were sure we'd need an insane level of writes for this flash sale.
Implementation
What we did know at the time was that AWS DynamoDB is a highly performant database. However, the cost can be high, as the pricing model is calculated based on read and write requests. Based on our requirements, we needed to avoid calling the database when the data hardly changes. We chose to put Redis in front of DynamoDB as a caching layer. That way, we could reduce the cost at the expense of more development time.
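To make that concrete, here is a minimal sketch of the read-through pattern we had in mind, assuming the Jedis client and the AWS SDK for Java v1; the table name, key names, and TTL are ours for illustration, not our production values:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import redis.clients.jedis.Jedis;

import java.util.Map;

public class ItemPriceRepository {
    private final Jedis jedis;
    private final AmazonDynamoDB dynamoDb;

    public ItemPriceRepository(Jedis jedis, AmazonDynamoDB dynamoDb) {
        this.jedis = jedis;
        this.dynamoDb = dynamoDb;
    }

    public String getCurrentPrice(String itemId) {
        // Serve from Redis whenever possible: a cache hit costs us
        // nothing on the DynamoDB side.
        String cached = jedis.get("price:" + itemId);
        if (cached != null) {
            return cached;
        }
        // Cache miss: fetch from DynamoDB, then populate the cache.
        GetItemRequest request = new GetItemRequest()
                .withTableName("auction-items") // illustrative table name
                .addKeyEntry("itemId", new AttributeValue(itemId));
        Map<String, AttributeValue> item = dynamoDb.getItem(request).getItem();
        String price = item.get("currentPrice").getN();
        // Keep the TTL short; auction prices change quickly.
        jedis.setex("price:" + itemId, 1, price);
        return price;
    }
}
```

With a warm cache, repeated reads of data that hardly changes never reach DynamoDB, which is exactly where the per-request pricing would hurt us.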
Wait, WHAT?!
Without further ado, we divided the tasks within the team. Some worked on the core business logic, some worked on performance, and others researched ways to simplify the work and the cost. Amidst all the fun (read: chaos), our CTO invited some of his colleagues who work as AWS consultants.
The consultants also started to look into our design and architecture. After a bit of back and forth, they concluded that we could improve our design without using Redis for our specific use case.
Our first thought was, "Wait, WHAT?! Are you telling us not to reduce the cost however we can?" We remembered our previous calculation of the estimated cost, and everyone suddenly got a headache. Some had severe, terrible ones. "This would be a lot more expensive," we told ourselves.
On the other hand, though, we were curious about their solution. So we asked what they had in mind.
Oh!
Lo and behold, they suggested we use a more direct approach for DynamoDB, by using one of its features, DynamoDB Accelerator (DAX). Their introduction of DAX was quite short, but we were hooked. By then, apart from the team working on the core business logic, everyone else started to dig deeper into DAX: trying to understand its caching strategy and behavior, how to set it up for our own use case, and, most importantly, its pricing model.
The deeper we dug, the more we found that DAX offered more advantages than disadvantages. First, the API itself is the same as the basic DynamoDB API, as long as we declare that we are using DAX during bean creation. Thanks to our passionate team member, Indra, who discovered this.
Let us share more about it here. Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB. Since it operates in write-through mode and is compatible with existing DynamoDB API calls, our application interacts with DAX as if it were interacting with DynamoDB directly, using the same API calls, and DAX transparently handles whether to fetch data from DynamoDB or serve it from its cache.
This meant the only thing we needed to change to use DAX was adding an if statement to the bean creation. Even then, the logic and method names are precisely the same, with only a different client object required.
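We can't share the production code, but here is a minimal sketch of what that bean creation could look like, assuming Spring and the AWS SDK for Java v1 with the DAX client library; the `useDax` flag and the property names are our own illustration:

```java
import com.amazon.dax.client.dynamodbv2.AmazonDaxClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DynamoDbConfig {

    @Value("${aws.dax.enabled:false}")
    private boolean useDax;

    @Value("${aws.dax.endpoint:}")
    private String daxEndpoint; // DAX cluster endpoint, host:port

    @Bean
    public AmazonDynamoDB amazonDynamoDB() {
        // The DAX client implements the same AmazonDynamoDB interface,
        // so the rest of the code base never knows which one it talks to.
        if (useDax) {
            return AmazonDaxClientBuilder.standard()
                    .withEndpointConfiguration(daxEndpoint)
                    .build();
        }
        return AmazonDynamoDBClientBuilder.standard().build();
    }
}
```

Since both builders return an implementation of the same `AmazonDynamoDB` interface, every repository downstream keeps calling precisely the same methods, oblivious to whether DAX sits in between.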
Second, the pricing model was ideal for our needs, as DAX is priced based on node uptime, not per request. Lastly, there was community library support for it, which suited our tight deadline.
From a performance and cost-optimization perspective, we also leveraged Amazon DynamoDB On-Demand, a flexible billing option for DynamoDB. For tables using on-demand mode, instead of having to define Write Capacity Units (WCU) and Read Capacity Units (RCU) or an Auto Scaling policy upfront, DynamoDB instantly accommodates workloads as they ramp up or down to any previously observed traffic level. If the traffic hits a new peak, DynamoDB adapts rapidly to accommodate the workload. It uses pay-per-request pricing for read and write requests, so we only paid for what we used, making it easy to balance cost and performance.
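As an illustration, switching an existing table to on-demand mode is a single call; a sketch assuming the AWS SDK for Java v1, with a hypothetical table name:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.BillingMode;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class EnableOnDemand {
    public static void main(String[] args) {
        AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.standard().build();
        // No WCU/RCU or Auto Scaling policy to tune: the table now
        // scales with the traffic and bills per request.
        dynamoDb.updateTable(new UpdateTableRequest()
                .withTableName("auction-items") // illustrative table name
                .withBillingMode(BillingMode.PAY_PER_REQUEST));
    }
}
```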
After some more rounds of experimentation and iteration, we also found out that we didn't need that extreme level of writes if we updated some of our logic. We instead relied on Redis for caching, coupled with an optimistic lock.
Here’s the short version of our final process:
To minimize writes to the database, we utilize Redis as the protector of the bid. Every time the system receives a bid, it first checks Redis for an atomic value that functions as a lock. This value is only atomically incremented once we get a valid bid. When several bids come in at the same time, the first one to get the value from Redis is the valid one; this first bid goes through DAX, and the rest are rejected as invalid bids. This way, no matter how many bids we get in the same time frame, only one is written to the database.
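As a rough sketch of how such a gate could be built (our actual implementation differs in its details), assuming the Jedis client; the key name and the Lua compare-and-increment are our illustration:

```java
import java.util.Collections;
import redis.clients.jedis.Jedis;

public class BidGate {
    // Compare-and-increment in one atomic step: only the bid whose view of
    // the current sequence is still fresh advances it and wins the lock.
    // A missing key is treated as 0, i.e., the auction item just opened.
    private static final String CAS_SCRIPT =
            "local current = redis.call('GET', KEYS[1]) or '0' " +
            "if current == ARGV[1] then return redis.call('INCR', KEYS[1]) end " +
            "return 0";

    private final Jedis jedis;

    public BidGate(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true only for the single bid that wins the race for this item. */
    public boolean tryAcquire(String itemId, long sequenceSeen) {
        Object result = jedis.eval(CAS_SCRIPT,
                Collections.singletonList("bid:seq:" + itemId),
                Collections.singletonList(String.valueOf(sequenceSeen)));
        return ((Long) result) > 0;
    }
}
```

Only the caller that gets `true` proceeds to write the bid through DAX; every concurrent loser is rejected on the spot, without a single DynamoDB request.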
The details of this mechanism deserve their own detailed post, so that will be a story for another time, but if you can't hold it in any longer, feel free to ask our team member, Felix Wimpy :)
The D-Day
With a lot of effort from everyone, we managed to finish the project just in time for the flash sale. The show went live on air!
All of us who worked on this project were anxiously watching from our office. Then the first bid session started. The bids just kept pouring in, faster than we could have imagined.
To our surprise, nothing went wrong. There were no timeouts, no crashes, and no user complaints! We continued to monitor the system meticulously until the TV show ended, but nothing much happened other than the usual stuff you'll find in big promos like this (e.g., bots, fraudulent users). The system held its own.
Closing Statement
Building a system that can handle a huge spike and scale fast is never easy. You will always need to do a lot of experimentation and utilize every tool at your disposal. To be able to do that, you'll need tooling that is flexible and gives you many choices at the same time. For this specific use case, AWS gave us exactly that: a flexible platform with an exhaustive list of services to exploit. Look into the tools at your disposal; you may find there is more in them to utilize.
Should you plan and execute accordingly, it is always possible to have a smooth, 5xx-free promo and flash sale! We even handled up to 10 times our usual active user transactions, along with a lot of new users, without a hitch!
PS. Are you interested in learning about and working directly on this kind of challenge? Do you have an idea on how we can make our current system even better? Feel free to talk to me, or even join us and fix it!