System Design: Amazon Flash Sale!

umang goel
9 min read · Jun 3, 2023

Problem statement:

MyMobile is launching a new flagship phone today, available for purchase exclusively on its site starting at 10 AM. This is a highly awaited model, and the whole inventory is expected to sell out within 15 minutes.

The sale starts, and a large number of people have been waiting in front of their screens to click the buy button. But wait: users are seeing an error page on the website. What just happened? The website could not handle the sudden surge in traffic and crashed, leaving users frustrated!

This is a very common problem with flash sales, where a very large traffic surge is expected as soon as the sale starts and must be handled effectively.

Requirements

The example above is a typical flash sale use case. This type of sale has to be handled separately and has the following requirements:

  1. Sudden spike in traffic in a very short window of time.
  2. The number of items in inventory is usually much lower than the demand (for example, only 10k items on sale for millions of users).
  3. The system should handle concurrent reads and writes effectively so as to give users a good buying experience.
  4. The system should ensure that no item is overbooked or underbooked. Inventory in the reserved state should be freed for rebooking if the purchase fails.
  5. Bookings need to be completed in real time. We cannot keep users guessing whether they will get the booking or not.
  6. Users should get accurate information on the number of items available in inventory. During such sales, users refresh the page again and again to check how many items are still available.
  7. The system should be able to protect itself from script-based attacks.

Assumptions

Before diving deep into the design of this system, let's discuss some of the items that are out of scope for this discussion.

  1. An auth service already exists for managing user accounts and permissions.
  2. The backend services needed to complete the booking, such as payments and checkout, are already in place, and we will leverage them for our use cases.
  3. We are focusing on the flash sale system design, so we will assume that security concerns like auth, SSL, rate limiting etc. are already in place.
  4. We are also not focusing on the UI of the system.
  5. In the scope of this article, we will limit our sale to a single geographical region.
  6. Cancellations and refunds will be handled as usual for all bookings.

Now let's dive deep into the design and understand the different components of the system:

Components description

API Gateway

The API gateway will manage all the boilerplate functionality like rate limiting, SSL offloading, authentication etc. One of the requirements for the flash sale is that we allow only one booking per account, so that can be handled here as well. In the case of script-based attacks, the API gateway will also safeguard the system by blocking the offending requests.
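The one-booking-per-account rule can be sketched as a small admission check. This is only an illustration: the class and method names are hypothetical, and the in-memory set stands in for whatever shared store a real gateway would use so that all gateway instances see the same state.

```python
import threading


class OneBookingPerAccountGuard:
    """Hypothetical gateway-side filter: admits one booking attempt per account per sale."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()  # (sale_id, user_id) pairs; assumed stand-in for a shared store

    def try_admit(self, sale_id, user_id):
        """Return True the first time an account tries to book in a sale, False afterwards."""
        key = (sale_id, user_id)
        with self._lock:  # check-and-add must be atomic across concurrent requests
            if key in self._seen:
                return False
            self._seen.add(key)
            return True

    def release(self, sale_id, user_id):
        """Forget the account, e.g. after its booking was rolled back, so it may retry."""
        with self._lock:
            self._seen.discard((sale_id, user_id))
```

Requests rejected by try_admit never reach the booking service, so repeated clicks or scripted retries from one account do not consume inventory.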

Inventory Service

Before the sale starts, the admin will have to do the initial setup by entering the details of the sale into the system. For this purpose, the inventory service will have an API that does all the background work, like preparing the landing page, pushing the item details into the inventory queue etc.

A question might arise: why are we using the inventory queue* and not a DB to store the records? The reason for using a queue to fetch each unbooked item is to reduce the concurrency conundrum where N users are trying to book a single item. Each item in the queue can be fetched only once, so even if there are N concurrent requests, each item will be handed out only once, which leads to better inventory management. Also, as the load increases, the number of consumers on the queue can be increased so that more requests are handled at the same time.
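To see why a queue avoids double booking, here is a minimal simulation. It uses Python's in-process queue.Queue as a stand-in for the inventory queue (a real deployment would use a distributed queue), and the function name and item ids are made up for the example.

```python
import queue
import threading


def run_flash_sale(item_count, buyer_count):
    """Simulate buyers racing for items; returns (buyer_id, item_id) pairs for the winners."""
    inventory = queue.Queue()
    for n in range(item_count):
        inventory.put(f"item-{n}")

    winners = []
    winners_lock = threading.Lock()

    def buyer(buyer_id):
        try:
            # A queue hands each item to exactly one consumer.
            item_id = inventory.get_nowait()
        except queue.Empty:
            return  # sold out for this buyer
        with winners_lock:
            winners.append((buyer_id, item_id))

    threads = [threading.Thread(target=buyer, args=(i,)) for i in range(buyer_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However many buyers race, the number of winners never exceeds the number of items, and no item id appears twice in the result.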

Booking Service

The booking service is one of the core components of this entire system. It provides important APIs, such as one for booking an item and a rollback API for releasing items on hold. Before going into the API details and the DB schema, let's understand the event flow in the service.

  1. When the user starts the booking by clicking the book button for a product, the booking API of the booking service is invoked in the backend. It fetches one item from the inventory queue and creates the booking object.
  2. The booking object can be considered the cart for the booking; it contains data like user data, item information etc. The booking object is saved in the database. Here we use a NoSQL DB, as the need might arise to scale the database as the load increases. Also, secondary nodes can be leveraged for read operations.
  3. Each booking object has a unique booking id, which is passed to the checkout service to complete the booking.
  4. If the booking fails at any stage, the rollback API of the booking service frees up the item for further bookings.
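The flow above can be sketched roughly as follows. This is a simplified, single-process illustration: the class and method names are hypothetical, a dict stands in for the NoSQL store, and queue.Queue stands in for the inventory queue.

```python
import queue
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Booking:
    booking_id: str
    user_id: str
    item_id: str
    status: str = "hold"  # hold, booked, deleted
    created_time: float = field(default_factory=time.time)


class BookingService:
    def __init__(self, inventory):
        self.inventory = inventory  # inventory queue of unbooked item ids
        self.bookings = {}          # assumed stand-in for the NoSQL booking store

    def book(self, user_id):
        """Take one item off the queue and put it on hold; None means sold out."""
        try:
            item_id = self.inventory.get_nowait()
        except queue.Empty:
            return None
        booking = Booking(str(uuid.uuid4()), user_id, item_id)
        self.bookings[booking.booking_id] = booking
        return booking

    def confirm(self, booking_id):
        """Called once checkout succeeds: hold -> booked."""
        booking = self.bookings[booking_id]
        if booking.status == "hold":
            booking.status = "booked"

    def rollback(self, booking_id):
        """Called on failure or timeout: hold -> deleted, item goes back to the queue."""
        booking = self.bookings[booking_id]
        if booking.status == "hold":  # never release an item that is already booked
            booking.status = "deleted"
            self.inventory.put(booking.item_id)
```

Guarding both transitions on the hold status keeps a late rollback from releasing an item whose payment has already gone through.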

Note: As the load increases, the booking service instances can be scaled up automatically. When high load is expected based on historical data, the number of nodes can be increased before the start of the sale to handle the sudden traffic.

Book API:

POST: booking-service/book-item/{sale-id} // This API will create a new booking

Request:

{
  "bookingId": "",
  "status": "", // hold, booked, deleted
  "createdtime": "",
  "updatedtime": "",
  "userInfo": {},
  "bookinginfo": {
    "itemId": "",
    "price": ""
  }
}

PUT: booking-service/book-item/{sale-id}/{booking-id} // Updates the status of the booking from hold to booked

DELETE: booking-service/book-item/{sale-id}/{booking-id} // Rolls back the booking: updates the status from hold to deleted and releases the item

Checkout Service

The checkout service is responsible for taking the booking to completion by orchestrating payments, order management etc. After each step is completed, a log of that step is maintained by the checkout service. We can use a SQL DB here, as we might need to run join queries for reconciliation. This is a classic example of a distributed transaction using Saga orchestration.
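As a rough illustration of Saga orchestration, the sketch below runs a list of steps in order and, when one fails, runs the compensating action of every step that already completed, in reverse order. The step names and the run_saga helper are invented for this example and are not part of any library.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation); all three are illustrative
SagaStep = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[SagaStep]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in reverse order."""
    log: List[str] = []  # per-step log, like the one the checkout service keeps
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return log
        log.append(f"{name}: done")
        completed.append((name, compensate))
    return log


def declined_payment() -> None:
    raise RuntimeError("card declined")


events: List[str] = []
saga_log = run_saga([
    ("create_order", lambda: events.append("order created"),
     lambda: events.append("order cancelled")),
    ("take_payment", declined_payment,
     lambda: events.append("payment refunded")),
])
```

In this system, the compensation for a failed checkout would include calling the rollback API of the booking service so that the item goes back on sale.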

Consolidation Job

As mentioned in the object schema above, each booking has a status and a created time. If a booking is not completed within a specified period of time, it is considered dormant and the item that was reserved needs to be released. This job finds the dormant bookings and calls the inventory API to put their items back into the inventory queue for booking.
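A single pass of such a job could look like the sketch below. It reuses the field names from the booking object shown earlier; the function name, the hold_timeout_s parameter and the use of an in-process queue.Queue in place of the inventory API are assumptions made for the example.

```python
import queue
import time


def release_dormant_bookings(bookings, inventory, hold_timeout_s, now=None):
    """One pass of the job: expire stale holds and return their items to the queue.

    bookings is an iterable of booking dicts; returns the ids of released bookings.
    """
    now = time.time() if now is None else now
    released = []
    for booking in bookings:
        is_dormant = (
            booking["status"] == "hold"
            and now - booking["createdtime"] > hold_timeout_s
        )
        if is_dormant:
            booking["status"] = "deleted"
            booking["updatedtime"] = now
            inventory.put(booking["bookinginfo"]["itemId"])  # back on sale
            released.append(booking["bookingId"])
    return released
```

A scheduler would call this function periodically. The timeout should be longer than the checkout window so that bookings still in payment are not released.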

Events flow

Admin Flow:

The admin is the entity that updates the sale information in the system. For the flash sale, there will be a separate API in the inventory service that takes all the sale-related parameters as input and updates the inventory queue.

User Flow:

  1. The user logs in to the system with a username and password. Once the login is successful, the user can view the details of the flash sale.
  2. The flash sale page is served from the CDN. This page contains the description and details of the product on sale.
  3. The user can now select the product to book it.
  4. On the booking page, the user sees the current count of products left in inventory. Behind the scenes, a book API call is made to the booking service, which fetches one item from the inventory queue and also gets the metadata containing the count of products remaining in inventory.
  5. Each item fetched from the queue contains the item-id of the product, the price etc. The booking service generates a booking id and stores the booking information for the product in its DB, along with the information of the user who is booking it. After the booking record has been created, a call is made to the checkout service to complete the booking.
  6. Meanwhile, a timer is started on the client side to make sure that the whole process is completed in a timely manner. Since demand for the product is high, if a user does not complete the booking in time, the item is made available again by calling the rollback API of the booking service.
  7. The booking service calls the checkout service with the booking id to take the booking to completion by completing the payment.
  8. If the payment is completed, the booking service is notified and the status of the booking is updated to booked. If the payment fails or the transaction is not completed in time, the checkout service calls the rollback API of the booking service to free up the item.
  9. The checkout service also pushes the transaction logs into a stream, from which the data is read by the transaction logging service, order management service, order history service etc. to perform various post-booking operations.
  10. If the user closes the browser before completing the transaction, the record in the booking service remains in hold status. To handle such on-hold records where the booking ended abruptly, we can run a periodic job that reads them and pushes the items back into the inventory queue for other users to book.

Scaling

One of the major concerns in this system is handling the heavy surge of load, as the product is in very high demand and users will be waiting to book it as soon as the sale opens. The system should be able to handle this scale.

  1. As the landing page is served from the CDN, the user should be able to access the booking site after login without any issues.
  2. The user lands on the booking service on clicking the book button. We have made this part scalable by adding an inventory queue; as the load increases, the booking service can easily be scaled out, with more instances reading messages from the queue.
  3. The next question is how the write load on the booking service will be handled. Here we use a horizontally scalable database like MongoDB or DynamoDB. Also, as only one thread works on any one record at a time, writes do not contend with each other, which keeps operations fast.
  4. Similarly, the checkout service and payment service can be scaled independently as the load increases.

System characteristics

  1. Logging and monitoring using Kibana.
  2. As queues are used between the various components, the load can be throttled.
  3. As the service is deployed on the cloud using EKS, horizontal scaling is possible. The components are loosely coupled, so each one can be scaled independently of the others. Scaling criteria: throughput, CPU usage, memory usage.
  4. Secure against script attacks and DoS.

Important: * Using a queue to store the unbooked items is the right solution only when there is a single item in question. If there are more items, a DB is a better choice for storing the records of unbooked items. In this case too, one record can be maintained in the DB for each unbooked item, and a query can then be executed on each request to fetch the item to be booked.

DB Schema (Items_flashSale):
ItemID | ItemType | UserID | Status | UpdateTime

UserID will be null for items available for booking.
Status can be Available/OnHold/Booked

Now, on each request, the ask is to get an unbooked item and proceed with the booking. We can get the item to be booked using the following query.

Select top 1 ItemID from Items_flashSale
where ItemType = '<type>' and Status = 'Available';

But as already discussed, there is a large number of concurrent requests, so this query alone won't help: several requests can select the same item before any of them marks it as taken. The two-step process below is a better choice for fetching the item for booking.

Step 1:
Update Items_flashSale set UserID = '<userId>', Status = 'OnHold'
where ItemID = (
Select top 1 ItemID from Items_flashSale where Status = 'Available'
and ItemType = '<type>'
)

Step 2:
Select ItemID from Items_flashSale
where ItemType = '<type>' and UserID = '<userId>' and Status = 'OnHold'

Once the item is fetched for booking, the rest of the flow remains the same as described above.
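The two steps can be tried out with Python's built-in sqlite3 module. This is a sketch under a few assumptions: SQLite has no "top 1", so the subquery uses "LIMIT 1" instead; the helper names create_items and claim_item are made up; and an in-memory database stands in for the real one. A production database would additionally need its own row-locking semantics checked so that two concurrent updates cannot pick the same row.

```python
import sqlite3


def create_items(conn, item_type, count):
    """Create the Items_flashSale table with `count` available items."""
    conn.execute(
        "CREATE TABLE Items_flashSale ("
        "ItemID INTEGER PRIMARY KEY, ItemType TEXT, UserID TEXT, "
        "Status TEXT, UpdateTime TEXT)"
    )
    conn.executemany(
        "INSERT INTO Items_flashSale (ItemType, UserID, Status, UpdateTime) "
        "VALUES (?, NULL, 'Available', datetime('now'))",
        [(item_type,)] * count,
    )
    conn.commit()


def claim_item(conn, item_type, user_id):
    """Put one available item on hold for the user; returns its ItemID or None."""
    # Step 1: a single UPDATE both picks an available item and marks it as held.
    cur = conn.execute(
        "UPDATE Items_flashSale "
        "SET UserID = ?, Status = 'OnHold', UpdateTime = datetime('now') "
        "WHERE ItemID = ("
        "  SELECT ItemID FROM Items_flashSale "
        "  WHERE Status = 'Available' AND ItemType = ? "
        "  ORDER BY ItemID LIMIT 1)",
        (user_id, item_type),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # nothing was updated, so the item type is sold out
    # Step 2: read back the item that is now on hold for this user.
    row = conn.execute(
        "SELECT ItemID FROM Items_flashSale "
        "WHERE ItemType = ? AND UserID = ? AND Status = 'OnHold'",
        (item_type, user_id),
    ).fetchone()
    return row[0] if row else None
```

Checking the number of updated rows after Step 1 tells the caller right away whether the sale is over, without running Step 2.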

Please share any feedback, clarifications or improvements in the comments section. If you would like to discuss a design topic, please add it in the comments section.

Happy learning…
