Generating Millions of Promotion Codes
As the Promotion Team, one of our responsibilities is to provide a system that generates unique promotion codes. In this article, I will talk about the design of this system as well as its scalability and fault tolerance.
A promotion is a discount applied to the shopping cart. A promotion can be created with a variety of conditions such as brand, category, seller, etc. My goal in this article is not to talk about the details of the promotion service and how it applies discounts to the shopping cart. You can find more about it in my colleague’s article. You may also listen to our Promotion & Coupon podcast here.
A promotion code must be provided by the customer for the promotion to be applied, but for us it is just another condition in the promotion apply logic. When a user enters a code, the Promotion API applies the promotion related to that code to the shopping cart. A promotion code can be generated as a unique code, or it can be defined by the user who creates the promotion. For example, MEGADETH21 can be defined as a promotion code by that user.
Some promotions may require millions of unique codes, and each code corresponds to a row in SQL Server. This row contains the id, promotionId, code, maxUsageCount and currentUsageCount columns. We have to be sure that the desired number of codes has been generated. Of course, a straightforward solution is possible, for example generating all the codes one after another in a single synchronous loop.
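To make the straightforward approach concrete, here is a minimal sketch of it. It is an illustration under assumptions, not our production code: the code format (10 uppercase alphanumeric characters), the function names, the maxUsageCount value of 1, and the in-memory dict standing in for the SQL Server table are all hypothetical.

```python
import secrets
import string

# Assumed code alphabet; the real format of our codes is not shown here.
ALPHABET = string.ascii_uppercase + string.digits


def generate_code(length: int = 10) -> str:
    """Return one random code (hypothetical format)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def generate_codes_naively(promotion_id: int, count: int, table: dict) -> None:
    """Insert `count` unique codes for one promotion, one row at a time.

    `table` is an in-memory stand-in for the database table, keyed by code.
    The `id` column is omitted because the database would assign it.
    """
    inserted = 0
    while inserted < count:
        code = generate_code()
        if code in table:
            # Collision with an existing code: try again with a new one.
            continue
        table[code] = {
            "promotionId": promotion_id,
            "code": code,
            "maxUsageCount": 1,  # assumed value for illustration
            "currentUsageCount": 0,
        }
        inserted += 1


table = {}
generate_codes_naively(promotion_id=42, count=1000, table=table)
print(len(table))  # prints 1000
```

Everything happens in one process and one loop, so a failure in the middle leaves us with an unknown number of codes, and the total time grows linearly with the requested count.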
Let’s look at this solution from the standpoint of fault tolerance. Our application runs on Kubernetes. While codes are being created, the database may go down, or a network disruption may occur between the Kubernetes nodes and the database. As software engineers, we try to develop reliable systems in an unreliable world, so we must consider all possible scenarios. Unfortunately, designing a 100% reliable system is impossible, but we should handle as many faults as possible.
Another important consideration is scalability. Even if we assume that everything works perfectly, a solution like the above is still unacceptable because it does not scale. As I said before, we need to generate millions of promotion codes, and our system should deliver them as fast as possible. If the system only needed to generate a few thousand unique codes, this solution would probably meet our requirements. But for millions of unique codes, it doesn’t meet the business requirements. Nobody wants to wait for hours.
Let’s have a look at our architecture:
1. To create a promotion with unique codes, a request is sent to the Promotion API.
2. The Promotion API first checks the number of codes to be generated, then creates events based on this count. I’ll demonstrate this with an example. The Promotion API has a batchCodeCount config. The number of events equals (number of codes to be generated / batchCodeCount), rounded down, plus 1 if there is a remainder. Assume that 5 million unique codes are requested from the Promotion API and batchCodeCount is 50 thousand. In this case, the Promotion API generates 100 events, each with a body that contains a promotionId and a codeCount.
In this body, codeCount indicates how many codes should be generated for this event, and promotionId indicates which promotion the codes belong to. Finally, the Promotion API produces the events to Kafka.
3. Instances of Promotion Code Generator consume events.
4. Promotion Code Generator sends as many requests to the Promotion API as the codeCount of the event, and the Promotion API generates a unique code for each request.
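The batching in step 2 and the consumer loop in step 4 can be sketched roughly as follows. This is a simplified illustration: only the promotionId and codeCount fields come from our event body, while `CodeGenerationEvent`, `split_into_events`, `handle_event`, and the `request_code` callback (a stand-in for the HTTP call to the Promotion API) are hypothetical names, and Kafka itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeGenerationEvent:
    promotionId: int  # promotion the codes belong to
    codeCount: int    # number of codes to generate for this event


def split_into_events(promotion_id: int, total_codes: int,
                      batch_code_count: int) -> list:
    """Split a request into events of at most `batch_code_count` codes."""
    full_batches, remainder = divmod(total_codes, batch_code_count)
    events = [CodeGenerationEvent(promotion_id, batch_code_count)
              for _ in range(full_batches)]
    if remainder:
        # The "+1" event that carries the leftover codes.
        events.append(CodeGenerationEvent(promotion_id, remainder))
    return events


def handle_event(event: CodeGenerationEvent, request_code) -> None:
    """Step 4: send one request per code to be generated."""
    for _ in range(event.codeCount):
        request_code(event.promotionId)


events = split_into_events(promotion_id=42, total_codes=5_000_000,
                           batch_code_count=50_000)
print(len(events))  # prints 100

uneven = split_into_events(promotion_id=42, total_codes=120_000,
                           batch_code_count=50_000)
print([e.codeCount for e in uneven])  # prints [50000, 50000, 20000]
```

Because each event is independent, any instance of the generator can pick up any event, which is what lets us add instances and partitions to go faster.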
With this solution, we have a scalable system. If we want to speed up code generation, we can horizontally scale the Promotion API and Promotion Code Generator. Additionally, we can increase the partition count of the topic and decrease the batchCodeCount config. When we decrease batchCodeCount, the Promotion API generates more events, so more codes can be generated simultaneously. Of course, even with this design, scaling is limited. The faster we generate codes, the more load we put on the database, and it is going to become a bottleneck sooner or later.
There is one big problem we have to deal with in this design. A fault may occur while Promotion Code Generator is sending requests to the Promotion API. If this happens, fewer codes than requested are generated, so our system needs to handle this.
We handle this situation as follows:
1. When a fault occurs, Promotion Code Generator produces an event to the error topic. This event has the same schema as the event produced by the Promotion API, but the value of codeCount is different. If Promotion Code Generator consumes an event whose codeCount is 50 thousand and a fault occurs after it has sent 10 thousand requests to the Promotion API, it stops sending requests and produces an event whose codeCount is 40 thousand to the error topic.
2. Kafka Shovel is responsible for moving events between topics. It runs every 5 minutes and moves events from the error topic to the retry topic. It also adds a retry count to the event header. Each time an event is moved from the error topic to the retry topic, the retry count is increased by 1. Once the retry count reaches 5, Kafka Shovel no longer moves that event. This article contains further information about Kafka Shovel. Kafka Shovel’s source code is also available here.
3. Promotion Code Generator consumes events from the retry topic and continues sending requests to Promotion API.
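The retry flow above can be sketched like this. It is a toy model under assumptions, not the actual Promotion Code Generator or Kafka Shovel code: topics are plain Python lists, the `retryCount` header name is hypothetical, `request_code` stands in for the call to the Promotion API, and I assume headers are carried over when the remainder event is produced.

```python
from dataclasses import dataclass, field

MAX_RETRY_COUNT = 5  # events with this retry count are no longer moved


@dataclass
class Event:
    promotionId: int
    codeCount: int
    headers: dict = field(default_factory=dict)


def handle_event(event: Event, request_code, error_topic: list) -> int:
    """Send one request per code; on a fault, publish the remainder."""
    generated = 0
    try:
        for _ in range(event.codeCount):
            request_code(event.promotionId)
            generated += 1
    except Exception:
        # Stop sending requests and hand the missing codes to the error topic.
        remaining = event.codeCount - generated
        error_topic.append(
            Event(event.promotionId, remaining, dict(event.headers)))
    return generated


def shovel(error_topic: list, retry_topic: list) -> None:
    """Move events to the retry topic, incrementing their retry count."""
    exhausted = []
    while error_topic:
        event = error_topic.pop(0)
        retries = event.headers.get("retryCount", 0)
        if retries >= MAX_RETRY_COUNT:
            exhausted.append(event)  # stays in the error topic
            continue
        event.headers["retryCount"] = retries + 1
        retry_topic.append(event)
    error_topic.extend(exhausted)


# Simulate a Promotion API that becomes unreachable after 10,000 requests.
calls = {"n": 0}


def flaky_request_code(promotion_id: int) -> None:
    if calls["n"] == 10_000:
        raise ConnectionError("Promotion API unreachable")
    calls["n"] += 1


error_topic, retry_topic = [], []
handle_event(Event(42, 50_000), flaky_request_code, error_topic)
print(error_topic[0].codeCount)  # prints 40000

shovel(error_topic, retry_topic)
print(retry_topic[0].headers)  # prints {'retryCount': 1}
```

The key point is that the remainder event only asks for the codes that are still missing, so a retry never regenerates the 10 thousand codes that already exist.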
I hope you liked the article. See you in the next one 🙂