Trendyol Coupon Journey: How Do We Deal With Race Condition in Wheel of Fortune
In this article, we will talk about how our Wheel of Fortune random algorithm works, what problems we encounter while distributing the coupon to the user, and what solutions we apply to these problems.
What is Wheel of Fortune?
Wheel of Fortune is a system divided into 8 equal parts (15–30–50 TL). Slices in the wheel consist of collectable coupons with certain limits. For example, 100,000 pieces of coupons worth 15 TL can be collected, and 50,000 pieces of coupons worth 30 TL can be collected.
When the user spins the wheel, we collect coupons with our random algorithm. This algorithm identifies one of the coupons that has not filled the limit to the user.
The First Algorithm We Developed
When we started the journey of our wheel of fortune algorithm, in the simplest form, we were making a random selection from a list containing the wheel slices. We were making this random selection with the random method of Kotlin’s List class. In short, random selection was not under our control here. We left the choice initiative to the method :)
There was a significant disadvantage in that random selection was not under our control. In high percentage, tended to choose small coupons. This left us in a difficult situation in the distribution of large coupons. That’s why we decided to take control of the algorithm :) After considering the solutions we can implement, the birth of our new algorithm, which we will talk about below, began.
The New Algorithm
First, the sum of the limit values of collectable coupons is taken. Let’s call this value totalWeight. Next, let’s generate a random number in the range of zero to totalWeight. Then, we create collectable coupons and a new Map List that stores the coupon itself and its range, with the beginning of the previous coupon’s limit +1 and the ending of its limit.
Finally, we define the random number that we produce, whichever range it corresponds to in this Map List, to the user. In this algorithm, the difficulty of distributing large coupons that we experienced in our first algorithm was avoided by increasing the random number from the range of 1 to 8 to the coupon limits that can be collected with 1.
Race Condition Problem
Wheel of Fortune Service runs on our Moon and Mars data centers (Trendyol works in two datacenters. We use planet names in datacenters). We have Couchbase clusters that contain the same data on both datacenters. After the coupon is defined to the user with the Wheel of Fortune Service algorithm, the collectCount value of that coupon is increased by 1 by going to these databases. However, an important handicap awaits us here.
For example, let’s say the coupon collectCount values of 5 TL in the databases on the Moon and Mars are 2. While collecting coupons, the user came across a 5 TL coupon on Moon DC which was defined in the database on Moon. This information was sent to the database on Mars(XDCR), but at the same time, another user on Mars came across a 5 TL coupon and first increased the database collectCount information on Mars. In this case, while the collectCount on Mars was supposed to be 4 at that time, it became 3 and the consistency problem arose. While expressing this with small numbers, there may be a perception that it will recover later on, but when we experience this scenario with large numbers, a serious inconsistency problem occurs. For example, when the 5 TL coupon, which has a limit of 100,000, is now full, the collectability of the coupon stops, but as a result of completing the pending transactions in the back, around 120,000 coupons may be defined. We also observed differences between the count of collectable coupons collected by the user and the count of coupons generated.
We use XDCR(Cross Data Center Replication) of couchbase feature for data bidirectional replication between two datacenters.
To solve this problem we tried optimistic locking. When applying this technic, it caused high response time and exceptions with bad user experience because we only have 8 document.
By abandoning this solution, we shook hands with the business team to distribute extra coupons instead of showing errors and giving users a bad experience. For this, it was the best solution for us to set the threshold and keep the coupon collection limit between the coupon collection limit and the threshold value, rather than reaching the actual limit. Although we cannot stop the race condition problem 100% due to the distributed architecture, we are in a position to meet the demands of the business team.
The way we followed to solve this problem
Instead of operating on the same document in database clusters on both DataCenters, we created a new document for this job in both databases and added two new documents primary and secondary for each wedge. So what do these primary and secondary documents do?
The primary document in the database of the user whose coupon is defined on the Moon is increased by 1 and this information is transmitted to the primary document of the database on Mars by XDCR. Likewise, the secondary document in the database of the user whose coupon is defined on Mars is increased by 1 and this information is transmitted to the secondary document of the database on the Moon with XDCR.
The reason for having two separate documents was to increase the number of collections with atomic counters in data centers.
Couchbase Eventing Service
In Couchbase, create-update-delete-expiry events on the bucket are detected as mutations. Eventing service, when a certain mutation occurs (such as create, update, delete),
- Change in other systems
- Cascade delete operation
- Instantly change document content
“Event-Condition-Action” model is applied in data changes. With the Eventing service, it is possible for software teams to realize more understandable business rules. However, the complexity of the infrastructure and database layers is reduced for the extra work that needs to be run during data changes.
As the coupon team, we use the Eventing Service to make the coupon limit reached if the collect limit has reached the threshold value by checking the threshold on the document each time the coupons are collected. For example, if we set the limit of the coupon as 500 and our threshold value as 90%, when the collection limit is 450, the coupon limit will be reached and collection will no longer be allowed.
While performing these operations, the primary and secondary documents are collected with the Couchbase Eventing Service feature in every operation performed on the database, and when the sum of the values in the two documents reaches the threshold (90%), the limit value of the relevant coupon is filled.
As new technology dependencies occur when we want to produce better solutions, we preferred to produce solutions using the technology we have. As a result of the solution, we see that the coupons collected by the users and the number of coupons we created are the same. We have achieved consistency here.
Conclusion
In this article, we have talked about the technical problems we encountered and the architecture we designed to solve these problems. In addition, we showed the technologies we use, and how we implement them and mentioned how we solved that problems. We used the existing solutions most properly by adapting them to our system. We have made improvements by taking the opportunities and solutions offered to us one step further in every situation.
We decide as a team on architecture and technology decisions and we are organizing analysis meetings. Join us to develop millions of coupons and high throughput applications. We have a lot of challenges.
And finally thanks for reading…