Unique & Finite E-Commerce Coupon Codes: Surprisingly Tricky at Scale

Joe Guzzardo
Bluecore Engineering
4 min read · Jan 31, 2018

As an eCommerce personalization platform, Bluecore needs to be able to disseminate unique coupon codes to shoppers at scale in personalized emails. After a user uploads a list of coupon codes, we have two responsibilities. First, we must email each code exactly once. Second, we must accurately count how many codes have been sent so that the set of coupons can be refilled before it empties. These tasks seem fairly straightforward, and for a while at Bluecore, they were. But as our business grew and we started sending 250x the number of emails per day, our high volumes and the constraints we imposed started to crack the original implementation.

Because we send many emails at the same time, a naive implementation could lead to multiple emails containing the same coupon code. To avoid this, we used a database transaction. A transaction is a sequence of operations performed as a single unit of work. Either all of a transaction’s data modifications are performed or none are performed. Most importantly for us, modifications are isolated from other concurrent transactions. When we marked a coupon as used from within a transaction, it could not be seen as unused by any parallel transaction, ensuring that every coupon was issued exactly once.
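The idea can be sketched in a few lines. This is a minimal illustration using SQLite in place of Cloud Datastore; the schema and the `issue_coupon` helper are hypothetical, but the transactional pattern is the same: the select and the "mark used" update commit or roll back as one unit.

```python
import sqlite3

def issue_coupon(conn):
    """Claim one unused coupon atomically; return its code, or None if empty."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, code FROM coupons WHERE used = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        coupon_id, code = row
        # Marking the coupon used inside the same transaction means no
        # parallel transaction can see it as unused.
        conn.execute("UPDATE coupons SET used = 1 WHERE id = ?", (coupon_id,))
        return code

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coupons (id INTEGER PRIMARY KEY, code TEXT,"
    " used INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO coupons (code) VALUES (?)",
    [("SAVE10",), ("SAVE20",), ("SAVE30",)],
)

issued = [issue_coupon(conn) for _ in range(4)]
print(issued)  # each code appears exactly once; None once the pool is empty
```

With three coupons loaded, four issuance attempts hand out each code once and then return None, never a duplicate.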

Counting the number of emails delivered with coupons also faced a concurrency issue. Our database (the Google Cloud Datastore) doesn’t have an efficient operation for counting stored entities, so we created a single entity that held a “counter” field and incremented its value within a transaction to keep it accurate. In the spirit of doing the simplest possible thing that worked, we used the straightforward solution, tested it, and shipped the feature.
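That original single-entity counter looked roughly like the sketch below (again using SQLite as a stand-in; the table and helper are hypothetical). Note that every sender updates the same row, which is exactly the bottleneck that surfaces later in this story.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A single "counter" entity: one row that every process must update.
conn.execute("CREATE TABLE counter (id INTEGER PRIMARY KEY CHECK (id = 1), value INTEGER)")
conn.execute("INSERT INTO counter VALUES (1, 0)")

def increment_sent_count(conn):
    # The transactional read-modify-write keeps the count accurate, but
    # serializes all concurrent writers on this one row.
    with conn:
        conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")

for _ in range(5):
    increment_sent_count(conn)
print(conn.execute("SELECT value FROM counter").fetchone()[0])  # 5
```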

This worked well for several years. Initially, Bluecore sent emails at volumes that rose and fell fairly predictably, driven by each day’s shopping activity. Over time, our product supported much larger volumes of email and concentrated them in much shorter time frames: we were often sending more than 10x a former full day’s volume within a single hour!

One day, a user applied unique coupons to a very large personalized campaign of many millions of messages. The campaign stalled and caused backups in some of our delivery queues, which triggered alerts in our automated monitoring (Datadog via Veneur). It became a high-priority bug, and we immediately set out to find what went wrong and how to handle this 250x increase in volume.

We found that these errors occurred because both of our coupon transactions (marking a coupon used and incrementing the counter) forced every concurrent process to update the same single entity. This caused update contention. In a very large batch, only one email could make progress at a time while all other processes waited. They stalled for so long waiting on the single entity that they either exhausted concurrent database connection limits or timed out entirely.

When we queried our database for an unused coupon, it always returned the same one because results come back in a deterministic order. To fix this, we added randomization to the queries. At upload, we assign each coupon a random float value between zero and one. When looking for a coupon, we draw another random number and query for the unused coupon with the next value greater than it. Each query is now likely to return a different coupon, so the writes are spread roughly evenly across all coupons.

def get_unused_coupon():
    search_neighborhood = random(0, 1)  # random float in [0, 1)
    possible_coupon = (Coupon.filter(used=False)
                             .filter(rand_value >= search_neighborhood)
                             .order_by(rand_value)
                             .get())
    if not possible_coupon:
        # Wrap around: nothing above the random point, so search below it.
        possible_coupon = (Coupon.filter(used=False)
                                 .filter(rand_value < search_neighborhood)
                                 .order_by(rand_value)
                                 .get())
    if not possible_coupon:
        raise Exception("No more coupons.")
    return possible_coupon

Initially, we thought that was our only problem. However, when we tested it, we found that the processes were still getting stuck, but now they were stuck on the count of available coupons. To resolve this, we divided the counter into multiple entities, or “shards”, following a Google Datastore example. After issuing a coupon, we increment one of the shards at random. This ensures that the writes are spread across multiple entities instead of conflicting on one. One challenge is that we don’t really know how many shards we will need. Instead of setting a static value that could be too large or too small, we decided to start at a fairly low value and dynamically create new shards when necessary. We create new shards when we encounter contention errors. (That failed thread is retried after the new shards are created.) To find the total number of coupons used, we sum each of the shards.
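The sharded counter can be sketched as follows. This is a minimal in-memory illustration, not the Datastore API: the class and method names are hypothetical, and a real implementation would store each shard as a separate entity and retry the increment after growing the pool.

```python
import random

class ShardedCounter:
    """Minimal sketch of a sharded counter."""

    def __init__(self, num_shards=3):
        # Start with a fairly low shard count rather than guessing high.
        self.shards = [0] * num_shards

    def increment(self):
        # Pick a shard at random so concurrent writers rarely collide
        # on the same entity.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += 1

    def add_shard(self):
        # Called when contention errors appear: grow the pool dynamically.
        self.shards.append(0)

    def total(self):
        # The true count is the sum over all shards.
        return sum(self.shards)

counter = ShardedCounter()
for _ in range(100):
    counter.increment()
print(counter.total())  # 100
```

Reads become slightly more expensive (summing N shards instead of reading one entity), but that is a good trade when writes vastly outnumber reads, as they do when issuing coupons.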

Even though we solved our coupon use case simply, hidden issues remained that only manifested at scale. “As simple as possible, as powerful as necessary” is a design tenet at Bluecore Engineering. When the nature of our business changed and the original solution no longer fit, that’s when we upgraded the feature. This just-in-time scalability approach creates a lean product with low upfront costs that can evolve as necessary.
