Enough with the Spamming: Enforcing Email Frequency Caps
Among the many engineering challenges we face at Bluecore, an interesting problem is implementing per-user email frequency caps. We call it the Minimum Time Between Emails (MTBE) constraint. This is a guarantee we provide our clients promising that in a given time window, we won’t send more than one email message to any one person.
As it turns out, this seemingly innocuous constraint is not so easy to implement. The main hurdle here is the fact that we are working with a distributed system — at scale, you have to design for fault tolerance as drives and networks will fail as a statistical reality.
Let’s attempt a simple solution to the problem to see what could go wrong. I also invite the reader to come up with a solution of your own before reading further, and see if you can spot any caveats with it.
One way of enforcing the MTBE constraint is the following:
- For each customer, store the timestamp of the last email sent to this customer in a database
- Before sending the email, check that the difference between the current timestamp and the last send time to this customer is greater than the MTBE. If not, abort sending this email.
We could even translate this into Python code:
def send_email(customer, email, mtbe): current_ts = datetime.utcnow() if current_ts — customer.last_sent_ts < mtbe: return # do the actual email send through some API email.send(...) customer.last_sent_ts = current_ts customer.save()
Unfortunately a couple things can and do go wrong with this implementation:
- If the server crashes after the call to send the email but before saving the customer object, then we’ve left the database in an inconsistent state and the next send to this customer could violate MTBE. Bluecore uses many thousands of servers at any point of time, so this is a failure mode we must design for.
- If two threads are concurrently executing this code and sending to the same customer, there is a possible race condition allowing the MTBE check to succeed on both threads and the emails to be both sent out.
So this approach is far from perfect. We could perhaps wrap this function in a transaction, or pre-emptively set the timestamp before sending the email, etc. We won’t elaborate any further here as the purpose of this exercise is to show the things that tend to go wrong in real world implementations.
Prior Art: Exactly-once Message Processing
The MTBE constraint bears remarkable resemblance to another well-studied problem in distributed systems — exactly-once message processing. Let’s dive into that problem a little as that should help with our discussion.
The problem is that in a lot of applications, we would like to reliably deliver a message from one machine to another and process it, exactly once. There are tons of articles written on this topic already, making all sorts of claims. The conclusions arrived at usually depend on what assumptions you can make about the underlying system. In particular, this is perfectly achievable if you can make processing the message an idempotent operation — an operation that results in the same system state no matter how many times you perform it. For example, setting an integer to the value 7 is idempotent, whereas incrementing it by 1 is not.
Typically, if processing the message has no side-effects out of the system you control, idempotency is achievable. To be more specific, suppose both the sender and the receiver of the message are application programs you maintain, and processing the message involves storing some data in a transactional database, then you can have exactly-once semantics. The usual implementation consists of two parts: the sender is responsible for retrying a message until it gets acknowledged by the receiver, and the receiver is responsible for making sure the processing of the message is idempotent. One way of ensuring idempotency is through deduplicating messages on the receiver, where the deduplication logic should be in the same transaction that does the message processing. Once you nail that, idempotency is achieved.
On the other hand, if the no-side-effects condition cannot be assumed, say processing the message also sends out an email, then exactly-once semantics quickly goes out the window. Now we can no longer make the operation idempotent since retrying a message will always result in the side-effect happening again.
As it turns out, this scenario is exactly where Bluecore operates. The final stage of our email pipeline involves making an HTTP call to send the email through an Email Service Provider (ESP), which is a side-effect on an external system. The result of this is that violations to exactly-once processing can happen. Specifically, the above figure shows one such case where double sending occurs because of retrying.
There are two counterparts to exactly-once message processing semantics: at-most-once and at-least-once. After realizing that exactly-once is not on the table, we can choose to go either way. In practice, we implement at-least-once semantics to take advantage of retries on failures. More on this later.
Back to MTBE
Let’s get back to our discussion of MTBE a bit. Message processing semantics is relevant to MTBE since it is akin to a variation of at-most-once processing: if you cannot send more than one email to the same customer in some time window (MTBE), you obviously cannot send the exact same email more than once in a row (at-most-once). Exactly-once semantics doesn’t apply here since MTBE doesn’t say anything about having to deliver the email. For the sake of MTBE, if a delivery fails for some unknown reason, we can just forget about it. This means we can enforce MTBE perfectly if we want. For instance, we can transactionally check whether MTBE is respected and update the customer’s timestamp. If the transaction succeeds, we then deliver the email. If any of this fails we move on. Absolutely no retries are attempted. The process is illustrated below.
In practice, this approach is not so great. Getting rid of retries means we are sacrificing a lot of emails that should have been sent: if the ESP is experiencing an outage or planned maintenance downtime, we shouldn’t be dropping emails without retrying them! The same reasoning goes for a lot of other types of errors.
So in fact we do have retries, but let’s first look at the architecture of our email pipeline. Given a customer from our audience generation engine, the email pipeline is responsible for assembling the email and delivering it through an ESP. Something like this:
Not so bad, right? One hidden aspect of this diagram is what the arrows really represent. We make extensive use of asynchronous task queues (Google App Engine’s Task Queues to be specific) throughout our infrastructure. This is how we pass around data in the above diagram, so the arrows are really task queues. In the context of MTBE, we really care about one such queue: the one between the processing service and the delivery service:
Using a task queue here is a deliberate choice on our part that addresses the problem mentioned before: to make sure emails do not stop sending because of some intermittent failure in our infrastructure or at the ESP. As with most task queues and streaming frameworks, App Engine’s Task Queues allow you to configure what kind of retry behavior is desired. We use an exponential backoff policy with indefinite retries by default, so if some unanticipated error shows up we will keep retrying the send until it succeeds.
What We Used to Do
With task queues in the picture, we can flesh out our approach a bit more. Here’s what we did with regards to the MTBE problem in the past: inside a transaction, we check and update the customer timestamp, and then schedule the email for delivery if the check succeeds. The task queue targets the delivery service, retrying the email delivery if there are any failures.
The great thing about this implementation is that we can make some nice intermediate guarantees. App Engine Task Queues conveniently allow you to transactionally schedule a task, so scheduling acts like a database operation inside the transaction. Thanks to this feature, we schedule the email if and only if MTBE is respected. If scheduling the email were equivalent to sending it, all our problems would be solved. Unfortunately, that’s where the problem is: scheduling the email is not the same as sending it. With this architecture, there’s really no telling how long the email will spend in the task queue before being delivered. Because of the nature of MTBE constraints, as time passes, emails that wouldn’t have violated MTBE before could suddenly become a violation (perhaps because we’ve just sent this customer another email). As a result, if emails stay on the task queue too long, it’s easy for them to violate MTBE again.
One concrete example of this problem is when there is something wrong with an ESP, causing tasks to be stuck in the task queue while constantly retrying for an extended period of time. In this case, as multiple MTBE windows pass, multiple emails to the same customer would pass our transactional MTBE check, and they would all be waiting in the task queue. Once the problem with the ESP resolves, we would suddenly bombard this unfortunate customer with a string of emails.
Our Current Approach
As we learned from past scaling junctures, and as we integrated with different ESPs with varying uptime SLAs, we’ve gone through several iterations of the MTBE constraint implementation. Currently we enforce it in 3 stages (illustrated below):
- When generating the list of customers from our Audience Engine, we cross reference our analytics database to see which customers haven’t received emails in the past MTBE time window and skip customers that have.
- This step remains the same as the previous section describes. Before scheduling an email delivery onto the task queue, transactionally check and update a timestamp on the customer. This timestamp indicates the time the email was scheduled.
- Finally, at the email delivery service, we do the same check against another timestamp on the customer object. This timestamp indicates the time the email was sent, and also we update the customer object after we send the email to enable retries.
There are caveats with each of these steps. Combined they vastly reduce the likelihood of MTBE violations. We’ve already talked about the problem with step 2 in the previous section. Let’s go through steps 1 and 3 in more detail.
Step 1: Enforcing the MTBE constraint at Audience generation is great because it saves all the unnecessary processing later. However, the analytics database doesn’t always contain the most up-to-date information (and it is not supposed to), so inevitably there will be some customers who were sent emails recently, but for which the deliver event hasn’t been inserted to the analytics database yet. These customers will incorrectly qualify for the generated list and go on to be processed by the email pipeline.
Step 3: This is a safeguard against the caveat of the second step in the previous section. The delivery service is the place where email delivery really happens. Checking the MTBE here enables us to minimize the time between the check and the delivery, and thus minimizing the likelihood of MTBE violations that materialize after the check. However, this step has its own limitations. Since delivering the email amounts to an HTTP call which is not a database operation, we cannot package this step in a database transaction. This has two implications:
- If the server instance were to fail after the delivery but before updating the customer’s timestamp, the task queue would automatically retry the task, leading to a double send.
- Two emails to the same customer could both pass the MTBE check if concurrently executed.
Both of these problems are very unlikely to happen, but sending a few millions of emails everyday means that these very unlikely things happen several times a day.
Real-world engineering can be hard and unforgiving. For a long time we operated with the simpler solution to this email frequency problem (described in the section “what we used to do”), until one of our ESP integrations had some unexpected downtime, causing some customers to receive multiple emails within a short span of time. In the meantime, these problems provide great learning opportunities. We learned from our past lessons, and built a better system more robust to a variety of failure scenarios.
Another lesson here is that getting some guidance from relevant literature is often a good idea. Knowing the theoretical limitations of exactly-once processing saved us from going down the rabbit hole of coming up with a perfect solution. Instead we iterated on the current system to make it reliable enough, while knowing there’s always more room for improvements. For example, one further improvement is that the concurrency issue mentioned in the third step can be solved by introducing a distributed locking mechanism.