Making Retries Safe Through Intelligent Error Handling

Kirolos Shahat
Bluecore Engineering
6 min read · Feb 27, 2020

We previously discussed how Bluecore solves the problem of duplicate email deliveries. In summary, we designed a system that intentionally drops any email addressed to a customer who has already received one within the past hour. This is called Minimum Time Between Emails (MTBE).

Today, Bluecore is scaling and becoming much more cost efficient. To support this, the previous MTBE system needed to be turned into a microservice (the MTBE Service), and our codebase needed to be restructured to use it. That restructuring required us to take advantage of the two exception types (retryable and non-retryable) commonly raised by our platform, and it surfaced misclassified exceptions, which are common, as well as multiple deliveries caused by improper retry logic. The next step was to analyze our exception handling and establish a Bluecore-specific best practice for raising exceptions and safely choosing when to retry and when not to. To get the full story, read on!

Avoiding Duplicate Email Deliveries: Microservice Edition

Our original process of email duplication handling, as defined by our MTBE time window, met our requirements for the overall functionality of the system. However, this solution became a hassle: the underlying implementation relied on Cloud Datastore, which has a non-negligible cost for reads/writes, as well as high latency for something we expect to check fairly often. We chose to migrate the data from Datastore to Bigtable and to create a microservice so that future infrastructure investments could be easily accomplished.

The decision to make this deduplication method a service using Bigtable forced us to think critically about the APIs we intend to support. A critical requirement was to be able to retry a send in the case of failures. The end result was two relatively simple APIs:

  1. CheckAndSet
  • Parameters: Customer and a Unique ID.
  • Function: Request permission to send. If I have permission, update the internal tracking accordingly.

  2. Unset
  • Parameters: Customer and a Unique ID.
  • Function: Request to undo the updates made by CheckAndSet.

The Customer parameter allows us to identify the recipient of the email and the Unique ID allows us to make these API calls idempotent, which is important for safe retries.
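To make the contract concrete, here is a minimal in-memory sketch of the two APIs. This is purely illustrative: the real service persists its state in Bigtable, and the one-hour window, function names, and storage shape here are assumptions, not the production implementation. Note how the Unique ID makes a retried CheckAndSet idempotent.

```python
import time

MTBE_SECONDS = 3600  # illustrative one-hour minimum time between emails

# Hypothetical in-memory store; the real service keeps this in Bigtable.
_last_send = {}  # customer -> (timestamp, unique_id)


def check_and_set(customer, unique_id):
    """Return True if a send is allowed, recording it if so."""
    now = time.time()
    entry = _last_send.get(customer)
    if entry is not None:
        ts, uid = entry
        # Idempotency: a retry of the same send is always granted.
        if uid == unique_id:
            return True
        # A different send inside the MTBE window is refused.
        if now - ts < MTBE_SECONDS:
            return False
    _last_send[customer] = (now, unique_id)
    return True


def unset(customer, unique_id):
    """Undo a CheckAndSet, but only for the matching unique_id."""
    entry = _last_send.get(customer)
    if entry is not None and entry[1] == unique_id:
        del _last_send[customer]
```

Because both calls are keyed on the Unique ID, a client that times out and retries the same request cannot corrupt the window, and an Unset can never release a window it did not claim.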

Case Studies: Handling Success and Failure Scenarios

For the purposes of this post, we will assume that the MTBE service always allows the first request for a customer to succeed.

Success

In a perfect world, only the CheckAndSet API would be used and never the Unset as shown in the following figure:

Success Scenario

The system would then continue to process requests as normal.

Retryable Errors

In our imperfect world, errors happen and we have to handle them elegantly when possible. So what happens if the email service raises a retryable exception? If it's retryable, we can assume that no email has been sent, and it is therefore safe to call the Unset API of the MTBE service. This effectively adds one Unset call after the failed email service call, as shown in the following diagram:

Retryable Error Scenario

Then, retrying can be delegated to the Google Task Queues, which will reschedule the work for another time.
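In production the rescheduling is handled by Google Task Queues, but the pattern they implement can be sketched locally. The loop below is a stand-in, not the queue's actual behavior: the exception name, attempt count, and jittered exponential backoff are all illustrative assumptions.

```python
import random
import time


class RetryableException(Exception):
    """Raised when the send failed before any email was delivered."""


def retry_with_backoff(task, max_attempts=5, base_delay=1.0):
    """Stand-in for Task Queue rescheduling.

    `task` is any zero-argument callable. In production the queue itself
    re-runs the task later; this loop only illustrates the retry shape.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except RetryableException:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Jittered exponential backoff before the next attempt.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The key property is that the task is only re-run after it has raised a retryable error, i.e. after it has already released its MTBE window via Unset.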

Non-Retryable Errors

Before getting into how we handle this case, we should define what a non-retryable error means. Let’s go through a couple of examples and see what the ideal response is:

  1. Some non-retryable errors, such as an error with credentials, are known to have not sent an email. These kinds of errors should not block email sends attempted within an MTBE window, so we should design our software to account for that. The handling looks similar to the retryable error case, except that we mark the task as completed, since retrying doesn't help.
  2. Sometimes an error will give you very little information about the final state of the send, which is common in distributed systems. In the case where multiple different outcomes can occur from the same error source, we want to treat it as if an email were sent so that we don’t risk sending the same email multiple times. Practically, this looks similar to the success case in that we do not call the Unset API post-failure and just move on.

The following diagram restates our conclusion about these different error cases and shows how Bluecore handles them.

Retryable and Non-Retryable Solved

Error Interpretation

Different services tend to report errors in different ways, and often these errors aren't reported in a form the rest of the software stack understands. Knowing this, one can infer the need for some sort of error bucketing layer that converts these service errors into ones our software knows and understands. In our case, we need to convert them into either the retryable case or one of the two non-retryable cases. This can get complex because errors can be added, removed, or changed, and if we no longer recognize an error we have to be able to handle that properly too. Based on this knowledge, we map the known errors to the behavior we define as appropriate.

Removing Surprises From Our Software

Errors can change and become unrecognizable, so when we classify them, we should handle the unknown ones because we want to be resilient to unexpected changes. We are therefore presented with two options:

  • Make unknown errors retryable OR
  • Make them one of the two non-retryable cases

These two options essentially ask the question: do we want to call the Unset API in cases of unknown errors? Since the unknown errors can happen at any point in an email service pipeline, we must assume the worst and say that it happened post-email delivery. Effectively, we build a whitelist of errors that are allowed to call the Unset API and everything else will by default not call that API.
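One way to sketch such a whitelist is a classifier that buckets provider errors and defaults everything unknown to the most conservative bucket. The error types and bucket names below are hypothetical; a real list would come from the email provider's documented failure modes.

```python
# Hypothetical upstream errors from an email provider.
class CredentialError(Exception):
    """Bad credentials: the provider rejected us before sending anything."""

class RateLimitError(Exception):
    """Throttled: nothing was sent; trying again later may succeed."""

class ConnectionDropped(Exception):
    """Socket died mid-request: the email may or may not have gone out."""


# Whitelist: only errors known to occur *before* delivery may lead to Unset.
RETRYABLE = (RateLimitError,)
SAFE_TO_UNSET = (CredentialError, RateLimitError)


def classify(exc):
    """Bucket a provider error; unknown errors get the safest default."""
    if isinstance(exc, RETRYABLE):
        return "retryable"
    if isinstance(exc, SAFE_TO_UNSET):
        return "non_retryable_unset"
    # Unknown or unlisted errors: assume the email may already be delivered,
    # so never release the MTBE window for them.
    return "non_retryable_no_unset"
```

Because the default branch handles everything not explicitly whitelisted, a new or renamed provider error degrades to "possibly delivered" rather than to a duplicate send.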

Pseudo-Code

This behavior is simple enough to capture in a short Python snippet:

class RetryableException(Exception):
    """The send failed before any email left the service; safe to retry."""


class NonRetryableException(Exception):
    """The send failed permanently; safe_to_unset records whether we know
    for certain that no email was delivered."""

    def __init__(self, safe_to_unset):
        super().__init__()
        self.safe_to_unset = safe_to_unset


def email_service_send(customer, email):
    try:
        attempt_email_send(customer, email)
    except KnownExceptions as known_exc:
        if is_retryable_exc(known_exc):
            raise RetryableException()
        elif can_unset_mtbe_from_exc(known_exc):
            raise NonRetryableException(safe_to_unset=True)
        raise NonRetryableException(safe_to_unset=False)
    except Exception:
        # Unknown errors: assume the worst, i.e. the email may have been sent.
        raise NonRetryableException(safe_to_unset=False)


def send_email(customer, email, unique_id):
    can_send = mtbe_service_check_and_set(customer, unique_id)
    if not can_send:
        return
    try:
        email_service_send(customer, email)
    except RetryableException:
        # No email was sent: release the MTBE window and let the
        # task queue reschedule this send.
        mtbe_service_unset(customer, unique_id)
        raise
    except NonRetryableException as non_retryable_exc:
        if non_retryable_exc.safe_to_unset:
            mtbe_service_unset(customer, unique_id)

The email_service_send function internally converts the errors into a form that the rest of the pipeline will understand and therefore handle properly.
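To see the flow end to end, here is a self-contained harness in which every service is a stub and the email outcomes are scripted (all names and outcomes here are illustrative, not Bluecore's actual services). It demonstrates that Unset fires only for retryable failures and for non-retryable failures known to precede delivery.

```python
class RetryableException(Exception):
    pass


class NonRetryableException(Exception):
    def __init__(self, safe_to_unset):
        super().__init__()
        self.safe_to_unset = safe_to_unset


unset_log = []


def mtbe_service_check_and_set(customer, unique_id):
    return True  # simplified: always grant permission


def mtbe_service_unset(customer, unique_id):
    unset_log.append(unique_id)


# Scripted outcomes standing in for the real email service.
OUTCOMES = {
    "retryable": RetryableException(),
    "nonretry_unset": NonRetryableException(safe_to_unset=True),
    "nonretry_keep": NonRetryableException(safe_to_unset=False),
    "success": None,
}


def email_service_send(customer, email):
    exc = OUTCOMES[email]
    if exc is not None:
        raise exc


def send_email(customer, email, unique_id):
    if not mtbe_service_check_and_set(customer, unique_id):
        return
    try:
        email_service_send(customer, email)
    except RetryableException:
        mtbe_service_unset(customer, unique_id)
        raise  # let the task queue reschedule
    except NonRetryableException as exc:
        if exc.safe_to_unset:
            mtbe_service_unset(customer, unique_id)
```

Running all four scripted outcomes through `send_email` shows that only the retryable failure and the delivery-guaranteed-not-to-have-happened failure release their MTBE windows.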

Trade-offs

Our solution requires someone to manage the known errors and keep them up-to-date, which may not be the best use of time. But the biggest win of this approach is that we can take retroactive action to correct a wrong decision made within our pipeline. If we instead treated unknown errors as retryable and an issue occurred, there would be no way to revert it, since the delivery would already have been made. The Bluecore recovery story is much better with the whitelist approach, which is one of the driving reasons we chose it.

Takeaways

Error handling is hard and can lead to a lot of weird behavior if handled improperly. Some things you can take away from this post are:

  • When integrating with a service, classify the errors it returns. Each falls into one of three buckets: successful; failed and guaranteed not to have processed; or failed but possibly processed.
  • Define what you want to happen when each of these errors occurs.
  • Have a safe default for ones that you are just not confident about.
  • Implement!

This process will then help ensure that your error handling is as safe as possible.
