Handling failed webhooks with Exponential Backoff

Vinicius Abreu
Wellhub Tech Team (formerly Gympass)
Apr 7, 2021


Here at Gympass, one of our main concerns is how reliable and responsive our applications are, always taking into account how they could impact our partners and their business. Having a mature and stable solution is key to promoting the responsiveness that our partners deserve, and to guaranteeing that they are not harmed, even in failure scenarios.

In this post, I would like to introduce an approach we use at Gympass to handle failed webhook requests to our partners: a smarter way to retry these notifications based on an exponentially increasing interval, commonly known as the Exponential Backoff algorithm. But before we begin, let’s get some background on how we implemented this strategy and the scenario in which it became handy for us.

A bit more context: Notifying partners via webhooks

Every time some action takes place in our ecosystem, our partners are notified about it via webhooks. For instance, a webhook is triggered when a class is booked, or when a Gympasser wishes to work out in a specific gym and performs a check-in, which grants them access to the gym. Gympass’ backend listens to certain events that represent one of those actions and, after locating all the needed information about the given partner’s webhooks, makes an HTTP request to the configured endpoints.

The engine responsible for triggering these messages is a Scala application built on top of Akka, a toolkit for building highly concurrent, distributed, and resilient message-driven applications, which implements the Actor Model on the JVM. The application also works alongside RabbitMQ, where all the events it reacts to are published and consumed. If you are new to Akka and/or RabbitMQ, you can check out their docs at the links!

So far so good, but what happens when a partner’s API is down? What happens when there is some network issue? How is the partner notified if we face issues like these? Now that we have raised some very clear and present concerns, it is time to take a look at how we used to handle these failure scenarios in the past.

Handling failed webhooks: The hasty approach

The first solution already had a retry strategy in place. By default, failed notifications were retried up to three times in a row, but this was not effective, since the retries had no considerable interval between them. A failed message went to a retry queue, was processed again and, most of the time, failed again and again, until all attempts were exhausted and the message was finally directed to a DLQ (dead-letter queue). This means that if a partner’s system had an outage, all retries would fail without giving it a chance to recover in time to receive the notification.

This approach has some issues:

  • Since all retries are performed in a row, the time window for these retries is really short;
  • Short time windows for retrying a webhook could prove ineffective, since the partner’s API may need more time to recover than we allow before reprocessing the messages;
  • Increasing the number of retries would not be effective either, and would just result in a waste of infrastructure resources;
  • This could end with partners not being notified, which represents a business issue.

It looks like retrying failed messages in a row is not the best solution to this problem. What if we introduced a time window between attempts, giving the partner’s API some time to recover and then receive the notification successfully? Even better, what if this time window were not fixed, but based on a value that increases exponentially, leaving room for smarter retries? Let’s take a look at how we can achieve this.

Introducing a backoff strategy

The main point of the backoff strategy is to distribute failed messages across different queues, where each one has a default TTL (time to live, i.e. how long a message will sit in a queue) value for its messages. Obviously, before a message is retried, it has to arrive through an origin queue, which is our main queue for webhooks. If all attempts to trigger a webhook fail, we send the failed message to a final queue that holds messages for further investigation.

Now we come to the exponential side of the moon. Each queue’s TTL receives a value calculated with the following exponential backoff formula:

{interval} + {current_retry_attempt} ^ 4

Which is represented in our codebase as a small code block.
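A minimal Scala sketch of that calculation could look like this (the 5-second base interval and the names below are assumptions for illustration, not our exact code):

```scala
import scala.concurrent.duration._

object RetryBackoff {
  // Base interval added on top of every attempt. The real value is
  // configuration-driven; 5 seconds is just an assumption for illustration.
  private val BaseInterval: FiniteDuration = 5.seconds

  // {interval} + {current_retry_attempt} ^ 4
  def ttlFor(attempt: Int): FiniteDuration =
    BaseInterval + math.pow(attempt.toDouble, 4).toLong.seconds

  // RabbitMQ expects the x-message-ttl value in milliseconds.
  def ttlMillisFor(attempt: Int): Long = ttlFor(attempt).toMillis
}
```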

If we apply the formula above to generate 10 TTL values, we will end up with:
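Assuming the illustrative 5-second base interval from the sketch above and reading the result in seconds: 6, 21, 86, 261, 630, 1,301, 2,406, 4,101, 6,566 and 10,005 seconds, the last one being roughly 2.8 hours.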

Note that for each attempt, the next retry value exponentially increases, giving the desired effect. On each failed attempt, the next one will take a little longer to be processed.

Notice that we have now gathered a bit more information about the solution:

A main queue, which delivers an event that will trigger a webhook. We can call it main.webhook.queue, which is bound to the main.webhook.exchange.

N retry queues, which could be called retry.webhook.queue-M (where M is the attempt number, ranging from 1 to N), bound to a retry exchange called retry.webhook.exchange. In our case, these queues were manually created beforehand.

An “abandon queue”, which holds messages related to webhooks that could not be delivered and failed all attempts. We can call it abandon.webhook.queue, and it was also already in place.
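Even though these queues and exchanges were created by hand in our case, the equivalent declarations in code help visualize the topology. Here is a rough sketch using the plain RabbitMQ Java client from Scala (the routing key and the use of this particular client here are assumptions):

```scala
import com.rabbitmq.client.ConnectionFactory

object WebhookTopology {
  def declare(): Unit = {
    val connection = new ConnectionFactory().newConnection()
    val channel    = connection.createChannel()

    // Main exchange and queue: events that should trigger a webhook land here.
    channel.exchangeDeclare("main.webhook.exchange", "direct", true)
    channel.queueDeclare("main.webhook.queue", true, false, false, null)
    channel.queueBind("main.webhook.queue", "main.webhook.exchange", "webhook")

    // Retry exchange sitting in front of the N retry queues (each retry queue is
    // declared separately with its own TTL and dead-letter arguments).
    channel.exchangeDeclare("retry.webhook.exchange", "direct", true)

    // Final destination for messages that failed every attempt.
    channel.queueDeclare("abandon.webhook.queue", true, false, false, null)

    channel.close()
    connection.close()
  }
}
```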

Let’s put all this information together and break the flow down with more details.

A Scala consumer listens to events published by some other application on the main exchange, which routes the message to main.webhook.queue. The application locates the webhook information in a database based on the incoming event data, such as the URI, the webhook secret and so on. The consumer then sends an HTTP request to the located endpoint.
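A sketch of that notification step might look like the following, using akka-http as the HTTP client (the post does not show this code, so the client choice, the WebhookConfig shape and the header name are all assumptions):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.http.scaladsl.model.headers.RawHeader
import scala.concurrent.Future

// Hypothetical shape of the webhook data located in the database.
final case class WebhookConfig(uri: String, secret: String)

object WebhookNotifier {
  implicit val system: ActorSystem = ActorSystem("webhooks")

  // Sends the event payload to the partner's configured endpoint.
  def notifyPartner(webhook: WebhookConfig, payload: String): Future[HttpResponse] =
    Http().singleRequest(
      HttpRequest(
        method  = HttpMethods.POST,
        uri     = webhook.uri,
        headers = List(RawHeader("X-Webhook-Secret", webhook.secret)), // header name is an assumption
        entity  = HttpEntity(ContentTypes.`application/json`, payload)
      )
    )
}
```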

The performed request returns an error, and now the retry flow itself starts. The message receives an x-retry header with the value 1, which works as a counter for the retry flow (for further attempts, this counter is incremented, and subsequent messages also carry the header). The message is then published to the retry.webhook.exchange exchange and routed to retry.webhook.queue-1; after the queue TTL expires, it is routed back to main.webhook.exchange and consumed again by the Scala application.
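A rough sketch of that re-publish step, again with the plain RabbitMQ Java client (everything except the x-retry header name and the exchange name is an assumption):

```scala
import com.rabbitmq.client.{AMQP, Channel}
import scala.jdk.CollectionConverters._

object RetryPublisher {
  // Re-publishes a failed webhook message to the retry exchange, carrying an
  // incremented x-retry counter so we know which attempt comes next.
  def publishForRetry(channel: Channel, body: Array[Byte], retriesSoFar: Int): Unit = {
    val attempt = retriesSoFar + 1
    val props = new AMQP.BasicProperties.Builder()
      .headers(Map[String, AnyRef]("x-retry" -> Int.box(attempt)).asJava)
      .build()

    // The routing key format is an assumption; it just has to reach retry.webhook.queue-{attempt}.
    channel.basicPublish("retry.webhook.exchange", s"retry-$attempt", props, body)
  }
}
```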

The main point in this flow is how a message is automatically routed back to the main exchange. Let’s take a look at some arguments that must be configured on each retry queue:

x-message-ttl: Holds the TTL for all messages in this queue. Its value comes from the TTL already calculated with the exponential formula.

x-dead-letter-exchange: The exchange to which expired messages will be routed, in this case main.webhook.exchange.

x-dead-letter-routing-key: The routing key used for retriable messages.
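Putting it together, declaring one of the retry queues might look roughly like this, reusing the ttlMillisFor helper sketched earlier (the routing keys are assumptions):

```scala
import com.rabbitmq.client.Channel
import scala.jdk.CollectionConverters._

object RetryQueues {
  // Declares retry.webhook.queue-{attempt} so that, once a message's TTL expires,
  // RabbitMQ dead-letters it back to the main exchange automatically.
  def declareRetryQueue(channel: Channel, attempt: Int): Unit = {
    val args = Map[String, AnyRef](
      "x-message-ttl"             -> Long.box(RetryBackoff.ttlMillisFor(attempt)),
      "x-dead-letter-exchange"    -> "main.webhook.exchange",
      "x-dead-letter-routing-key" -> "webhook" // assumed key that routes back to main.webhook.queue
    ).asJava

    channel.queueDeclare(s"retry.webhook.queue-$attempt", true, false, false, args)
    channel.queueBind(s"retry.webhook.queue-$attempt", "retry.webhook.exchange", s"retry-$attempt")
  }
}
```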

The application retries the notification, the partner responds with an error again, and the message is now routed to retry.webhook.queue-2 (remember, this count is controlled by the x-retry header present on the message itself) through the retry exchange. This process continues until the failed message arrives in the last queue, retry.webhook.queue-10, which sends the message to abandon.webhook.queue.
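A tiny helper capturing that progression could look like this (the 10-attempt limit matches the flow above; the names are assumptions):

```scala
object RetryRouting {
  private val MaxAttempts = 10

  // Decides where a failed message goes next, based on the x-retry counter it
  // already carries: either another retry queue or the abandon queue.
  def nextQueue(retriesSoFar: Int): String =
    if (retriesSoFar >= MaxAttempts) "abandon.webhook.queue"
    else s"retry.webhook.queue-${retriesSoFar + 1}"
}
```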

The following image represents the entire flow in a graphical way:

Pretty cool, isn’t it? This way, we have a better strategy for handling failed messages, and we give our partner’s API a safe margin to recover and successfully receive the notification.

Before we finish, we need to define in which scenarios we should perform a retry. The table below has some rules describing when we should and when we shouldn’t retry.
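As an illustration of the kind of rules such a table typically contains (an assumption on my part, not necessarily our exact rules), the decision might boil down to retrying transient failures and skipping client-side errors:

```scala
object RetryPolicy {
  // A sketch of the retry decision: transient failures are worth retrying,
  // client-side errors (wrong URI, wrong secret, ...) are not.
  def shouldRetry(statusCode: Option[Int]): Boolean = statusCode match {
    case None                      => true  // no response at all: connection error or timeout
    case Some(429)                 => true  // throttled by the partner, try again later
    case Some(code) if code >= 500 => true  // server-side failure, the API may recover
    case Some(code) if code >= 400 => false // client error, retrying will not help
    case _                         => false // 2xx/3xx responses are not failures
  }
}
```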

In our scenario, we noticed that when messages reach the seventh queue or higher, they are usually cases where there is some issue with the webhook information, like a wrong URI or secret. This could be surfaced in some kind of dashboard that gives us the visibility to act more quickly, notifying our partners that there is something wrong with their webhook information so that it can be fixed.

I hope this solution gives you a good alternative for handling failure scenarios like this one.

Thanks for reading!
