Reliably processing Azure Service Bus topics with Azure Functions


At ASOS, we’re currently migrating some of our message and event processing applications from worker roles (within classic Azure Cloud Services) to Azure Functions. A significant benefit of using Azure Functions to process our messages is the billing model.

For example, with our current approach, we use worker roles to listen to subscriptions to Azure Service Bus topics. These worker roles are classes that inherit from RoleEntryPoint and behave similarly to console apps: they’re always running.

Let’s suppose we get one message in an hour. We’re billed for the full hour that worker role was running. With Azure Functions, we’d only have been billed for the time it takes for that one message to be processed — when it is received by the function.

Another benefit is a huge reduction in the code we have to maintain. Most of our queue listener apps simply listen to a Service Bus topic and call the appropriate internal API to keep downstream systems in sync within the ASOS microservices architecture.

With worker roles, we manage everything ourselves: queue listening, deserialization, error handling, retry policies and so on. This has resulted in some quite large applications that need to be maintained.

Some of our ServiceBusTrigger Azure Functions could look almost as simple as this example:

In this scenario, our function receives the BrokeredMessage in PeekLock mode. We can change the BrokeredMessage parameter to a string or to a class of our own, and the SDK will deserialize the message for us, but the underlying behaviour remains the same. The code within the Run method is then executed and, if no exception is thrown, .Complete() is called on the message.

Fantastic!

This simplicity, however, comes at a cost.

When an unhandled exception is thrown within the function, .Abandon() is called on the brokered message instead of .Complete().

.Abandon() doesn’t dead letter the message. Instead, it releases the lock that the function had on it and returns it to the queue, incrementing the DeliveryCount property on that message. The function will then receive this message again at some point, and will try again. Emphasis on ‘at some point’, because the timing is unpredictable: it could be instant or take a few seconds, depending on how the queue is configured, how many messages it holds and the scale-out settings of your function.

If it fails again, .Abandon() is called again, which increments the DeliveryCount property and returns the message to the queue.

This will continue until the MaxDeliveryCount is exceeded. Queues and subscriptions each have their own MaxDeliveryCount (the default is 10). Once this is exceeded, the message is finally sent to the dead letter queue.

There are a few problems with this approach. For example, what if the exception was thrown because our API returned a 503? The API may be unavailable right now, but we expect it to come back.

In this scenario, we’d like to delay the retry of the message for one minute.
Currently, the message will be retried up to 10 times before being sent to the dead letter queue. Those 10 executions could happen within a second, without any pause to let the API we’re calling recover.
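
To make the abandon-and-redeliver loop concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Service Bus SDK: `Message`, `Subscription`, `drain` and `always_unavailable` are hypothetical names invented for illustration, and the only figure taken from the text is the default MaxDeliveryCount of 10.

```python
from dataclasses import dataclass, field

MAX_DELIVERY_COUNT = 10  # the Service Bus default mentioned above


@dataclass
class Message:
    body: str
    delivery_count: int = 0


@dataclass
class Subscription:
    """Toy stand-in for a topic subscription plus its dead letter queue."""
    active: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)


def always_unavailable(msg: Message) -> None:
    """Hypothetical handler: the downstream API keeps answering 503."""
    raise RuntimeError("503 Service Unavailable")


def drain(sub: Subscription, handler) -> int:
    """Deliver messages until none are left; return the number of attempts."""
    attempts = 0
    while sub.active:
        msg = sub.active.pop(0)
        msg.delivery_count += 1  # every delivery bumps the count
        attempts += 1
        try:
            handler(msg)  # no exception: the message is completed (dropped)
        except Exception:
            if msg.delivery_count >= MAX_DELIVERY_COUNT:
                sub.dead_letter.append(msg)  # out of deliveries: dead letter
            else:
                sub.active.append(msg)  # abandon: straight back on the queue


    return attempts


sub = Subscription(active=[Message("order-1")])
attempts = drain(sub, always_unavailable)
print(attempts, len(sub.dead_letter))
```

Nothing in the loop waits between attempts, which is the point: all ten deliveries are used up as fast as the handler can fail, long before a struggling API has had time to recover.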

Another scenario is our API returning a 4xx error. The data we’re sending is bad and, because the message is immutable, it’s not going to get any better. We should dead letter that message straight away with a DeadLetterReason, so that we can log it or so that something else can pick it up and try again.

I naively tried this:

This does send the message to the dead letter queue. However, the WebJobs SDK is still managing the lifecycle of the message and, since the function didn’t throw an exception, it still tries to call .Complete() on the message. This results in the following exception:

The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue.

Under the hood, the Azure Function Service Bus Trigger does something like this rudimentary example:

  1. The runtime receives the message from the topic.
  2. It then runs your code, passing the message as a parameter.
  3. If there’s an exception, it calls .Abandon() on the message.
  4. Otherwise, it assumes success and marks the message as complete.
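
The steps above, and the reason the naive dead-letter attempt fails, can be modelled with a small Python sketch. This is an illustrative model only, not the WebJobs SDK: `Message`, `dispatch`, `LockLostError` and `naive_handler` are hypothetical names, and the `auto_complete` switch stands in for the host.json setting covered further down.

```python
class LockLostError(Exception):
    """Stands in for the 'lock supplied is invalid' error."""


class Message:
    def __init__(self, body):
        self.body = body
        self.state = "locked"  # locked -> completed | abandoned | dead-lettered
        self.reason = None

    def _settle(self, new_state):
        # A message can only be settled once, while its lock is still held.
        if self.state != "locked":
            raise LockLostError("The lock supplied is invalid.")
        self.state = new_state

    def complete(self):
        self._settle("completed")

    def abandon(self):
        self._settle("abandoned")

    def dead_letter(self, reason):
        self._settle("dead-lettered")
        self.reason = reason


def dispatch(msg, handler, auto_complete=True):
    """Steps 2 to 4: run the handler, then settle the message."""
    try:
        handler(msg)
    except Exception:
        msg.abandon()  # step 3: any exception means abandon
        return
    if auto_complete:
        msg.complete()  # step 4: no exception is assumed to be success


def naive_handler(msg):
    # Hypothetical handler that settles the message itself.
    msg.dead_letter("BadRequest")


msg = Message("bad payload")
try:
    dispatch(msg, naive_handler)
    outcome = "ok"
except LockLostError as err:
    outcome = f"error: {err}"
print(msg.state, "|", outcome)
```

The handler has already settled the message, so the runtime’s own attempt to complete it is the call that fails. Calling `dispatch(msg, naive_handler, auto_complete=False)` on a fresh message skips that second settle and leaves the message dead-lettered without an error.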

This might work in very simple cases, but, for our case, we needed to have greater control over the behaviour of the message — particularly with retrying. Fortunately, this was addressed in the following GitHub issue and this (now completed) pull request.

This allows us to add autoComplete: false to our host.json file, which alters the way messages are handled by the function runtime.
Should the function complete successfully (i.e. there were no unhandled exceptions thrown) then the message will not automatically have .Complete() called on it.
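
As a rough guide, the setting might look like the fragment below. This assumes the version 1.x host.json layout, where Service Bus options sit under a top-level serviceBus key; other runtime versions nest the setting differently, so check the host.json reference for the version you are running.

```json
{
  "serviceBus": {
    "autoComplete": false
  }
}
```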

This does mean we must manage message lifetime ourselves.

Our previous example will now work, allowing us to .DeadLetterAsync() a message:

Retrying messages at some point in the future

Taking things a step further, we can also retry messages at some point in the future by cloning and resending to the topic.

Adding an output binding to our function makes this incredibly easy:

In this example, we first call apiClient.PostAsync(thing);. If that’s a success, the message is marked as complete (this is important). If we receive a 400 Bad Request from our API, we dead letter that message.

If we receive a 503 Service Unavailable from our API, we retry the message in 30 seconds.

We do this by cloning the original message, setting the ScheduledEnqueueTimeUtc of that cloned message to 30 seconds from now, and adding it to the output binding, which sends it back to the topic. We track the number of retries using a custom ‘RetryCount’ property on the message and, if it exceeds 10, we simply dead letter it.
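
The decision logic described above can be sketched in Python as a pure function over an in-memory message. This is a model under stated assumptions, not the function from this post: `Message`, `handle`, the reason strings and the `state` field are hypothetical, the exact retry cut-off is my reading of ‘exceeds 10’, and treating every 4xx like a 400 and raising on any other status are choices made only for the sketch.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_RETRIES = 10
RETRY_DELAY = timedelta(seconds=30)


@dataclass
class Message:
    body: str
    properties: dict = field(default_factory=dict)
    scheduled_enqueue_time_utc: Optional[datetime] = None
    state: str = "locked"
    dead_letter_reason: Optional[str] = None


def handle(msg, status_code, output, now):
    """Settle msg for the given API status; may append a delayed clone to output."""
    if 200 <= status_code < 300:
        msg.state = "completed"
    elif status_code == 503:
        next_retry = msg.properties.get("RetryCount", 0) + 1
        if next_retry > MAX_RETRIES:
            msg.state = "dead-lettered"
            msg.dead_letter_reason = "RetryLimitExceeded"
            return
        clone = replace(
            msg,
            properties={**msg.properties, "RetryCount": next_retry},
            scheduled_enqueue_time_utc=now + RETRY_DELAY,
            state="scheduled",
        )
        output.append(clone)  # resend the clone for later delivery
        msg.state = "completed"  # the original must still be completed
    elif 400 <= status_code < 500:
        msg.state = "dead-lettered"
        msg.dead_letter_reason = f"Http{status_code}"
    else:
        # Assumption: anything else surfaces as an exception.
        raise RuntimeError(f"unexpected status {status_code}")


now = datetime(2018, 5, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary fixed clock
output = []
original = Message("order-1")
handle(original, 503, output, now)
print(original.state, output[0].properties, output[0].scheduled_enqueue_time_utc)
```

Passing the clock in as `now` keeps the sketch deterministic. A message that already carries RetryCount 10 is dead-lettered on the next 503 instead of being cloned again.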

Note: It’s important to complete your messages
Note my emphasis on msg.CompleteAsync(); for both the successful API call and the retry. Because we’ve set autoComplete: false, the runtime is no longer managing the lifetime of our messages. While testing, I forgot to add this line, which meant the original message wasn’t completed. It wasn’t abandoned either, so once its lock expired it went back on the queue, to be retried again and again and again. The queue had over 40 million messages in it when I noticed. Whoops!

Conclusion

Azure Functions makes processing queue messages extremely easy, especially if your scenario is a simple in / out flow with no error handling.

Using the new autoComplete: false setting allows greater control over the lifetime of the messages received.

We can use a ‘clone and resend’ pattern to offer scheduled retries of messages.


My name is Alex Brown. I’ve been a Consultant Software Engineer at ASOS.com for just over 18 months. I love writing well-thought-out code that does what it’s supposed to do.

When I’m not gazing into my laptop, I enjoy throwing a few weights around in the gym and reading motivational books — most recently Miracle Morning.