Building reliable workflows using Serverless

I previously worked on an IoT project where we had to synchronize devices between Azure IoT Hub and external systems.

The solution we designed supported either system acting as the device master (Azure or the external system). The Azure-as-master scenario relies on listening for device created/deleted events raised by IoT Hub. The external-system-as-master scenario is implemented by polling from a timer trigger. The complete solution can be found here: https://github.com/fbeltrao/IoTHubDeviceSynchronizer

Building in a serverless architecture

This type of problem fits serverless well, since our application only has to execute when an event is raised (an IoT Hub event or a timer). Serverless also optimizes costs, because we are billed only for actual code execution time (in Azure, the first 1 million executions per month are free). Moreover, if many devices are created in a short time we don't have to worry about scaling the application; Azure takes care of that.
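
To make the two scenarios concrete, here is a minimal sketch of the two entry points, one per scenario. The function names, the Event Grid binding and the 5-minute polling schedule are illustrative, not taken from the actual project:

using Microsoft.Azure.EventGrid.Models;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class DeviceSyncTriggers
{
    // Azure as master: react to device created/deleted events that
    // IoT Hub publishes through Event Grid.
    [FunctionName("OnDeviceLifecycleEvent")]
    public static void OnDeviceLifecycleEvent(
        [EventGridTrigger] EventGridEvent eventGridEvent,
        ILogger log)
    {
        log.LogInformation($"{eventGridEvent.EventType}: {eventGridEvent.Subject}");
        // Forward the device change to the external system here.
    }

    // External system as master: poll for changes on a schedule
    // (every 5 minutes in this sketch).
    [FunctionName("PollExternalSystem")]
    public static void PollExternalSystem(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        ILogger log)
    {
        log.LogInformation("Polling external system for device changes");
        // Compare the external device list with the IoT Hub registry here.
    }
}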

Implementing the solution with resilience means being able to handle scenarios where the external system is unresponsive. Using the Polly library is a well-known approach to dealing with external dependencies: we define retry policies and the library takes care of following them while trying to get a valid response from the external system.
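
As an illustration of that approach inside a regular function, here is a minimal sketch; the external endpoint, the CreateDeviceAsync helper and the policy values are made up for the example:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

public static class ExternalSystemClient
{
    private static readonly HttpClient http = new HttpClient();

    // Retry up to 5 times on transient HTTP failures, waiting 2^attempt
    // seconds between attempts (exponential back-off), all while the
    // function stays running.
    private static readonly IAsyncPolicy retryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(
            retryCount: 5,
            sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Hypothetical call that registers a device in the external system.
    public static Task<HttpResponseMessage> CreateDeviceAsync(string deviceId) =>
        retryPolicy.ExecuteAsync(() =>
            http.PostAsync($"https://external.example.com/devices/{deviceId}", null));
}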

There are two main problems with this approach:

  1. The time the function spends waiting between retries costs us money.
  2. If the total retry time is longer than what an Azure Function is allowed to run for (5 minutes by default), the function will be aborted.

Durable Functions to the rescue

A way to deal with these problems is Azure Durable Functions, which allow the creation of workflows where state is maintained by the runtime. To explain with an example, imagine we have to build a serverless solution to identify which URLs load fastest. The code below shows a silly implementation using Durable Functions (it does not prove anything about web site loading speed):
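
A minimal sketch of such an implementation with the Durable Functions 1.x C# API; the URL list and the timing logic are illustrative:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class SiteSpeedFunctions
{
    private static readonly HttpClient http = new HttpClient();

    [FunctionName("RunOrchestrator")]
    public static async Task<List<string>> RunOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        var urls = new[]
        {
            "https://www.google.com",
            "https://www.bing.com",
            "https://stackoverflow.com/",
        };

        // Each CallActivityAsync checkpoints the orchestration: the first
        // time a call is reached the orchestrator stops, and it is replayed
        // once the activity result is available.
        var measurements = new List<(string Url, double ElapsedMs)>();
        foreach (var url in urls)
        {
            var elapsedMs = await context.CallActivityAsync<double>("MeasureSiteSpeed", url);
            measurements.Add((url, elapsedMs));
        }

        // Rank the URLs from fastest to slowest.
        return measurements
            .OrderBy(m => m.ElapsedMs)
            .Select((m, i) => $"{i + 1}. {m.Url} {m.ElapsedMs}ms")
            .ToList();
    }

    [FunctionName("MeasureSiteSpeed")]
    public static async Task<double> MeasureSiteSpeed(
        [ActivityTrigger] string url,
        ILogger log)
    {
        log.LogInformation($"Measure site speed for {url}");
        var stopwatch = Stopwatch.StartNew();
        await http.GetAsync(url);
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }
}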

The “RunOrchestrator” function will run multiple times, stopping each time a context.CallActivityAsync call is reached for the first time. The context.CallActivityAsync method checks whether the activity has already been executed; if not, it queues the activity and stops the orchestration. The Durable Functions runtime picks up and executes the queued activities in the background, rescheduling the orchestration once the pending items have been processed. Debugging the function might help in understanding it:

Look at the output window: it demonstrates that each activity is executed exactly once and that the final result is rendered only after all activities have completed.

Measure site speed for https://www.google.com
...
Measure site speed for https://www.bing.com
...
Measure site speed for https://stackoverflow.com/
...
1. https://www.google.com 194,3294ms
2. https://www.bing.com 202,1277ms
3. https://stackoverflow.com/ 351,5166ms

Durable Functions can do much more than that, and I encourage you to read further here.

Retrying with Durable Functions

Durable Functions also provide a way to call an activity with retries, so upon failure the runtime will schedule retries according to the defined policy. We don't pay for the time between retries, and each activity execution gets its own 5-minute window before the Azure Functions runtime aborts it.

The example below demonstrates calling an external system with the following retry policy:

  • Try at most 100 times
  • Try for at most 1 day
  • Upon the first failure, retry after 1 second
  • Subsequent failures will have a back-off retry coefficient of 2
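
A minimal sketch of such a call, again with the Durable Functions 1.x API; the orchestrator and the CallExternalSystem activity name are illustrative:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class SyncDeviceOrchestrator
{
    [FunctionName("SyncDeviceOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        var deviceId = context.GetInput<string>();

        // Try at most 100 times, for at most 1 day, starting with a
        // 1 second delay and doubling it after every failure.
        var retryOptions = new RetryOptions(
            firstRetryInterval: TimeSpan.FromSeconds(1),
            maxNumberOfAttempts: 100)
        {
            BackoffCoefficient = 2,
            RetryTimeout = TimeSpan.FromDays(1),
        };

        await context.CallActivityWithRetryAsync("CallExternalSystem", retryOptions, deviceId);
    }
}

With a back-off coefficient of 2 and a first retry interval of 1 second, the waits grow as 1s, 2s, 4s, 8s and so on, until either 100 attempts have been made or 1 day has passed, and none of that waiting time is billed as function execution.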