The Road to (Auto) Recovery

James Hicks
Relaymed Engineering
5 min read · Dec 9, 2021

One of the largest projects we’ve been working on involves us receiving COVID-19 tests from medical devices and reporting them on behalf of our customers to the various US states.

While all of the US states are required to report to the CDC, the way that each state receives information varies. For the sake of this article I’m using a standard SFTP example, though in reality we have used a mixture of methods. HL7 is a medical format for data communication, and while the specification is quite strict, each state still has its own intricacies and requirements.

Due to all of these circumstances we needed to build a system that was not only resilient, but could also be customised for each state.

Here is our story of where we started, and our eventual road to auto recovery.

The skateboard

We often talk of first building a ‘skateboard’ when creating a new service, something inspired by this article, and this project was no different.

We originally started work on two states, which for this example we will call New York and Florida. We needed to do some generic data manipulation for all states, while each individual state would also require its own specific configuration to create the final HL7 message. Once the message was created, it needed to be sent via SFTP to an external folder that the state had set up for us.

To begin with, there was an incoming POST request to an Azure Web App that contained the medical test information we needed to report to the state. This app would do the generic data manipulation, pass the result to a state-specific communicator class that applied the state-specific configuration, and finally send it on to the state:

Stage 1: Skateboard
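To make the shape of the skateboard concrete, here is a rough sketch of that single web app, assuming an ASP.NET Core controller and a simple communicator interface — all of the names and types here are illustrative rather than our actual code:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

public record TestResult(string StateCode, string Payload);   // illustrative shape only

// Illustrative names — not the real Relaymed code.
public interface IStateCommunicator
{
    string StateCode { get; }                    // e.g. "NY" or "FL"
    string BuildHl7(TestResult result);          // apply the state-specific configuration
    Task SendViaSftpAsync(string hl7Message);    // push the file to the state's SFTP folder
}

[ApiController]
[Route("api/state-report")]
public class StateReportController : ControllerBase
{
    private readonly IDictionary<string, IStateCommunicator> _communicators;

    public StateReportController(IEnumerable<IStateCommunicator> communicators) =>
        _communicators = communicators.ToDictionary(c => c.StateCode);

    [HttpPost]
    public async Task<IActionResult> Post(TestResult result)
    {
        var normalised = Normalise(result);                       // generic, state-agnostic manipulation
        var communicator = _communicators[normalised.StateCode];  // route to the state-specific class
        await communicator.SendViaSftpAsync(communicator.BuildHl7(normalised));
        return Ok();
    }

    private static TestResult Normalise(TestResult result) => result; // placeholder for the shared mapping
}
```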

We started with the ‘skateboard’ so we could work out our pain points, and there were a number of known issues with this approach. Firstly, there was nothing to handle the POST request failing to reach the StateReporter web app, and no error handling or retry within the web app if an exception was thrown.

While something like Polly could be used to implement retries for the HTTP request from the service that was sending the test result, our approach was instead to use a queue that would be read by an Azure Function.
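As a sketch of the option we didn’t take, a Polly retry policy on the sending side might look roughly like this — the retry count, back-off and endpoint URL are purely illustrative:

```csharp
using System;
using System.Net.Http;
using Polly;

// Sketch of the option we did not take: retrying the HTTP POST from the sending service.
var testResultJson = "{ \"stateCode\": \"NY\" }";  // placeholder payload

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(response => !response.IsSuccessStatusCode)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

using var client = new HttpClient();
var response = await retryPolicy.ExecuteAsync(() =>
    client.PostAsync("https://statereporter.example.com/api/state-report",
                     new StringContent(testResultJson)));

Console.WriteLine($"Final status code after retries: {(int)response.StatusCode}");
```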

The other issue was that the StateReporter web app was doing a lot of different tasks. It was performing generic data manipulation, applying individual state configuration and then being responsible for creating and sending the file to the external state’s SFTP folder. I won’t go deeply into microservices and the arguments for and against, but we have had very positive results in migrating a lot of our monolithic application to microservices, with the main benefits being much faster and safer deployments, easier local debugging and better testability.

Our second major iteration ended up looking like the following:

Stage 2: Scooter

Rather than receiving a POST request, the incoming test result is written to a queue on an Azure Service Bus. The StateReporter web app was replaced with a function that listens for a queue message and spins up when it receives one. It performs the generic manipulation and then routes the result to the individual state communicators, which were broken out into their own Azure Functions.
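A stripped-down sketch of that queue-triggered function, using the in-process Azure Functions model — the function, queue and connection names, and the single hard-coded route, are illustrative only:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class StateReporterFunction
{
    // Spins up when a test result lands on the incoming queue, applies the generic
    // manipulation, then forwards it to the relevant state communicator's queue.
    [FunctionName("StateReporter")]
    public static async Task Run(
        [ServiceBusTrigger("incoming-test-results", Connection = "ServiceBusConnection")] string message,
        [ServiceBus("new-york-communicator", Connection = "ServiceBusConnection")] IAsyncCollector<string> newYorkQueue,
        ILogger log)
    {
        var normalised = Normalise(message);      // shared, state-agnostic manipulation
        await newYorkQueue.AddAsync(normalised);  // route to the state-specific function
        log.LogInformation("Routed test result to the New York communicator.");
    }

    private static string Normalise(string message) => message; // placeholder for the shared mapping
}
```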

A function was a better fit than a web app, as multiple instances can scale out at once to cope with demand, while not running when not in use.

This was a big improvement and would allow us to create a template of a state communicator and create new ones rapidly, without the overhead of having to test and deploy an entire StateReporter web application. It also meant we could deploy at any point during the day, even during working hours, as the functions read from a queue; if a function went down, results would simply build up in the queue until we deployed a fix.

We now had a lot more resilience and redundancy, and while we set up Slack alerts for when a message could not be sent to a state after 10 attempts, our way of recovering was still to manually re-trigger the message, which led us to the next stage: auto recovery.

Stage 3: Auto Recovery

The issue was that if a state, for example New York, temporarily went down for a few minutes or a few hours, we would keep a copy of the failed message and have to manually drop it back into the Service Bus queue to re-send it.

As we started to add more states we saw that this needed to become automated, as it was taking up too much of our bandwidth. The solution was to build in a way for the system to recover itself. Within each state communicator, our infrastructure now looked like the following:

The communicator was broken into three separate functions: an HL7 message builder, a sender and an auto-recovery function.

The builder constructs the message ready to be sent, so the sender only needs to worry about sending the file. With an Azure Service Bus queue you get some built-in redundancy, as delivery of a message is retried a configurable number of times, although at the time of writing there is no exponential back-off retry available. This means that for anything longer than a transient outage the retries would likely fail, leaving us with a manual step to re-trigger the message back into the queue.
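For reference, that delivery count is configured on the queue itself rather than in the function; something along these lines, with illustrative resource names (the count matches the 10 attempts mentioned earlier):

```bash
# Illustrative resource names only.
az servicebus queue create \
  --resource-group rg-state-communicators \
  --namespace-name state-reporting-bus \
  --name new-york-sender \
  --max-delivery-count 10
```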

What we implemented uses Azure Event Grid. Once we successfully send a file to New York we emit an event, which triggers the AutoRecovery function. It spins up and tries to fetch any messages in the failed-messages blob storage. If there is nothing there it simply spins down; otherwise it places the messages back onto the sender’s queue to be re-sent.
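A simplified sketch of that auto-recovery function — the container, queue and connection names are assumptions for illustration, and the real implementation has more error handling:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Messaging.EventGrid;
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.EventGrid;
using Microsoft.Extensions.Logging;

public static class AutoRecoveryFunction
{
    // Triggered by the "file sent successfully" event. If any messages were parked
    // in the failed-messages container, put them back on the sender's queue.
    // Names here are illustrative, not the real resource names.
    [FunctionName("AutoRecovery")]
    public static async Task Run(
        [EventGridTrigger] EventGridEvent sentEvent,
        [ServiceBus("new-york-sender", Connection = "ServiceBusConnection")] IAsyncCollector<string> senderQueue,
        ILogger log)
    {
        log.LogInformation("Successful send event received: {Subject}", sentEvent.Subject);

        var container = new BlobContainerClient(
            Environment.GetEnvironmentVariable("StorageConnection"), "failed-messages");

        await foreach (var blob in container.GetBlobsAsync())
        {
            var blobClient = container.GetBlobClient(blob.Name);
            var content = (await blobClient.DownloadContentAsync()).Value.Content.ToString();

            await senderQueue.AddAsync(content);  // hand the failed message back to the sender
            await blobClient.DeleteAsync();       // and remove it from the parking lot
            log.LogInformation("Re-queued failed message {Name}", blob.Name);
        }
    }
}
```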

This means we can safely store failed messages, and as soon as the next message goes through successfully, the stored messages are retried. This is quite close to the circuit breaker pattern, as described by Microsoft: https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker

Another potential option would have been the Retry pattern, but this is arguably already provided by the Azure Service Bus queues, so in our case we have a combination of the two, which has been working well.
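For comparison, the classic in-process expression of that retry-plus-circuit-breaker combination (for example with Polly) looks roughly like this; in our system the same behaviour is spread across the Service Bus delivery retries and the Event Grid-driven AutoRecovery function instead:

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Wrap;

// The two patterns expressed in-process with Polly, for comparison only.
var retry = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var circuitBreaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromMinutes(1));

// The retry wraps the breaker: repeated failures trip the circuit open, and further
// calls fail fast until the break duration has passed and a trial call succeeds.
AsyncPolicyWrap policy = Policy.WrapAsync(retry, circuitBreaker);

await policy.ExecuteAsync(() => SendToStateAsync("MSH|^~\\&|..."));  // hypothetical sender

static Task SendToStateAsync(string hl7Message) => Task.CompletedTask;  // stub for illustration
```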

We have now also scripted the creation of the communicator resources with the help of the Azure CLI. We are now able to create a new state and be sending it test messages in hours rather than days. The only issue is that we’re now running out of states to integrate…
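The script itself is specific to our setup, but the shape of it is just a handful of Azure CLI calls per state, something like the following (names, region and runtime are illustrative):

```bash
# Illustrative only — resource names, region and plan are assumptions, not our real script.
az group create --name rg-state-communicators --location eastus

az storage account create --name statecommsstorage --resource-group rg-state-communicators

az functionapp create \
  --name new-york-communicator \
  --resource-group rg-state-communicators \
  --storage-account statecommsstorage \
  --consumption-plan-location eastus \
  --runtime dotnet \
  --functions-version 4
```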

Thanks for reading, and if anyone from the Azure Service Bus team happens to read this, my Christmas wish list would include adding support for exponential back-off retries. Cheers!
