In order event processing with Azure Functions
I met with a customer this week who was interested in using serverless but didn’t know if it would work for their business requirements. Here was the scenario:
Thousands of events are being sent from hundreds of different users at any given time. Each of these events needed to be processed and analyzed, and the processing must be done in order for each user.
They wanted the rapid development and abstracted infrastructure serverless brings, but their question to me was “Is there a way to guarantee processing events in a specific order when serverless can scale and be running on multiple parallel instances?” Below are details on how to do exactly that.
In-order messaging options
The first question to be answered is how to get the events to the Azure Function to begin with. HTTP or Event Grid wouldn’t work, as there is no way to guarantee that message 1 will arrive and be processed by your function before message 2 arrives and is processed. In order to maintain processing order, order needs to be persisted somewhere that my Azure Function can pull from. Queues work, but if using Azure remember that Azure storage queues do not guarantee ordering so you’d need to stick with Service Bus Queues or Topics. For this specific scenario though, Azure Event Hubs is the best fit.
Azure Events Hubs can handle billions of events, and has guarantees around consistency and ordering per partition. For the scenario above, the partition was clear: each users events would need to go to the same partition so consistency for all that users events would be guaranteed. This would be the first step in enabling in-order processing.
Trigger and processing in order
Now that the events are persisted in order with Event Hubs, I still have the challenge of processing them in order. This may seem extra complex with serverless technology, as a serverless function can scale across many parallel instances which could all be processing events concurrently. However there are some behaviors that allow us to enforce order while processing.
First, Azure Function’s event hubs trigger uses event hubs processor hosts to pull messages. The processor host will automatically create leases on available partitions based on how many other hosts are provisioned. There is another important constraint that works in our benefit in this case: a single partition will only have a lease on one processor host at a time. That means multiple function instances can’t pull messages from the same partition. This also means your function will only ever scale to n+1 instances, where n is the number of partitions for the event hub. It’s n+1 because we keep one running and ready to immediately start pulling messages from a partition if an instance goes away for any reason (you aren’t charged for this overprovisioning though — because serverless is awesome).
So as long as I know the events that need to be kept in order are in the same partition, I can also guarantee that a single instance of Azure Functions will receive those messages at one time. But are those messages preserved throughout execution? Let’s run some tests.
Processing a single message from Functions
For my first test I have a simple Azure Function that will trigger on a message from Event Hubs and add it to an ordered list in a Redis cache. I will publish 1000 messages — in order — for 100 users in parallel (with each user ID being a partition key). I have 20 partitions in my event hubs, but that’s ok. Multiple users events may land in the same partition, but I still know all events in a single partition have order preserved, so it all works out.
After my functions trigger and process all messages, I can then compare the order in Redis cache to the order of the messages published.
Here’s what the first Azure Function looks like — I’m just going to pull in a single message and immediately push it to the end of the Redis list.
After sending 1000 in-order messages for 100 users in parallel, here’s what one of the user lists looked like in Redis:
Notice anything? While the order is pretty close, it’s not perfect. You can see a few examples above where one message got processed before or after its order. So what’s happening here? The answer is that even though a single instance of a function app gets all messages in a partition, it may process them concurrently. Specifically, when the Azure Function’s event hubs trigger grabs messages, it pulls them off in a batch (as many as Event Hubs will give it, usually around 200 messages but can be more). So even though the messages are initially pulled in order, they are then processed concurrently causing some items to complete before others.
To verify this, I configured the “maxBatchSize” for my function to 1. Now my function will only ever pull a batch of 1 message. After running my second test with this setting, I did in fact get perfect order, but it came at a cost. The function took a long time to burn down all of the messages, because it was having to go fetch each message individually from event hubs. Fetching each of the 100,000 total messages individually took a while to process. Luckily, there is a better way.
Triggering in ordered batches
I can take advantage of the throughput benefits from pulling batches from event hubs, and preserve order. The trick is simple: instead of passing a single event hub message into a function execution, I’ll pass in the entire ordered batch. I can then process each message in order, without the costly call to event hubs for each message. Here’s what the in-order batched function looks like:
Now when running the same test (sending 1000 messages in-order for 100 users in parallel), the results were just what I wanted. My function was able to process the 100,000 total messages in a few seconds, and when checking the results in Redis cache? Perfect order, for each of the 100 user’s lists. Event Hubs is preserving each user event order in partitions, functions pulls in ordered batches from available partitions (scaling out as much as possible), and maintains that order while processing the batch.
Hopefully this example has been helpful. I have the entire working solution (including the console app that sends the 1000 events for 100 users) in GitHub here.