Azure Functions and Event Hubs: Optimising for Throughput

…or the power of parallel consumers and host.json

TL;DR. I take a fairly standard serverless event processing scenario — an Azure Function triggered by messages in an Event Hub — and talk about how to optimise this system for throughput through: 
a) overall architecture 
b) EH partitioning, and
c) tweaking the Event Hub trigger host.json settings: maxBatchSize, prefetchCount and batchCheckpointFrequency. I doubt anyone would be interested in reading yet another theoretical opus, so for my reader’s benefit I ran a series of experiments which involved testing multiple combinations of Event Hub partition counts and host.json settings, and observing the impact on latency and throughput. Feel free to jump straight to the approach and results published to Power BI.

In other words, I’m trying to help those individuals who are at present staring detachedly into the middle distance, trying to understand:

Why is my Azure Function suddenly struggling to catch up / lagging a few hours behind? I thought the stupid cloud would scale out my stuff automagically…

Near real-time event processing is powerful and also complex since you’re no longer dealing with a traditional once-a-day batch job (that can run late, be rerun if needed, etc.) but with a distributed event-driven system, triggered by individual events (or micro-/mini-batches).

In Azure, a textbook serverless streaming scenario would be an Azure Function App triggered by messages in an Event Hub. It’s a common pattern, especially when your data requires a transformation to make it palatable for the rest of your pipeline.

Imagine you’re dealing with a stream of gzipped XML messages (it happens more often than you think!).

Azure Functions’ native Event Hub trigger will take care of firing your code in response to events in the stream. While the trigger works great out of the box, at scale (>10K msgs/sec) you will need to tweak a couple of settings in host.json and function.json. By the way, everything discussed here applies across all languages supported by the Azure Functions Runtime, v1 and v2 (the latter is recommended due to improved performance). Please mind the syntactic differences in host.json between the two versions though.

The Event Hubs trigger is not exclusive to Azure Functions: it is built on the Event Processor Host, which you can also use when building your own Event Hub consumers running in VMs, containers, WebJobs, etc.

A Word About Measuring Latency And Throughput

When measuring latency, taking the average is almost never a good idea: it hides outliers. Consider using a combination of percentiles and a maximum value (a good read if you want a deeper dive). Since we may not always have full control over event publishers and their network connection, decide whether you are going to measure end-to-end latency or pipeline latency (or both).

E2E latency and streaming pipeline latency
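As a minimal illustration of the percentile-plus-maximum approach (plain Python, no Azure dependencies; `latency_summary` is my own helper):

```python
import statistics

def latency_summary(samples_ms):
    """Summarise latency samples with percentiles and the max, not the mean."""
    ordered = sorted(samples_ms)
    # statistics.quantiles with n=100 returns 99 cut points:
    # index 49 is P50, 94 is P95, 98 is P99
    q = statistics.quantiles(ordered, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98], "max": ordered[-1]}

# One big outlier per six samples barely moves the median but dominates the max
samples = [120, 130, 125, 118, 122, 4000] * 20
print(latency_summary(samples))
```

Note how the median stays near the typical value while the maximum exposes the outlier that a plain average would smear across all samples.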

When measuring end-to-end latency you will require every message to contain a timestamp you can trust, also known as Application Time. When measuring pipeline latency, Event Hub’s enqueuedTime attribute reliably captures the time of arrival of each message. Please keep in mind that since the pipeline latency is based on messages’ time of arrival, it’s usually not a bad idea to implement a policy for late arrival and out-of-order events.

When measuring throughput, decide if you’re counting messages or bytes; both approaches are valid (which one do you think would be a better fit if your payload size fluctuates significantly?). Also, it’s recommended to measure sustained throughput, or a running average over a sufficiently long period of time.
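To make “sustained throughput” concrete, here is a sketch of a sliding-window running average in plain Python (the `SustainedThroughput` class and its window length are illustrative, not part of any SDK):

```python
from collections import deque
import time

class SustainedThroughput:
    """Running average of messages/sec over a sliding time window (sketch)."""
    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.events = deque()  # (timestamp, message_count) per observed batch

    def record(self, message_count: int, now: float = None):
        now = time.monotonic() if now is None else now
        self.events.append((now, message_count))
        # Drop observations that have slid out of the window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def rate(self, now: float = None) -> float:
        now = time.monotonic() if now is None else now
        total = sum(count for ts, count in self.events if now - ts <= self.window)
        return total / self.window
```

A short burst barely moves a 5-minute window, which is exactly the smoothing you want before claiming a pipeline “does 15K msg/sec”.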

Some throughput metrics can be used out of the box, for instance Event Hub’s Incoming Messages and Outgoing Messages. One way of measuring pipeline latency is via custom metrics in Azure Application Insights. All that’s needed is a couple of extra lines of code to calculate the difference between Event Hub’s enqueuedTime and your code’s invocation timestamp and push this value out to AppInsights.
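Those “couple of extra lines” might look like this (a sketch; the `track_metric` call in the comment assumes the `applicationinsights` Python package):

```python
from datetime import datetime, timezone

def pipeline_latency_ms(enqueued_time: datetime, now: datetime = None) -> float:
    """Milliseconds between Event Hub arrival (enqueuedTime) and our invocation."""
    now = now or datetime.now(timezone.utc)
    return (now - enqueued_time).total_seconds() * 1000.0

# Inside the function body you would then push the value as a custom metric,
# e.g. with the applicationinsights package (assumed):
#   telemetry_client.track_metric(
#       "pipeline_latency_ms", pipeline_latency_ms(event.enqueued_time))
```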

Here’s a half-decent sample SLA statement for a streaming scenario: 99% of all events arriving in the Event Hub must be processed within the hot path within 1 second. The maximum processing delay should never exceed 30 seconds. The pipeline should be capable of processing up to 15K messages per second 24/7, provided the message payload does not exceed 1KiB.

Experiment

Let’s empirically measure how throughput and latency are impacted by the number of EH partitions, batch and prefetch sizes. Since this will require quite a few iterations, let’s automate the process via Terraform.

For every iteration let’s create a 20 TU Standard Event Hub and an EH-triggered Azure Function. Let’s also provision a load generator to flood the EH with messages saturating the ingress. Our goal is to measure the resulting Throughput (T) and Latency (L) for various combinations of:
- P: the number of Partitions in the Event Hub,
- B: the maximum Batch size (maxBatchSize setting in host.json), and
- R: the pRefetch Limit (prefetchCount setting in host.json)

Each iteration should be independent with a single EH per EH Namespace, one consumer group per EH. 20 Throughput Units should allow us to achieve a theoretical 20K msg/sec on the ingress (event publisher) side and twice that on the consumer (Azure Function) side.

The number of Event Hub partitions does not define its theoretical throughput. There is no specific throughput limit on an Event Hub partition (…anymore)

The Azure Function runs in a Consumption Plan on Windows, performing a null operation on every message (it could’ve been anything) and reporting the processing latency per batch of messages. You could report per individual message within each batch, but you should assess how much more telemetry will be generated and adjust sampling accordingly.

If interested, look through the code and explore the results in Power BI or read on. To produce the Power BI report, I’m joining a couple of Event Hub metrics stored in Azure Log Analytics with the custom metrics in Application Insights via a cross-resource query in Kusto which in turn can be easily consumed in Power BI.

Run 1: 32 partitions, 20 TUs, maxBatchSize=64, prefetchCount=0

I’d call this ‘expected behaviour’: the pipeline’s latency (top chart) is in the hundreds of seconds in the beginning, as the Azure Function is gradually scaling itself out (bottom chart) until eventually it reaches 32 instances, which is precisely one consumer per partition per consumer group.

Remember, since we’re dealing with an Event Hub, it’s parallel consumers, not competing consumers
Top: latency, ingress, egress. Middle: batch size. Bottom: Azure Function instances

Once the Function is running at full throttle it eventually catches up with the backlog of messages in the Event Hub and the latency drops to around 220ms towards the end of the 15-minute run. The 5-min 95P and 99P latency values read 3.5 and 3.88 seconds respectively; if I had it running for longer, these values would’ve dropped further.

The middle chart shows the average message batch size my Azure Function was being triggered with. Interestingly, it stays well below maxBatchSize=64, which indicates some spare capacity in the processing pipeline.

What if we increase maxBatchSize to 512?

Run 2: 32 partitions, 20 TUs, maxBatchSize=512, prefetchCount=0

The throughput is very similar to the run above but the pipeline catches up much quicker (understandably, look at the batch sizes in the beginning!). The 5-minute P95 and P99 latency measurements are both 180ms which is pretty low if you ask me.

Okay, what if we reduce the number of Event Hub partitions to 4?

Run 3: 4 partitions, 20 TUs, maxBatchSize=512, prefetchCount=0

Ouch. The pipeline does not seem to ever catch up: look at the growing latency and the difference between IncomingMessages (that is, how many messages were enqueued in the Event Hub) and OutgoingMessages (how many messages are processed by the Azure Function).

Why? Not enough parallelism. There are only 4 partitions, hence only 4 consumers for the same 20 Throughput Units. The two options to consider here are:

  1. Increase partition count (you will have to recreate your event hub)
  2. Increase the consumer’s performance to be able to reliably process at least 5K messages per second. If your processing is CPU-bound, moving the Azure Function from a Consumption Plan to a Premium App Service Plan might help

There are 57 more runs published to Power BI here for you to have a play with.

Summary

There are (roughly) four areas to look at to improve your pipeline’s throughput and reduce latency: Event Publishers, the Event Hub, Event Hub trigger settings and the Azure Function code.

Event Publishers

  • Write to EH using batches (mind the size limit!). Btw, this batch size has nothing to do with maxBatchSize
  • Use AMQP for efficiency
  • If reporting Application Time, use UTC
  • If using partition affinity, avoid creating hot partitions through a badly chosen partition key, as this will create a skew on the processing side. If your scenario does not require FIFO or in-order processing (which can only be achieved within a single partition), do not specify the partition id at all and let events be written round-robin. Some more reading here
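The size-aware grouping behind batched sends can be sketched in plain Python. The SDK’s batch object does this bookkeeping for you, but the idea is simply to keep each batch under the service’s size cap (`batch_by_size` and the 1 MB default are illustrative; check the limit for your tier):

```python
def batch_by_size(messages, max_batch_bytes=1_000_000):
    """Group already-encoded messages into batches under the Event Hub size cap."""
    batch, batch_bytes = [], 0
    for msg in messages:
        size = len(msg)
        # Close the current batch once adding this message would overflow it
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(msg)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded group would then go out as one send operation, amortising the per-call overhead across many messages.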

Event Hub

  • Choose the number of partitions appropriately, since it defines the number of parallel consumers. More details here
  • For high-throughput scenarios consider Azure Event Hubs Dedicated
  • When working out how many Throughput Units you require, consider both the ingress and the egress sides. Multiple consumer groups will compete for egress throughput
  • If you enable Event Hub Capture you can use the AVRO files landing on Blob Storage to trigger your cold path / batch processing, it’s a supported trigger too
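For the Throughput Unit sizing bullet above, here is a back-of-the-envelope calculator (assuming the documented per-TU limits of 1 MB/s or 1000 events/s ingress, whichever comes first, and 2 MB/s egress; `required_throughput_units` is my own helper, not an SDK call):

```python
import math

def required_throughput_units(msgs_per_sec, avg_msg_kib, consumer_groups=1):
    """Rough TU sizing: 1 TU = 1 MiB/s or 1000 msgs/s ingress, 2 MiB/s egress."""
    ingress_mib = msgs_per_sec * avg_msg_kib / 1024
    # Ingress is capped by whichever limit is hit first: bytes or message count
    ingress_tus = max(ingress_mib / 1.0, msgs_per_sec / 1000.0)
    # Every consumer group reads the full stream, so egress scales with their count
    egress_tus = (ingress_mib * consumer_groups) / 2.0
    return math.ceil(max(ingress_tus, egress_tus))
```

For the sample SLA earlier (15K msg/sec at 1KiB), the message-count limit, not bytes, is what drives the TU requirement; add consumer groups and egress starts to dominate.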

Event Hub Trigger Settings: host.json and function.json

  • Explicitly set “cardinality” to “many” in the function.json to enable batching of messages
  • maxBatchSize in host.json: the default setting of 64 may not be sufficient for your pipeline; adjust, measure and adjust again. Keep in mind that editing host.json will restart your Azure Function
  • prefetchCount in host.json: the meaning of this setting is “how many messages to fetch and cache before feeding them in batches of maxBatchSize to the Function”. I usually set it explicitly to 2*maxBatchSize. By the way, setting it to any non-zero value below maxBatchSize will have a negative impact on performance by reducing the batch size
  • batchCheckpointFrequency in host.json: have a look at the storage account associated with your Azure Function and you will see how checkpoints are stored as tiny json files per partition per consumer group. The default setting of 1 tells the Azure Function to create a checkpoint after successfully processing every batch. A batch will be considered successfully processed if your code runs successfully (you’re still responsible for catching exceptions). I usually start with the default value of 1 and increase this value a bit when I see throttling events on the storage account associated with the Function (things can get especially nasty when multiple Azure Functions share one storage account). The downside of increasing batchCheckpointFrequency is that, in case of a crash, your Function will have to replay more messages since the last checkpoint
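Putting the three settings together, a Functions v2 host.json might look like this (values taken from Run 2 above; do verify the exact schema against the Event Hubs extension version you are running):

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 1,
      "eventProcessorOptions": {
        "maxBatchSize": 512,
        "prefetchCount": 1024
      }
    }
  }
}
```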

Some additional reading here and here

Azure Function

  • Make sure your code is written to process events in variable-size batches
  • Use non-blocking async code
  • Enable Application Insights but carefully assess the amount of telemetry you require and tweak the aggregation and sampling settings in host.json accordingly
  • Disable built-in logging by deleting the AzureWebJobsDashboard app setting. By default, the Azure Function logs to Blob Storage and under high workloads you may lose telemetry due to throttling
  • A Consumption Plan may not always be a good fit from a performance perspective; consider using a Premium App Service Plan or deploying the Event Processor Host on appropriately sized VMs/containers instead
  • When dealing with Azure Event Hubs, there’s no concept of “locking”, “deadlettering”, etc. Make sure to handle exceptions at the individual message level. A great write-up on the subject is here
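To tie the last two bullets together, here is a sketch of non-blocking batch processing with per-message exception handling (all names are illustrative; `process_event` stands in for your transformation):

```python
import asyncio

async def process_event(body: bytes) -> None:
    # Hypothetical per-message work; replace with your transformation
    if not body:
        raise ValueError("empty payload")

async def handle_batch(events: list) -> dict:
    """Process a batch concurrently. Event Hubs offers no locking or
    deadlettering, so failures are caught and routed by this code,
    not by the service."""
    results = await asyncio.gather(
        *(process_event(e) for e in events), return_exceptions=True
    )
    failed = [e for e, r in zip(events, results) if isinstance(r, Exception)]
    # e.g. write `failed` to a poison-message store instead of failing the batch
    return {"ok": len(events) - len(failed), "failed": len(failed)}
```

One bad message no longer fails (and replays) the whole batch; you decide where the poison messages go.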

Thanks for reading and until next time!

Additional Materials