Capacity planning for Azure Event Hubs

Published in Microsoft Azure · 5 min read · Jan 18, 2024

A critical step in developing a system or application is making sure that the infrastructure can handle the expected load and traffic, and knowing where the infrastructure limits are.

For existing systems, it is crucial to analyze current usage, predict future growth, and scale the infrastructure accordingly to ensure the system’s health.

In this post we will focus on Azure Event Hubs, a highly scalable data streaming platform that can ingest and process millions of events per second. It supports low-latency, real-time data analysis and integration.

Event Hubs namespace properties

An Event Hubs namespace has several properties that determine how much load it can handle. Here are the most significant ones to inspect:

Tiers

Event Hubs comes in four tiers (Basic, Standard, Premium, and Dedicated) that you should consider when planning capacity. They differ substantially not only in scale capacity and limits but also in the features they provide, such as availability options and Auto-inflate.

Throughput

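Each tier lets you define throughput as a number of units. The unit has a different name in each tier (throughput units in Basic and Standard, processing units in Premium, capacity units in Dedicated), but the meaning is the same: it defines the capacity your namespace can reach for ingress (incoming messages) and egress (outgoing messages), in terms of data volume per second and events per second.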
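To make the unit arithmetic concrete, here is a minimal sketch for the Standard tier, where one throughput unit (TU) allows up to 1 MB/s or 1,000 events/s of ingress and up to 2 MB/s or 4,096 events/s of egress. The function name and the sample workload are illustrative assumptions, and the calculation does not carry over to Premium or Dedicated units.

```python
import math

# Standard-tier limits per throughput unit (TU).
INGRESS_MB_PER_TU = 1.0
INGRESS_EVENTS_PER_TU = 1000
EGRESS_MB_PER_TU = 2.0
EGRESS_EVENTS_PER_TU = 4096


def required_throughput_units(ingress_mb_s, ingress_events_s,
                              egress_mb_s, egress_events_s):
    """Return the smallest TU count that covers all four limits."""
    needed = max(
        ingress_mb_s / INGRESS_MB_PER_TU,
        ingress_events_s / INGRESS_EVENTS_PER_TU,
        egress_mb_s / EGRESS_MB_PER_TU,
        egress_events_s / EGRESS_EVENTS_PER_TU,
    )
    return max(1, math.ceil(needed))


# Hypothetical workload: 3,500 events/s of ingress at 2.5 MB/s,
# read by two consumer groups (so egress is twice the ingress).
tus = required_throughput_units(2.5, 3500, 5.0, 7000)
print(tus)
```

In this sample the ingress event rate is the binding limit, not the bandwidth, which is why it is worth checking all four numbers rather than only MB/s.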

Partition count

This is one of the key factors in capacity planning because it has a major impact on concurrency: if you do not use batching and the partition count is N, you can process only N messages at the same time.
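A rough way to turn this into a number, assuming no batching and one message in flight per partition: a partition can complete 1 / (processing time) messages per second, so the partition count must be at least the event rate multiplied by the processing time. The function name, the headroom factor, and the workload below are hypothetical.

```python
import math


def min_partitions(events_per_second, ms_per_event, headroom=1.0):
    """Partitions needed when each partition handles one message at a time.

    The required concurrency is rate * processing time; headroom scales
    the result to leave room for spikes.
    """
    needed = events_per_second * ms_per_event * headroom / 1000
    return max(1, math.ceil(needed))


# Hypothetical workload: 400 events/s, 50 ms of processing per event.
print(min_partitions(400, 50))
print(min_partitions(400, 50, headroom=1.5))
```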

So here are the steps you should take to conduct capacity planning for Event Hubs.

Step 1: Analyze the application requirements of your use case.

Consider the following:

How many Event Hubs instances do you need for your flow?
Do you have one consumer of each message or more? This determines the number of consumer groups.
Do you need to process messages in order? Event Hubs preserves order only within a single partition, not across the whole event hub, so if you need global ordering Event Hubs might not be the right solution for you.
Do you need some messages to be processed “together” (by the same consumer)? This can be achieved using a partition key.
How tolerant are you of data loss?
How tolerant are you of message duplication?
What is the maximum allowed latency for processing, that is, the maximum time allowed from when a message is sent to the event hub until it is received for processing?
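The partition key idea from the list above can be illustrated with a small sketch: events that share a key are hashed to the same partition, so they reach the same consumer. The hash used here is an assumption chosen for illustration only, since the Event Hubs service applies its own hashing, and the device ids are hypothetical.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition index with a stable hash.

    Illustrative only: the Event Hubs service uses its own hash function.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Hypothetical events keyed by device id.
events = [("device-1", "temp=20"), ("device-2", "temp=31"), ("device-1", "temp=21")]
for key, body in events:
    print(key, body, "-> partition", partition_for(key, 4))
```

Both `device-1` events map to the same partition index, which is the property that lets a single consumer see them together.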

Step 2: Analyze the non-functional requirements of your use case.

What throughput do you expect for each event hub in the namespace, for both ingress and egress?

What is the message size:

What is the range of message sizes?
What is the distribution of sizes?
What is the maximum size you expect?
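Event rate, message size, and the number of consumer groups combine into the ingress and egress figures you need for sizing. A minimal sketch, assuming every consumer group reads the full stream and treating 1 MB as 1,000,000 bytes (the function name and the sample values are hypothetical):

```python
def stream_rates(events_per_second, avg_event_bytes, consumer_groups):
    """Derive ingress and egress rates for one event hub."""
    ingress_mb_s = events_per_second * avg_event_bytes / 1_000_000
    return {
        "ingress_events_s": events_per_second,
        "ingress_mb_s": ingress_mb_s,
        # Each consumer group reads every event, so egress scales with it.
        "egress_events_s": events_per_second * consumer_groups,
        "egress_mb_s": ingress_mb_s * consumer_groups,
    }


# Hypothetical event hub: 2,000 events/s of 1,500 bytes, 3 consumer groups.
rates = stream_rates(2000, 1500, 3)
print(rates)
```

Using the average size gives the steady-state figure; repeating the calculation with the maximum size shows the worst case.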

Processing Duration:

What is the distribution (percentiles) of processing durations?
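If you already have measurements (from logs or a prototype), the percentiles can be computed with the standard library. This is a sketch only: the sample durations are made up, and the same helper works for message sizes.

```python
import statistics


def percentiles(samples, points=(50, 95, 99)):
    """Return the requested percentiles of a list of measurements."""
    # n=100 yields 99 cut points; cut point p-1 is the p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


# Hypothetical processing durations in milliseconds.
durations_ms = [12, 15, 14, 13, 18, 22, 16, 15, 140, 17, 14, 19]
print(percentiles(durations_ms))
```

A single slow outlier like the 140 ms sample barely moves the median but dominates the high percentiles, which is why planning on the average alone can be misleading.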

Infrastructure considerations:

What is the location (region) of the event hub namespace, the consumers, and the producers?
Reliability level: multi-AZ, geo-redundancy, etc.

Cost:

What is a reasonable cost for you? For example, how much more will the Premium tier cost than Standard?

Step 3: Select the required Event Hubs namespace tier and properties.

By analyzing the above, you can form a good estimate of the Event Hubs properties (tier, partition count, units) that will satisfy both your application and non-functional requirements.

Be mindful that certain decisions cannot be changed after the namespace is created. In such cases, tier selection being one example, you may need to create a new namespace and migrate to it.

Note: After considering the aspects above, you might discover that Event Hubs does not meet your requirements and that you need a different solution. For example, you might need extremely high throughput that would require a prohibitively expensive configuration, or more consumer groups than the maximum allowed.

Step 4: Run a benchmark.

Although you can estimate the required throughput and limits by calculation alone, it is hard to account for all factors, especially in a complex architecture (e.g., a namespace that has many event hubs with a substantial number of consumer groups).

You need to run a benchmark that best simulates the load to see that you can handle it, and then increase the load to discover the limitations.
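The shape of such a run can be sketched as a stepped load test: pace the sender at a target rate, record the rate actually achieved, then raise the target until the achieved rate stops following it. Everything here is an assumption for illustration; in particular `fake_send` is a hypothetical stand-in for a real producer call, not part of any Event Hubs SDK.

```python
import time


def run_step(send, payload, target_rate, duration_s):
    """Pace calls to send() at target_rate events/s for duration_s seconds.

    Returns the achieved rate, which falls below the target once send()
    cannot keep up.
    """
    interval = 1.0 / target_rate
    start = time.perf_counter()
    sent = 0
    while time.perf_counter() - start < duration_s:
        due = start + sent * interval
        delay = due - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        send(payload)
        sent += 1
    return sent / (time.perf_counter() - start)


def fake_send(payload):
    """Hypothetical stand-in for a real producer call (about 2 ms each)."""
    time.sleep(0.002)


payload = b"x" * 1024  # hypothetical 1 KB event body
for target in (100, 200, 400, 800):
    achieved = run_step(fake_send, payload, target, duration_s=0.2)
    print(f"target={target}/s achieved={achieved:.0f}/s")
```

The step at which the achieved rate flattens while the target keeps rising marks the limit of whatever sits behind `send`.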

How do you run the actual benchmark? One option is to run your application in a staging environment, create load artificially, and monitor the metrics to get the results. This option is less recommended because many other factors may affect the load, which makes it hard to isolate the load on the Event Hubs namespace itself.

The recommended approach is to run a dedicated benchmark tool. I searched for one, and all I could find were extremely limited solutions built on the Event Hubs Kafka integration, so I decided to implement one myself.

You can get it on GitHub: oshvartz/event-hubs-dotnet-messaging-performance (a repo that can be used to help benchmark Azure Event Hubs throughput).

It is a simple CLI tool that helps you simulate load on an Event Hubs namespace. Its parameters let you first build your use case and then tune some of them to optimize performance.

Sometimes you need to run multiple benchmark processes in parallel to build your use case: for example, if you have multiple event hubs in the same namespace.

The benchmarking step is NOT optional. It is especially important when using the Premium or Dedicated tiers, since the documentation gives no exact numbers for calculating throughput from the number of PUs (processing units) you defined.

Step 5: Validate your decision.

Once you have the benchmark results you can answer the following:

Can it handle the current load?
What limits can it reach?

At this point you know whether the selection you made in step 3 was correct, and you can factor in the growth rate: how close you are to the limits and how long it will take to reach them.
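The time to reach the limit can be estimated with a simple compound-growth calculation. This sketch assumes a constant monthly growth rate, which is a simplification, and the numbers are hypothetical.

```python
import math


def months_until_limit(current_rate, limit_rate, monthly_growth):
    """Months of compound growth before current_rate reaches limit_rate."""
    if current_rate >= limit_rate:
        return 0.0
    if monthly_growth <= 0:
        return math.inf  # without growth the limit is never reached
    return math.log(limit_rate / current_rate) / math.log(1 + monthly_growth)


# Hypothetical: 4,000 events/s today, benchmark limit of 9,000 events/s,
# traffic growing 10% per month.
print(round(months_until_limit(4000, 9000, 0.10), 1))
```

With these sample numbers the limit is roughly eight and a half months away, which tells you how soon the plan needs to be revisited.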

Note: If the benchmark results do not satisfy your requirements, you may need to go back to the previous steps.

Conclusions

Capacity planning is critical, and for complex systems and components such as Event Hubs it is not a simple task. It is important to do it as early as possible since it can affect your solution and your infrastructure. The steps I describe in this article are similar for most of the infrastructure components you choose to use in your application. So, do not be lazy 😊 and do not skip the capacity planning step, since it can be costly to discover a mistake in the late phases of development or even in production.