Monitoring Lag for Premium and Dedicated Azure Event Hubs

Stefan Hudelmaier
4 min read · Mar 26, 2023

The lag indicates how well Event Hub consumers are keeping up with producers. A lag close to 0 indicates that the consumers are fast enough. A higher lag, especially one that keeps on growing, is a hint that consumers must be optimized or that they must be given more resources. Monitoring the lag of an Event Hub is very important for building reliable systems.

There is no built-in way to monitor the lag of Event Hubs in the Basic and Standard SKUs (but you can implement it yourself or use a ready-made solution from the Azure Marketplace).

There is a built-in lag metric in the Premium and Dedicated SKUs, however. We will show you how to use this metric in order to properly monitor your Event Hubs in these tiers.

The metric ConsumerLag is part of the Application Metrics Logs. These are part of the diagnostics information that you can enable for the Event Hub Namespace or cluster. We will first configure things interactively using the Azure Portal and later on automate it using Bicep.

Configuration via the Azure Portal

In the Event Hubs namespace, go to Diagnostic settings > Add diagnostic setting and create a new diagnostic setting that includes the category Application Metrics Logs. As the destination, select Send to Log Analytics workspace and choose one of your existing workspaces. Do not forget to save the settings.

The metric will now be exported to your Log Analytics workspace. It takes some time to appear, up to 15 minutes in our experience.

You can find the data in Log Analytics Workspace > Logs. You can use the following query to extract the relevant information from the rather verbose log entry:

AzureDiagnostics
| where ActivityName_s == 'ConsumerLag'
| project
    ConsumerGroup = ChildEntityName_s,
    EventHub = EntityName_s,
    PartitionId = PartitionId_s,
    Lag = Count_d,
    Timestamp = eventTimestamp_s

Figure: The metric and query in the Log Analytics workspace

You can now use this data to configure an alert. There will be an example later on based on Bicep.
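Before settling on an alert threshold, it can help to look at how the lag develops over time. The following is a minimal sketch that aggregates the same log entries into five-minute bins per Event Hub and consumer group. It assumes the column names used in the query above and relies on the standard TimeGenerated column of the AzureDiagnostics table; the bin size of five minutes is an arbitrary choice.

```kusto
// Sketch: maximum lag per Event Hub and consumer group in 5-minute bins.
// Assumes the same _s/_d column names as the query above.
AzureDiagnostics
| where ActivityName_s == 'ConsumerLag'
| summarize MaxLag = max(Count_d)
    by EventHub = EntityName_s, ConsumerGroup = ChildEntityName_s, bin(TimeGenerated, 5m)
| render timechart
```

A series that keeps climbing from bin to bin points to consumers that are falling behind, whereas short spikes that return to 0 are usually harmless.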

Configuration of the metric via Bicep (ARM Templates)

You can automate enabling the Application Metrics Logs with the following Bicep snippet.

// `location` and `logAnalyticsWorkspace` are assumed to be declared elsewhere in the template.

// Event Hubs namespace in the Premium SKU
resource eventHubs 'Microsoft.EventHub/namespaces@2021-06-01-preview' = {
  name: 'myeventhubs'
  location: location
  sku: {
    name: 'Premium'
    tier: 'Premium'
    capacity: 1
  }
  properties: {
    disableLocalAuth: false
    zoneRedundant: false
    isAutoInflateEnabled: false
    kafkaEnabled: false
  }
}

// Diagnostic setting that exports the Application Metrics Logs to Log Analytics
resource logAnalyticsConnection 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'diagnostics'
  scope: eventHubs
  properties: {
    logs: [
      {
        category: 'ApplicationMetricsLogs'
        enabled: true
      }
    ]
    workspaceId: logAnalyticsWorkspace.id
  }
}

We first define the Event Hub Namespace (the Premium or Dedicated SKU is required, otherwise the Application Metrics Logs are unavailable). The diagnostic settings are a so-called extension resource: they are not a child resource of the Event Hub Namespace but are attached to it via the scope property.

Let’s create an alert rule now for this metric.

// `alertRuleName`, `location`, `logAnalyticsWorkspace` and `actionGroup` are assumed to be declared elsewhere in the template.
resource lagAlert 'Microsoft.Insights/scheduledQueryRules@2022-08-01-preview' = {
  name: alertRuleName
  location: location
  properties: {
    displayName: alertRuleName
    severity: 3
    enabled: true
    evaluationFrequency: 'PT5M'
    scopes: [
      logAnalyticsWorkspace.id
    ]
    targetResourceTypes: [
      'Microsoft.OperationalInsights/workspaces'
    ]
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          query: 'AzureDiagnostics | where ActivityName_s == \'ConsumerLag\' | project ConsumerGroup = ChildEntityName_s, EventHub = EntityName_s, PartitionId = PartitionId_s, Lag = Count_d, Timestamp = eventTimestamp_s'
          timeAggregation: 'Maximum'
          metricMeasureColumn: 'Lag'
          dimensions: [
            {
              name: 'ConsumerGroup'
              operator: 'Include'
              values: [
                '*'
              ]
            }
            {
              name: 'EventHub'
              operator: 'Include'
              values: [
                '*'
              ]
            }
          ]
          operator: 'GreaterThan'
          threshold: 100
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    autoMitigate: true
    actions: {
      actionGroups: [
        actionGroup.id
      ]
    }
  }
}

This will raise an alert as soon as the maximum lag of any combination of consumer group and Event Hub exceeds 100 within a five-minute window. The most complex part of the snippet is the query, which is identical to the one we showed above. You can tune the settings numberOfEvaluationPeriods, minFailingPeriodsToAlert, windowSize, evaluationFrequency, etc. to your liking.

It is important to note that, unfortunately, deploying this will only work once the first metrics have been exported to Log Analytics. Otherwise you will get the error SEM010 - Semantic Error - 'where' operator: Failed to resolve column or scalar expression named 'ActivityName_s'. For this reason you cannot define the alert rule in the same Bicep deployment as the Event Hub, diagnostic settings, etc., which is a bit of a bummer.
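Regarding the SEM010 error: it occurs because the column ActivityName_s does not exist in the workspace until the first log entries have arrived. A possible workaround, which is an assumption on our part and has not been verified against the alert rule above, is to read the columns through the KQL function column_ifexists, which falls back to a default value when a column is missing. The sketch below omits the Timestamp column because the alert does not use it.

```kusto
// Sketch of a query that still resolves when the ConsumerLag columns do not exist yet.
// column_ifexists(name, default) returns the default if the column is missing.
// Assumption: the AzureDiagnostics table itself already exists in the workspace.
AzureDiagnostics
| extend
    ActivityName = column_ifexists('ActivityName_s', ''),
    ConsumerGroup = column_ifexists('ChildEntityName_s', ''),
    EventHub = column_ifexists('EntityName_s', ''),
    PartitionId = column_ifexists('PartitionId_s', ''),
    Lag = column_ifexists('Count_d', real(null))
| where ActivityName == 'ConsumerLag'
| project ConsumerGroup, EventHub, PartitionId, Lag
```

If you try this, test the deployment against an empty workspace first. Should the error persist, deploying the alert rule in a separate, later step remains the safe option.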

Summary

Monitoring the lag of Event Hubs is very important. If you use the Premium or Dedicated SKU, you can apply this guide to leverage a built-in mechanism of Event Hubs. If you use the Basic or Standard SKU, consider using Lag Metrics, a product of ours on the Azure Marketplace, for your lag monitoring needs.
