Improved EventBridge Latency Opens Up New Use Cases at PostNL

Luc van Donkersgoed
PostNL Engineering
6 min read · Feb 20, 2023

PostNL has fully embraced AWS Serverless technology for its in-house development projects. EventBridge takes center stage in our large-scale event-driven application landscape — it routes millions of events between dozens of applications every day. With the recently introduced latency improvements, we will be able to bring additional industrial use cases into our EventBridge-based integration portfolio.

EventBridge as a centralized event broker

At the time of writing, PostNL employs 37,000 people and delivers an average of 1.2 million parcels per day. The software required to control this massive logistics network is built in-house by dozens of autonomous teams building fit-for-purpose, event-driven applications. These applications need to exchange their events, and to control and support this complex mesh network of data integrations, PostNL maintains a central event broker. The event broker’s beating heart is Amazon EventBridge, with a shell of supporting services responsible for event consistency, contracts, retries, and replays, and for providing insights into volumes, latencies, and event sizes.

The central event broker serves three primary use cases:

  1. It allows event producers to register their events and their schemas, and will provide dedicated endpoints for producers to publish these events to.
  2. It provides a catalog of registered events, used by potential consumers to find the events they might be interested in.
  3. It allows consumers to subscribe to a specific event, which configures the event broker to forward that event to the consumer’s endpoint (a minimal sketch of such a subscription follows below).
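
For illustration, a subscription under the hood roughly amounts to a rule plus a target on the central bus. The sketch below is a minimal approximation, assuming a hypothetical bus name, event pattern, and API destination; the real broker wraps this in additional tooling for contracts, retries, and replays.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical names; the real broker generates these per subscription.
BUS_NAME = "central-event-broker"
RULE_NAME = "parcel-sorted-to-consumer-x"
API_DESTINATION_ARN = "arn:aws:events:eu-west-1:111122223333:api-destination/consumer-x/abcd1234"
INVOCATION_ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-invoke-consumer-x"

# 1. A rule on the central bus matches the subscribed event type.
events.put_rule(
    Name=RULE_NAME,
    EventBusName=BUS_NAME,
    EventPattern=json.dumps({
        "source": ["nl.postnl.sorting"],   # hypothetical producer source
        "detail-type": ["ParcelSorted"],   # hypothetical event name
    }),
    State="ENABLED",
)

# 2. The consumer's HTTPS endpoint (an API destination) is attached as a target.
events.put_targets(
    Rule=RULE_NAME,
    EventBusName=BUS_NAME,
    Targets=[{
        "Id": "consumer-x-endpoint",
        "Arn": API_DESTINATION_ARN,
        "RoleArn": INVOCATION_ROLE_ARN,
    }],
)
```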

This approach has yielded great results. At the time of writing, we’re receiving about 97 million events per month, and we forward about 119 million events per month (a fan-out ratio of 1.23).

EventBridge latency

Up until February 17th, 2023, the major downside of EventBridge had been the latency introduced between the producer and the consumer. Consider the following historical trace in our observability tooling:

The left side of the waterfall chart shows that the event broker receives the event and validates that it matches its registered schema. This takes just over 18ms.

The right side of the chart shows that we forward the event to two target endpoints. These take 188ms and 58ms, most of which is due to HTTPS latency and the response times of the downstream systems.

The gap in the middle is the time from the moment the event was published onto the EventBridge event bus, until the moment it was received by the EventBridge target. In this example the latency introduced by EventBridge was 388ms.
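
One way to measure this gap (a sketch of the general approach, not necessarily how our observability tooling implements it) is to have the producer stamp a high-resolution publish timestamp into the event detail and have the consumer subtract it on receipt. The bus name, source, event name, and field name below are hypothetical.

```python
import json
import time
import boto3

events = boto3.client("events")

def publish(detail: dict) -> None:
    """Producer side: stamp the publish time (epoch millis) into the event detail."""
    detail["publishedAtMs"] = int(time.time() * 1000)    # hypothetical field name
    events.put_events(Entries=[{
        "EventBusName": "central-event-broker",          # hypothetical bus name
        "Source": "nl.postnl.sorting",                   # hypothetical source
        "DetailType": "ParcelObserved",                  # hypothetical event name
        "Detail": json.dumps(detail),
    }])

def handler(event, _context):
    """Consumer side (e.g. a Lambda target): latency = receive time minus publish time."""
    received_at_ms = int(time.time() * 1000)
    latency_ms = received_at_ms - event["detail"]["publishedAtMs"]
    print(f"Broker-to-target latency: {latency_ms}ms")
```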

The example above is actually faster than most events at the time. Measuring the latency across millions of events — before the latency improvements — yields the following graph:

The P50 measurement is 438ms, and P90 sits at 654ms. Fun fact: latency increases significantly, to 483ms (P50) / 747ms (P90), when the system is experiencing peak load.
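
If the per-event latencies are emitted to a custom CloudWatch metric, these percentiles can be pulled with extended statistics. The namespace and metric name below are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="EventBroker",               # hypothetical custom namespace
    MetricName="EventBridgeLatencyMs",     # hypothetical custom metric
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,                           # hourly datapoints
    ExtendedStatistics=["p50", "p90"],
)

for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    stats = datapoint["ExtendedStatistics"]
    print(datapoint["Timestamp"], f"p50={stats['p50']:.0f}ms p90={stats['p90']:.0f}ms")
```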

The problem with latency

For most event-driven systems, latency up to 1 second is not an issue. If you’re sending an email, invoice, or push notification, almost no-one cares if it is delayed by milliseconds or by a second. But this is not true for industrial systems, like the sorting machine in one of PostNL’s distribution centers shown below.

This machine continually sorts parcels — hundreds of thousands per day. You can’t see it in a picture, but these machines move fast. Every second, dozens of decisions about how and when to move a parcel are being made. If these decisions take too long, the entire machine will perform below its peak, which will lead to delays and queues of parcels, vans, and trucks.

If the sorting application were a simple solution, the latency introduced by EventBridge might look like the diagram above. Because the observation event and the sorting decision event both pass through the event bus, 900ms of latency is introduced by EventBridge alone. Latency generated by physical distance and actual computation time can easily increase the total time from observation to sorting action to 1.5 seconds.

But the sorting application is not a simple solution, and it in turn depends on other downstream systems. In reality the event flow might look like this:

In this scenario, EventBridge by itself introduces 4.75 seconds of latency — and compute time and network latency will be added on top of that. In 5 seconds, a parcel has moved quite a distance, and it is very likely that it has moved past the chute it should have been transported to.
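
As a back-of-the-envelope illustration of why these hops add up, the sketch below multiplies the pre-improvement per-hop P50 by an assumed hop count. The hop counts are assumptions for illustration, not a reconstruction of the actual diagram, and compute and network time come on top of the result.

```python
# Broker latency is paid once per hop through the central event bus.
PRE_IMPROVEMENT_P50_MS = 438   # per-hop EventBridge latency before February 17th, 2023

def broker_latency_s(hops: int, per_hop_ms: float = PRE_IMPROVEMENT_P50_MS) -> float:
    """Total latency contributed by EventBridge alone, excluding compute and network time."""
    return hops * per_hop_ms / 1000

# Simple case: observation event in, sorting decision event out -> 2 hops.
print(f"Simple solution:  ~{broker_latency_s(2):.1f}s")    # ~0.9s

# Realistic case: the sorting application also depends on downstream systems.
print(f"Chained solution: ~{broker_latency_s(10):.1f}s")   # ~4.4s with an assumed 10 hops
```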

The example clearly shows that 450ms of latency can significantly impact an event-driven landscape, and can make a service like EventBridge unsuitable for certain use cases.

February 2023: Improved EventBridge latency

On February 17th, 2023, EventBridge latency suddenly dropped by 67%. The P50 measurement is now 144ms (down from 438ms), while the new P90 is 231ms (down from 654ms).

The waterfall chart of a single event shows the same reduction. The EventBridge gap in the middle is much smaller, the entire flow takes less time, and a larger percentage of the chart is claimed by actual computation.

As the central event broker currently processes about 4.5 million events per day, this change saves us 1.3 million seconds (≈15 days) of latency. Every day.
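
The arithmetic behind that claim, spelled out with the P50 figures from above:

```python
EVENTS_PER_DAY = 4_500_000
OLD_P50_MS = 438
NEW_P50_MS = 144

saved_ms_per_event = OLD_P50_MS - NEW_P50_MS                        # 294ms per event
saved_seconds_per_day = EVENTS_PER_DAY * saved_ms_per_event / 1000  # ~1.3 million seconds

print(f"{saved_seconds_per_day:,.0f} seconds saved per day")
print(f"≈ {saved_seconds_per_day / 86_400:.1f} days of cumulative latency, every day")
```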

But more importantly, this improvement unlocks new industrial use cases like the one outlined above. Where the old 450ms EventBridge latency would result in almost 5 seconds of end-to-end latency, the new 145ms delay would bring that down to about 1 second — a period of time in which the parcels have not moved that far down the sorting line, leading to correct sorting behavior and no sorting delays.

Conclusion

EventBridge event buses have quickly become an essential component in the serverless landscape. They’re versatile, easy to configure, and cost-effective. They make decoupling services a breeze, while supporting the evolvability of your applications. Until now, the biggest concern was the latency they introduced, but at 145ms this is no longer a significant issue.

Now.. let’s see if we can get EventBridge FIFO support too — it would unlock even more use cases 😅
