Performance gains by using an event-driven architecture

Madalin Giurca
ING Hubs Romania
Oct 2, 2023 · 6 min read

When it comes to time, we all think of it as one of our most valuable resources, so ignoring it while building software is a no-no.


A considerable benefit of the microservices architecture is that, as long as the communication rules and protocols are followed, each service can use its own technology. This leads to better performance, because each service can pick the technology that serves its function best. Moving away from a monolithic application can be summed up in a single keyword: decoupling. Decoupling starts in the development life cycle of the whole application and extends to the runtime environments and systems where the various services run.

But how to communicate efficiently in a microservices mesh?

With all the decoupling mentioned before, one constant remains in the architecture: the communication means used by the services to exchange information. This is the main drawback of the microservices architecture, and it is absent from a monolithic application, where data exchange between parts of the application can be considered instantaneous. Selecting the communication means best suited to the application’s use case can make the difference between failure and success.

A common way to have the services talk to each other is by using REST (Representational State Transfer) calls over the HTTP/HTTPS protocol. HTTP(S) can be very helpful due to its ease of use, but in longer chains of calls where multiple services are involved, response times can grow beyond our control. Some attempts to minimize the performance degradation rely on asynchronous REST calls, but that is a workaround, not a solution.
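To make the synchronous style concrete, here is a minimal sketch using Java’s built-in HttpClient; the service URLs and the two-hop chain are illustrative assumptions, not part of the laboratory setup:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestChainExample {
    private static final HttpClient client = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Each hop blocks until the downstream service answers,
        // so latencies add up along the chain of services.
        String payment = call("http://payment-service/api/payments/42");
        String delivery = call("http://delivery-service/api/deliveries/42");
        System.out.println(payment + " / " + delivery);
    }

    private static String call(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // send() blocks the caller; sendAsync() would return a CompletableFuture,
        // which is the "asynchronous REST" workaround mentioned above.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}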

Event-driven architecture aims to solve this issue when the data exchange between services starts to become the bottleneck of a system. Instead of one service asking another for a data change, it fires an event correlated to the needed change. The service owning the data is responsible for monitoring and consuming such events and for updating the data according to its business logic.
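As a sketch of that interaction, firing such an event with the plain Apache Kafka producer API could look like the following; the topic name, key, and JSON payload are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Instead of calling the order service directly, we just announce
            // that something happened; whoever owns the data reacts to it.
            ProducerRecord<String, String> event = new ProducerRecord<>(
                    "order-events", "42", "{\"type\":\"ORDER_CREATED\",\"orderId\":42}");
            producer.send(event); // fire and forget; the broker handles delivery
        }
    }
}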

The critical aspect is the event broker, which needs high fault tolerance so that no event is lost and every event is made available to its consumers as soon as it is fired.

More than that, decoupling goes even further when events are used as the communication means. A new component can easily be introduced into the system to perform additional actions when something happens. The only information it requires is the structure of the event and the bus where it can find that event. Thus, the new service doesn’t need to know anything about the application producing those events.

This can be achieved, however, only when the events are generic and based on producer knowledge alone. Events should not be confused with messages sent over an event bus: a message is coupled to some specific logic, and its destination is known before it reaches the bus. As a disclaimer, in the laboratory setup, everything traveling through the bus is meant to be an event, not a message.
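To illustrate how little a newly added service needs to know, here is a minimal consumer sketch: only the topic name and the event structure are assumed, and the audit service itself is invented for the example:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AuditService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "audit-service"); // its own consumer group, invisible to producers
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The producer never learns that this service exists.
                    System.out.println("auditing event: " + record.value());
                }
            }
        }
    }
}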

Theoretically, it sounds like an improvement, but let’s put it to the test!

Laboratory setup

The whole project alongside the performance testing setup can be found on my GitHub repository.

Let’s imagine a part of a system dealing with customers’ orders and providing functionalities such as:

  1. Picking up new orders
  2. Updating order statuses based on payment and delivery status
  3. Providing details about the order status to the customer

This is our laboratory setup: a system consisting of a group of microservices, each with a limited responsibility. Now all we need is to have the same microservices talk to each other through REST for setup a) and through events sent over a Kafka bus for setup b).

As a disclaimer, there’s some extra complexity added to the system to better evaluate the performance gains of the event-driven approach.

a) interactions between the 4 services used to build our system
b) the same 4 microservices, but now using a Kafka bus to fire and listen to state changes

Performance testing

In order to isolate the testing, a single Azure virtual machine was provisioned. The specs of the machine under test are as follows:

  • Size: Standard D2s v3
  • vCPUs: 2
  • RAM: 8 GiB

The machine ran Ubuntu 20.04, with Docker installed to bring up either of the two setups: Kafka or REST communication between the services.

Using Gatling from the local machine, the setup was put under stress by varying the number of parallel users. The scenario is written for testing and tries to mimic the user’s behavior while interacting with the system; that is the main reason for the pauses in the scenario description.

import static io.gatling.javaapi.core.CoreDsl.*;
import static java.time.Duration.ofSeconds;

import io.gatling.javaapi.core.ScenarioBuilder;

ScenarioBuilder scn = scenario("Place order and checks until delivery ends")
        .pause(1, 5)                         // think time before placing the order
        .exec(UserInteractions.createOrderRequest)
        .exitHereIfFailed()
        .pause(1)
        .exec(UserInteractions.monitorOrderRequest)
        .exitHereIfFailed()
        .pause(5, 10)                        // user reviews the order and pays
        .exec(UserInteractions.approvePaymentRequest)
        .exitHereIfFailed()
        .pause(2)
        // keep polling the order status for up to 80 seconds
        .asLongAsDuring(UserInteractions.orderNotFinalized, ofSeconds(80)).on(
                exec(UserInteractions.monitorOrderRequest
                        .check(jsonPath("$.orderDetails.orderStatus").saveAs("orderStatus"))
                ).pause(2, 15)
        );
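For completeness, a scenario like this is typically attached to a Gatling simulation through an injection profile that controls the number of parallel users. A minimal sketch, with the base URL, the ramp profile, and the trimmed-down scenario being assumptions rather than the repository’s exact configuration:

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class OrderSimulation extends Simulation {

    // Assumed address of the VM exposing the services.
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    // Stand-in for the full scenario shown above.
    ScenarioBuilder scn = scenario("Place order and checks until delivery ends")
            .exec(http("create order").post("/orders"));

    {
        // Ramp the number of parallel users to find each setup's breaking point.
        setUp(scn.injectOpen(rampUsers(200).during(120)))
                .protocols(httpProtocol);
    }
}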

Results

Aggregating the data from the Gatling results, we can easily notice that as the number of parallel active users injected into our scenario increased, the system based on REST communication started to fail sooner than the one based on Kafka.

Charts: percentage of failures · mean response time · percentage of failed requests based on load

More than that, the Kafka implementation had better resource consumption, using as much CPU as possible while also delivering better results. The REST implementation’s CPU usage went up and down, pointing towards inconsistency; a possible cause is that some parts of the system were working while others were waiting, leading to transition windows. This has a direct implication on the costs associated with running a service capable of serving many customers.

CPU usage of the Azure VM resource while running the tests. The first half identifies the part when Kafka implementation was under test. The second half, after 3:00 PM, shows the CPU usage of the REST implementation.

What if the user is as fast as a robot?

One interesting result popped up as soon as we removed an arbitrary pause between creating an order and asking for its status.

ScenarioBuilder scn = scenario("Place order and checks until delivery ends")
        .pause(1, 5)
        .exec(UserInteractions.createOrderRequest)
        .exitHereIfFailed()
        // .pause(1)                         // the removed one-second think time
        .exec(UserInteractions.monitorOrderRequest)
        .exitHereIfFailed()
        .pause(5, 10)
        .exec(UserInteractions.approvePaymentRequest)
        .exitHereIfFailed()
        .pause(2)
        .asLongAsDuring(UserInteractions.orderNotFinalized, ofSeconds(80)).on(
                exec(UserInteractions.monitorOrderRequest
                        .check(jsonPath("$.orderDetails.orderStatus").saveAs("orderStatus"))
                ).pause(2, 15)
        );

The above scenario is identical to the one initially tested, except that the one-second pause is commented out. Surprisingly or not, this immediately produced a 2% failure rate for the monitor-order interactions in the Kafka implementation, all of the failures being caused by the order not being found. This is due to the delay between the create-order event reaching the bus and being consumed by the other services. With REST communication, there were zero failures at low levels of load.
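A common client-side mitigation for this propagation delay is to retry the read a few times before declaring failure. A hedged sketch, not part of the repository’s test code, with the endpoint and retry budget invented for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RetryingOrderReader {
    private static final HttpClient client = HttpClient.newHttpClient();

    // Poll the order endpoint until the order appears or the retry budget runs out,
    // trading a bounded wait for tolerance to eventual consistency.
    static String readOrder(String url, int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                return response.body();
            }
            Thread.sleep(200L * attempt); // simple linear backoff
        }
        throw new IllegalStateException("order not visible after " + maxAttempts + " attempts");
    }
}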

The effect grows bigger as we increase the load, as can easily be spotted in the figure below. The differences between the two implementations are smaller compared to the initial assessment, and, more than that, even for low loads the Kafka implementation never reached a 0% failure rate.

percentage of failures when the 1-second pause is commented out

Having the above scenario in mind, one important aspect can be observed: a system relying on the Kafka implementation is eventually consistent, but no one can guarantee when that consistency will be reached. In environments where words like predictability, automation, and testing come up often, we should think twice before moving toward an event-bus architecture.

And more than that, an event-driven architecture puts an extra critical component in the middle. If the event bus is not robust and fault tolerant, we can find ourselves in a situation where not even a single service is able to do its job.

Thanks for reading about my experiment, and feel free to stress-test the setup yourself in unique and imaginative ways! As mentioned, you can find everything on GitHub. 🫡


Madalin Giurca
ING Hubs Romania

Software engineer on a mission to learn and to improve the performance and security of any application or architecture. Also a big machine learning enthusiast.