Business process visibility & event-driven services

Hussein Moussa
Dec 29, 2019 · 11 min read


Modern microservices architectures are event-driven, reactive, and choreographed, especially with the rise of event streaming platforms like Kafka. Choreographed services bring many advantages and solve many of the issues of orchestrated services. For more details, please check Why event-driven architectures are important today. In this story, I will focus on the challenges of using event-driven architecture in applications built around a business process workflow.

Let’s imagine we have a simplified version of an order management system where customers can place orders and the system invoices the customer, reserves the stock, and finally ships the order to the customer. Before we start, let’s quickly refresh our knowledge of the orchestration and choreography patterns.

Orchestration

The microservice orchestration approach relies on a centralized service: it calls one service and waits for the response before calling the next one, following a request/response paradigm. The following diagram shows how our simplified order management system could be designed with orchestration.

Order management with orchestrated services
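To make the contrast concrete, here is a minimal sketch of what such an orchestrator might look like. The client interfaces and method names are hypothetical placeholders, not code from the prototype:

```java
// A minimal sketch of the orchestration pattern (hypothetical names): a
// central service calls each step and waits for its response before moving on.
public class OrderOrchestrator {

    interface PaymentClient  { void invoice(String orderId); }
    interface StockClient    { void reserve(String orderId); }
    interface ShippingClient { void ship(String orderId); }

    private final PaymentClient payment;
    private final StockClient stock;
    private final ShippingClient shipping;

    public OrderOrchestrator(PaymentClient payment, StockClient stock, ShippingClient shipping) {
        this.payment = payment;
        this.stock = stock;
        this.shipping = shipping;
    }

    // The orchestrator owns the end-to-end flow: request/response, step by step.
    public void placeOrder(String orderId) {
        payment.invoice(orderId);   // step 1: invoice the customer
        stock.reserve(orderId);     // step 2: reserve the stock
        shipping.ship(orderId);     // step 3: ship the order
    }
}
```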

Choreography

Microservices need to communicate to form a larger, useful business process. In the point-to-point orchestrator pattern, there is a central brain that drives the process and delegates actions to other microservices by sending commands in a request/response form.

On the other hand, the choreography pattern is an event-driven architecture pattern applied to microservices. Instead of having a central orchestrator that controls which steps should be followed from start to end, each service knows ahead of time what to react to and how, like a marching band performing its big number after months of practice. Services use an event stream for asynchronous communication. Multiple services can consume the same event, do some processing, and then publish the result back into the event stream in parallel. The event stream does not contain any logic and is intended to be a dumb pipe. For more details, check Microservices — When to React Vs. Orchestrate. The following diagram shows how choreographed services interact using an event streaming platform. We can easily add more services that consume the same event stream to support more business features.

Choreography services — order management
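As a sketch of what one choreographed step could look like with spring-kafka (the topic names and payloads are assumptions): stock-service reacts to an event and publishes its outcome, without anyone telling it what comes next.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// A sketch of a choreographed step (hypothetical topic names): the service
// only knows which event it reacts to and which event it publishes back.
@Component
public class StockReservationListener {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public StockReservationListener(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "order-placed")
    public void onOrderPlaced(ConsumerRecord<String, String> record) {
        // ... reserve the stock for all order items ...
        // publish the outcome back into the event stream, keyed by orderId
        kafkaTemplate.send("order-stock-reserved", record.key(), record.value());
    }
}
```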

Business process & events

A business process can be implemented as a set of workflow steps. With event-driven architecture, there is a starting point where the flow begins (in our prototype, this is an order-service REST endpoint to place the order). Each workflow step is a choreographed service which performs some functionality (like stock-service in our prototype, which reserves the stock for all order items). When the order goes through all the steps to the end, it becomes complete. As a result of this architecture, there is no end-to-end visibility of the business process, because the logic is distributed among these independently working services. The end-to-end process is hidden behind the events that flow between services. The following diagram shows the flow of events between services forming a simplified order management workflow. Each service reacts to the events it is interested in and publishes the outcome back as another type of event. If an order is stuck in any step of the workflow, then all remaining steps will not be executed.

Order management workflow
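For illustration, the starting point could look like the following sketch; the endpoint, topic name, and payload shape are hypothetical, not the prototype’s actual code:

```java
import java.util.UUID;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// A sketch of the workflow's starting point (hypothetical names): the
// order-service REST endpoint publishes the order-placed event that kicks
// off the choreography.
@RestController
public class OrderController {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public OrderController(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/orders")
    public String placeOrder(@RequestBody String orderJson) {
        String orderId = UUID.randomUUID().toString();
        kafkaTemplate.send("order-placed", orderId, orderJson);
        return orderId; // the caller can use this id to track the order
    }
}
```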

Event-driven architecture comes with some challenges. In the following sections, I will focus on the challenges related to using choreographed services in a business process workflow.

Challenge 1: End-to-end flow visibility

As Martin Fowler mentioned in his blog post “What do you mean by Event-Driven?”:

The danger is that it’s very easy to make nicely decoupled systems with event notification, without realizing that you’re losing sight of that larger-scale flow, and thus set yourself up for trouble in future years. The pattern is still very useful, but you have to be careful of the trap.

What we need here is a way to draw, in real time, the flow that orders go through. Any code change to the workflow (adding, removing, or shuffling services in the flow) should be reflected immediately in this tool.

Challenge 2: Service Level Agreement(SLA) of workflow steps

Orders move through many services, and there is no central orchestrator with knowledge of the in-flight orders. We need to make sure orders are not stuck in any workflow step, and if they are, we need to let the other workflow services know. This means that each service should track the orders it is currently processing, especially if it needs to do some asynchronous work. For example, stock-service receives the order-placed event to reserve the stock of all order items. To do this job, stock-service needs to integrate with a legacy system in an asynchronous request/response fashion. As you know, there are many reasons for such a request to fail, so we need to make sure that stock-service can detect this scenario and time out this order-placed instance (by publishing order-expired, for example).
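A minimal sketch of what such step-level tracking could look like, assuming a spring-kafka setup with scheduling enabled; the SLA, topic names, and class names are hypothetical:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// A sketch (hypothetical names and SLA) of step-level tracking inside
// stock-service: remember when each order started, and time out the ones
// still waiting on the legacy system past the step's SLA.
@Component
public class InFlightOrderTracker {

    private static final Duration STEP_SLA = Duration.ofMinutes(5); // assumed SLA

    private final Map<String, Instant> inFlight = new ConcurrentHashMap<>();
    private final KafkaTemplate<String, String> kafkaTemplate;

    public InFlightOrderTracker(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void started(String orderId)  { inFlight.put(orderId, Instant.now()); }

    public void finished(String orderId) { inFlight.remove(orderId); }

    // Runs periodically; any order still in flight past the SLA is timed out
    // by publishing an order-expired event for the other workflow services.
    @Scheduled(fixedDelay = 30_000)
    public void expireStaleOrders() {
        Instant cutoff = Instant.now().minus(STEP_SLA);
        inFlight.forEach((orderId, startedAt) -> {
            if (startedAt.isBefore(cutoff) && inFlight.remove(orderId) != null) {
                kafkaTemplate.send("order-expired", orderId, orderId);
            }
        });
    }
}
```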

Challenge 3: SLA of end-to-end flow

Even if each workflow step is designed properly to track orders and time out when necessary, we still need to track the end-to-end flow of processing orders to make sure they are completely processed and not stuck in any step. Let’s imagine that a workflow service is lagging and unable to catch up, or is even down for some reason. This would impact the whole end-to-end processing flow, as orders would be stuck in the struggling service. To solve this issue, we also need to track the orders going through the workflow from beginning to end, and time out when they do not make it.

Challenges — workflow & SAGA pattern

With distributed systems built from choreographed services, there might be a need to support distributed transactions (the SAGA pattern) to compensate some actions in case of failure. Let’s imagine the following scenario happens:

  • A customer placed a new order
  • Payment service processed this order, created an invoice, and deducted the total amount from the customer’s account.
  • Stock service received the order to reserve the stock of all items. For some reason, the service could not complete the processing, so it decided to timeout the processing of this order.

In this case, the payment service should roll back what it has done for this order. The same concept applies to all services in the workflow. This means that whatever approach we choose to support workflow step tracking, or even end-to-end tracking, must be real-time and reliable enough to be part of the business process.
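For illustration, a compensating action could be as simple as the following sketch, assuming hypothetical topic names and a placeholder refund interface:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// A sketch (hypothetical names) of a compensating action in the SAGA: when an
// order expires mid-workflow, payment-service rolls back the deduction.
@Component
public class PaymentCompensationListener {

    interface Payments { void refund(String orderId); } // placeholder for real logic

    private final Payments payments;

    public PaymentCompensationListener(Payments payments) {
        this.payments = payments;
    }

    @KafkaListener(topics = "order-expired")
    public void onOrderExpired(ConsumerRecord<String, String> record) {
        // refund the amount deducted when the order was invoiced
        payments.refund(record.key());
    }
}
```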

Maybe Distributed tracing?

The first tool I thought about to tackle these challenges was a distributed tracing tool like Zipkin. Such tools are very useful for tracking execution flow in distributed systems. In the blog post Distributed tracing for Kafka based applications, Jorge Quilcate summarised the value of using a tool like Zipkin to achieve this kind of visibility. We can debug the flow of a specific order across all the services involved in a distributed workflow, see how long it spends in each step, and much more.

However, this does not answer some of our original questions. Zipkin is a passive tool for understanding what happened, perhaps to log or alert on it; it is not much help for real-time business decisions like expiring orders. Because of that, we needed to find another solution where we could track orders while they are being processed, and react when they are stuck.

What about operational events?

The idea is to use events to support such requirements; call them operational events, business-tracking events, … or even X events. Each service in the workflow reports what it has done to our bounded context entity (in our prototype this is Order), which moves this entity between workflow steps until the end. Basically, it is using events to track the flow of business events. Looks weird, doesn’t it? Andrew Jones and Inny So gave a great talk at Kafka Summit which might be related to this idea, Observability for everyone. For this solution to work properly, some key design requirements have to be considered:

  • Publishing the operational event should happen in the same Kafka transaction as the original business events that the service usually publishes.
  • Operational events are published to the same Kafka topic, and because they all relate to the same bounded context, they should use the same key (OrderId in our prototype). Using the same key gives the strong per-key ordering guarantee naturally supported by Kafka.
Publishing the operational event and other business events

The following code snippets show how operational events are published using a Kafka producer (with spring-kafka) and Kafka Streams (with spring-cloud-stream).

Publishing operational event with spring-kafka
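A sketch of what this could look like with spring-kafka; the topic names and payloads are assumptions, and the producer must be configured with a transaction-id-prefix for transactions to work:

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// A sketch (hypothetical topic names): both events are sent in one Kafka
// transaction, so consumers with read_committed isolation see both or neither.
@Component
public class StockEventsPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public StockEventsPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishStockReserved(String orderId, String businessEvent, String operationalEvent) {
        kafkaTemplate.executeInTransaction(ops -> {
            // the business event that other workflow services react to
            ops.send("order-stock-reserved", orderId, businessEvent);
            // the operational event on the shared tracking topic, same key
            ops.send("order-operational-events", orderId, operationalEvent);
            return null;
        });
    }
}
```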
Publishing operational event with spring-cloud-stream
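A sketch using the spring-cloud-stream Kafka Streams binder; the function name, topic bindings, and payloads are hypothetical. With processing.guarantee=exactly_once, both writes share one transaction:

```java
import java.util.function.Function;
import org.apache.kafka.streams.kstream.KStream;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// A sketch (hypothetical names): the stream writes the business event to its
// bound output topic and forwards an operational event to the shared topic.
@Configuration
public class StockStreamConfig {

    @Bean
    public Function<KStream<String, String>, KStream<String, String>> reserveStock() {
        return orders -> {
            KStream<String, String> reserved =
                    orders.mapValues(order -> reserve(order)); // business processing

            // side output: operational event, same key (orderId), shared topic
            reserved.mapValues(v -> "{\"step\":\"stock-reserved\"}")
                    .to("order-operational-events");

            return reserved; // bound to the business output topic
        };
    }

    private String reserve(String order) {
        // ... reserve the stock for all order items ...
        return order;
    }
}
```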

Operational events & business process

Let’s come back to our original problems: how can these operational events help us answer the original questions? From what we have described so far, we have a set of events published in the same order as the workflow steps executed, all stored in the same topic. This means we can consume this stream of events to track the actual flow of orders in real time, and we could easily build a UI tool to draw the workflow steps of any placed order. The following KsqlDB query lists the accumulated workflow steps of placed orders in real time, in exactly the same order of execution; the video below shows how the orders progress through the workflow.
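A sketch of such a query, assuming a stream ORDER_OPERATIONAL_EVENTS registered over the shared topic with ORDERID and STEP columns (names are assumptions, not the prototype’s actual schema):

```sql
-- Accumulate each order's workflow steps, in execution order, as they happen.
SELECT ORDERID,
       COLLECT_LIST(STEP) AS WORKFLOW_STEPS
FROM ORDER_OPERATIONAL_EVENTS
GROUP BY ORDERID
EMIT CHANGES;
```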

Placed orders progressing in the workflow (real-time tracking)

Moreover, the same operational event stream can be used to track the SLA of each workflow step, and even the SLA of the end-to-end workflow. The following diagram shows how order-tracking-service (a separately deployable service) tracks in-flight orders; if an order gets stuck in any step, an event of type OrderExpired is published. In fact, the code to implement this feature is very straightforward; for more details, please check the order tracking service code. A transformer keeps the in-flight orders in a Kafka local state store for as long as they are not completed.

Order workflow tracking
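A simplified sketch of how such a transformer could work; the store name, step names, and SLA are assumptions, not the prototype’s actual code:

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// A simplified sketch (hypothetical names): in-flight orders live in a local
// state store, and a wall-clock punctuator expires the ones that exceed the
// end-to-end SLA.
public class OrderTrackingTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private static final long SLA_MS = Duration.ofMinutes(30).toMillis(); // assumed SLA

    private KeyValueStore<String, Long> inFlightStore;
    private ProcessorContext context;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // the store must be registered with the topology under this name
        this.inFlightStore = (KeyValueStore<String, Long>) context.getStateStore("in-flight-orders");
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, this::expireOverdueOrders);
    }

    @Override
    public KeyValue<String, String> transform(String orderId, String step) {
        if ("order-shipped".equals(step)) {          // hypothetical terminal step
            inFlightStore.delete(orderId);
            return KeyValue.pair(orderId, "order-completed");
        }
        if (inFlightStore.get(orderId) == null) {
            inFlightStore.put(orderId, System.currentTimeMillis()); // first step seen
        }
        return null; // intermediate steps produce no downstream event
    }

    private void expireOverdueOrders(long now) {
        try (KeyValueIterator<String, Long> it = inFlightStore.all()) {
            while (it.hasNext()) {
                KeyValue<String, Long> entry = it.next();
                if (now - entry.value > SLA_MS) {
                    inFlightStore.delete(entry.key);
                    context.forward(entry.key, "order-expired");
                }
            }
        }
    }

    @Override
    public void close() {
    }
}
```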

Using this concept, we can think of many use cases that could easily be supported:

  • Support different SLAs for each workflow step. In the prototype, each step is configured with a different SLA duration.
  • Implement the SAGA pattern by running the workflow services in compensation mode for expired orders. For example, if an order is stuck in shipping-service, then stock-service should be able to cancel the reserved stock of that order when the OrderExpired event is published.
  • Handle different corner cases in a more elegant way. For example, suppose an order was stuck in shipping-service for a long time, so order-tracking-service decided to expire it. If shipping-service later wakes up and processes this order, it will ignore it because the order has already expired. Even in the worst-case scenario, where shipping-service publishes its result for that expired order, order-tracking-service should be able to pick this up and not publish a completed event for the expired order; instead, it could alert for manual intervention, for example.

The same stream of events can also be used to drive useful real-time metrics (a minimal monitoring sketch follows the list), such as:

  • How long orders take in each workflow step, to derive the average processing time in real time.
  • How many in-flight orders are being processed in each workflow step at any point in time.
  • The rate of expired orders, and in which workflow step orders are struggling, at any point in time.
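A sketch of how such metrics could be exposed, assuming hypothetical topic, payload, and metric names, with Micrometer backing the Prometheus endpoint:

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// A sketch (hypothetical names): the monitoring service consumes operational
// events and exposes per-step counters to Prometheus via Micrometer.
@Component
public class WorkflowMetricsListener {

    private final MeterRegistry registry;

    public WorkflowMetricsListener(MeterRegistry registry) {
        this.registry = registry;
    }

    @KafkaListener(topics = "order-operational-events")
    public void onOperationalEvent(ConsumerRecord<String, String> record) {
        String step = record.value(); // assume the payload names the workflow step
        registry.counter("orders_in_step_total", "step", step).increment();
    }
}
```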

Prototype source code

The source code of the prototype can be found in this repo. The code contains the following services:

  • Order service: this is the REST API (front door) where users can place their orders.
  • Payment service: it consumes the order-placed event to invoice the customer.
  • Stock service: it consumes the order-placed event to reserve the stock for all order items.
  • Shipping service: it consumes the events published by the stock service to ship the reserved stock to the customer. Future work on this service will support shipping individual order items as soon as they are ready. When all order items are shipped, this service publishes an order-shipped event.
  • Order tracking service: it tracks in-flight orders to mark them as completed by publishing an order-completed event, or as expired by publishing order-expired.
  • Order monitoring service: it consumes multiple events alongside the workflow step completed events to publish metrics to Prometheus.

Operational events challenges

As we know, any solution has its challenges, and this approach is no exception. There are some challenges to adopting it:

  • Overhead of publishing extra events. Developers need to make sure they publish an operational event with the correct data whenever they implement a new choreographed workflow service. The code could be wrapped in a utility library, for example, but developers still need to add a few extra lines of code.
  • Overhead of long-lasting workflows. In scenarios where the workflow lifetime is very long (for example, if order processing time is measured in days), there might be disk and memory overhead in tracking the in-flight orders, especially with millions of orders a day.
  • There might be a single point of failure when moving the logic of tracking the processing time of each workflow step, as well as the end-to-end workflow, into a common service (in our prototype, order-tracking-service does this job). If this service is lagging or not working at all, the whole tracking functionality stops working, compared to implementing the tracking logic in each and every workflow step. This means that this kind of tracking service should be monitored properly, just like all the other business workflow services.
  • It might be challenging to accurately draw the flow of parallel workflow steps, as the order of operational events depends on which step finishes and commits its Kafka transaction first.

Live demo

In this video, I send some orders to demonstrate the idea of operational events:

  • The order workflow services are working together to handle placed orders.
  • The order-tracking-service is working in the background to track in-flight orders, marking them as completed or expired depending on the case.
  • The order-monitoring-service is working in the background, consuming the published events and KsqlDB aggregation results to publish different metrics to Prometheus.

Resources

  • What do you mean by “Event-Driven”? by Martin Fowler
  • Why event-driven architectures are important today
  • Microservices — When to React Vs. Orchestrate
  • Distributed tracing for Kafka based applications by Jorge Quilcate
  • Observability for everyone by Andrew Jones and Inny So (Kafka Summit talk)