Consistency Problems in A Microservice Architecture (Part III)

Wenbo Zong
7 min readJul 10, 2019

This post is final part of a three-part series on the consistency problems arising in a microservice architecture. Finally, we are talking about distributed transactions.

Distributed Transactions with Event Sourcing

Event Sourcing is a more sophisticated architecture than event notification. At the conceptual level, they are distinctly different in that the event producer does not care about subsequent actions in the event notification architecture, whereas it does in the event sourcing architecture. However, in terms of implementation, the difference is much finer and more blurred. A great introduction to event sourcing can be found here.

Handle Distributed Transactions with Event Sourcing

Consider the following high-level microservices architecture of an e-commerce system. In this example, suppose the sequence of actions are: create a pending order (Order service) ⇒ reserve stock (Inventory service) ⇒ make payment (Payment service) ⇒ conclude order (Order service) and asynchronously send notification (Notification service).

High-level system view of the e-commerce application

For the happy path, the sequence of events are as follows:

  1. Order service creates a new order, marked as “pending” state, and publishes an event named event_order_pending.
  2. Inventory service listens to event_order_pending, reserves the stock and publishes an event_stock_reserved.
  3. Payment service listens to event_stock_reserved, executes the payment logic and publishes an event_order_paid.
  4. Order service listens to event_order_paid, updates the order to “created” state and publishes an event_order_created. Then the Order service returns the result to the client.
  5. Notification service listens to event_order_created, sends a notification to the customer.
Event sequence of the happy path

Let’s see what happens if payment fails:

  1. Payment service publishes an event_order_payment_failed.
  2. Both Order service and Inventory service pick up event_order_payment_failed and

a. Order service marks the order as “failed”.

b. Inventory service releases the reserved stock.

Event sequence of one unhappy path

Pros and Cons

One striking feature of the event sourcing pattern is that there is lots of tangling between the services through events. Remember we mentioned in the beginning of this essay that events are also part of a service’s interface? That means these services are closely coupled in terms of business logic, and there could be cyclic dependencies if not designed carefully. This coupling adds one more item to the already long list of complexities of event sourcing (quoted from Martin Fowler’s presentation):

  • Event schema
  • Unfamiliar
  • Asynchrony
  • Versioning

All these complexities would make the design more difficult, and testing and troubleshooting more challenging.

Nonetheless, if we do want to handle transactions with event sourcing, we need to apply the local transaction technique discussed in the Section Workflow Consistency in EDA to ensure atomicity of local update and messaging.

When is ES suitable?

  • When you need an audit log, e.g. accounting application
  • When you high performance and scalability

Distributed Transactions with A Central Coordinator

We will use the same e-commerce example described in the previous section. The basic idea is to have the Order service function as a coordinator for the order creation task, hence the high-level architecture diagram looks like this:

High-level architecture of the e-commerce application

The flow for a successfully created order is as follows:

Sequence diagram of the happy path

The flow for a failed order due to payment failure is as follows:

Sequence diagram of one unhappy path

In this scenario, we need to roll back the reserved stock when the payment fails. Generally there is only one happy path and many different failure scenarios, and the Order service, as the coordinator, must handle all scenarios correctly. One practical way is to implement a state machine in the Order service, and the Order service must persist the state machine after each operation so that a failed transaction can be recovered forward or rolled back. One possible design of such a state transition is illustrated below:

State machine for the coordinator

A few elements must be in place to enable the implementation of this state machine:

  • There must be a unique transaction id that is used across the Order service, Payment service and Inventory service.
  • The Payment and Inventory services must be idempotent, so that the Order service can retry the same operation without worrying about unwanted side effects.
  • For each operation, the Payment and Inventory services should provide a corresponding compensation operation to roll back the previous operation. Again, idempotency must be supported in the compensation operation.
  • A cron job is needed inside the Order service (i.e. the coordinator) to repair failed transactions. The cron job must avoid race conditions with the normal execution of the state machine.

As you can see, even though the idea is simple, it actually requires lots of effort to implement it right, with strict semantics requirements imposed on the Inventory and Payment services.

Compare with the Saga Pattern

Earlier I mentioned that the way I describe as a workflow or a transaction is actually the same as the Saga pattern. Even though I didn’t intend to describe the Saga patterns (plenty of online materials and a few references are given at the end of the essay), a quick comparison may help to put things into perspective.

There are two popular implementations of the Saga pattern:

  • Event/Choreography: There is no central coordinator, and each service listens and publishes events and act upon events.
  • Command/Orchestration: There is a central coordinator to manage the sequence of actions at one place.

Clearly, the event sourcing approach described above is the same as the Saga choreography pattern. However, there is one important difference between my coordinator approach and the Saga orchestration pattern: I have used synchronous request/response RPCs instead of messaging. This communication mechanism difference actually has significant consequences, as summarised below:

Central coordinator with difference communication mechansims

Therefore, it may be a good idea to start with RPC but design a clear separation of the interface adaptation layer and the logic layer, so that RPC can be replaced with messaging later with minimum effort.

Comparing the Solutions

The Differences

Event Sourcing

  • Asynchronous
  • Performant
  • Less familiar programming style
  • Business logic is distributed, and more coupling between services
  • Possibly higher latency due to the message broker
  • Difficult to add new step in a transaction
  • Difficult to reason about
  • Risk of cyclic dependency
  • More difficult to test and troubleshoot

Central Coordinator with RPC

  • Synchronous
  • More familiar programming style
  • Centralised logic, and loose coupling between services
  • Participant is simple
  • Lower latency
  • Easier to add a new step
  • No cyclic dependency
  • Easier to reason about
  • Easier to test and troubleshoot
  • Less performant
  • Coordinator must implement the state machine carefully

The Commonalities

  • Both approaches implement a state machine to fulfill a distributed transaction. The event sourcing approach implements the state machine implicitly, which is done collectively by all involved services. In contrast, the coordinator approach implements the state machine explicitly.
  • Individual services must provide a compensation operation for each (forward) operation
  • Individual services must provide idempotent semantics
  • Only support eventual consistency, as it’s very difficult to support isolation between transactions hence must tolerate in-flight inconsistency

Wrap Up

Handling distributed transactions is not easy. I hope this essay sets you up in a position to understand the various issues you need to consider when designing amicroservice architecture.

If I am to choose between the event sourcing vs. coordination approach for handling distributed transactions in microservices, I would prefer the coordination. I would only consider the event sourcing if it is supported out-of-box by some framework. Even so, I may still prefer the coordination approach due to its simplicity.

Although at a high-level, the solutions presented seem straightforward, it calls for careful design to actually implement everything right. I have been skimming on the details of idempotency. I have also completely ignored the concurrency issues that may arise from distributed transactions.

Concurrency will become an issue when we scale out the microservices horizontally, i.e. deploying multiple instances. Imagine that the coordinator sends/publishes the command/event to instance A and later sends/publishes the rollback command/event to instance B. It is likely that instance B may attempt to execute the compensation before instance A even starts to execute the (forward) operation. The result might be the compensation does nothing while the (forward) operation gets executed, which is not what we want. This could happen more often with asynchronous messaging, but it can also happen with synchronous RPC in the event of network partitioning (timeout).

I hope I will find time to write about idempotency and concurrency issues, so please check back later :-)

Part II: https://medium.com/@zongwb/distributed-transactions-in-a-microservice-architecture-b4d6494de59e

References

BASE model. https://queue.acm.org/detail.cfm?id=1394128

Martin Fowler. The Many Meanings of Event-Driven Architecture https://www.youtube.com/watch?v=STKCRSUsyP0

Chris Richardson. Using Saga Patterns to Maintain Data Consistency in a Microservice Architecture. https://www.youtube.com/watch?v=YPbGW3Fnmbc

Randy Shoup. Managing Data in Microservices. https://www.youtube.com/watch?v=E8-e-3fRHBw

https://blog.couchbase.com/saga-pattern-implement-business-transactions-using-microservices-part/

https://blog.couchbase.com/saga-pattern-implement-business-transactions-using-microservices-part-2/

https://www.nginx.com/blog/event-driven-data-management-microservices/

https://en.wikipedia.org/wiki/Two-phase_commit_protocol

https://en.wikipedia.org/wiki/Consistency_model

https://martinfowler.com/articles/microservices.html

https://martinfowler.com/eaaDev/EventSourcing.html

https://github.com/cer/event-sourcing-examples/wiki

https://github.com/cer/event-sourcing-examples/wiki/DeveloperGuide

https://medium.com/@so3da/transactions-and-failover-using-saga-pattern-in-microservices-architecture-baf5a13111c9

--

--