Case Study: A DIY Saga Pattern Implementation

Published in

Geek Culture

7 min readMay 14, 2021

Photo by Jorge Fernández Salas on Unsplash

Working on Event-Driven Architectures is one of the most satisfying experiences for software engineers: it’s stunning to see how many simple actions collaborate to form emerging behaviors and make the magic happen. Saga pattern is an elegant way to design long-term, distributed transactions in a BASE fashion, fostering scalability and resilience in distributed, reactive systems that can afford huge workloads, still providing a good user experience

In this article, we’ll see how to implement the saga pattern without using specialized frameworks (like Axon, Eventuate Tram, etc, which are mainly used in the context of CQRS/ES/CDC implementations). Our goal here is to improve the understanding of the mechanism and see how it can be achieved even using a “vanilla” stack, for example with the popular Spring Boot/Spring Data. For the implementation, I’ve used Kotlin because it’s a nice chance to keep practicing with this modern and elegant JVM language 😊

NOTE: through this article, it will be assumed that the reader has a clue of what Saga Pattern is and how it works: there are a lot of articles and papers which explain it comprehensively. Despite that, if you’ll enjoy this story and you want to read from me an in-depth description of the pattern and the fundamentals of distributed systems, just leave a comment here!

TL;DR
here is the code : https://github.com/cingaldi/sagapattern

Analysis Phase

Requirements

Let me introduce our use case:

We are working on a trip management platform: our customers can create Trips where they can book flights along with hotels. We can handle both these bookings using codes that will be used by 3rd party services. In case the flight and the hotel room are successfully reserved, we notify the customer that the trip is confirmed. In case the trip confirmation fails, we notify the customer

First things to notice:

The trip can be confirmed once flight and hotel reservations are. Both steps are assumed to happen asynchronously. Therefore, they can give us an outcome in seconds, hours, or never
For the sake of system performance, all the interaction should be executed in a non-blocking fashion
If for some reason, a flight or hotel cannot be reserved, we should revert the entire operation. So every booking should be canceled (look here at how emerges the concept of eventual consistency)

Modeling these requirements with a micro-Event Storming, this is what we are trying to achieve

Take a look at the purple sticky note there: this is called Policy and indicates that in our workflow there is a step that requires some decisions to be taken, potentially outside our system. In this case, we can summarize it summarized as

whenever flight and hotel are booked then confirm the trip
whenever a flight or hotel are canceled then abort the trip

Usually, a policy translates into a saga

Architecture

Now imagine we are working in architecture like this

We see that we are ready to handle the workflow asynchronouly and provide a consistent façade towards external services. The whole transaction will be coordinated by TripService which exposes a REST API to create a trip

Building Blocks of the Saga

The Saga Manager

This is the entry point of the Saga and acts as a glue code: In a nutshell, it implements a listener for each event involved with the Saga and decides what command to dispatch after which event. Spring framework offers an event bus out of the box, so we won’t struggle a lot on this detail.
Notice that the Saga Manager belongs to the application layer: it doesn’t implement real domain logic, it takes only care to choreograph the flow

So why couldn’t just add a bunch of event listeners to the TripService? The answer of course is SRP, and pointing out that an application service wraps an aggregate root while the Saga Manager has not just to deal with one aggregate root

Saga state and Saga Repository

The saga state (or, for the sake of simplicity, just saga) is what sits at the core and is a domain concept, so — in general — we can find it in our ubiquitous language with a name that represents a Status, a Progress, a Process or something similar. It represents a part of the application state, so we can fairly consider it as a Domain Model object, just like entities, aggregate roots, or value objects. It’s persisted so it will have an associated saga repository. What we expect from the Saga is to

Make data associations Since saga reacts to events coming from disparate aggregates, we need to bind the saga with each of the aggregates that participate in the choreography. We can see in our code that hotelCode, flightCode, tripId uniquely identify the saga. For this reason, the saga repository defines query methods to fetch status associated with them
Keep the actual status The saga can persist whatever you find relevant to make decisions on each step. Notice that we are defining methods that contain that state transition logic, it’s a good OOP habit to encapsulate object state and keep behavior close to data
Decide what’s next every time the saga status changes, some commands can be dispatched, this is how the saga coordinates the involved components. Thus, every method decides what command to apply next

The Command Façade

Ok, this isn’t matter of life or death, but if you want a nice separation of concerns, an expressive code, and higher extendability, this pattern is just for you. In a nutshell, we are implementing the concept of command as a message that flows into a bus and matches the correct handler which, in turn, will apply that command over an aggregate. If this seems similar to what an ApplicationService/UseCase do, you’re not wrong: it is but in a more decoupled way

There are few ways to implement this pattern, the easiest is to provide a proper façade object which will wrap dependencies on needed services and exposes as many functions override as more command objects we have. The most complex way can use reflection and annotations to get the matching between the command and the correct handler. Let’s just stick with a halfway solution that provides enough flexibility and uses the sweet Kotlin’s syntactic sugar

Handling Deadlines

Don’t forget that a saga is meant to handle asynchronous processes. And if there is a thing that all asynchronous things do, is to not show up when you need them. The Hotel server can crash, the flight company can run out of business, so sometimes we need to wrap up somehow

For this reason, saga can come with deadlines. A deadline is a particular event listener that is triggered by a time-based event. It can be triggered after some time after the saga starts. In both cases, despite the solution looks trivial at the SagaManager level, its implementation depends on the available infrastructure tools

You can just use a cronjob or an external trigger. Furthermore, some message-oriented middlewares (like Amazon SQS or, with some hacks, RabbitMQ) provide delayed message features that let you dispatch a command at saga start, like “after 10 minutes, trigger a timeout event”. At this point, we will make our saga gracefully fail

In our case, we will just adopt a quick-and-dirty solution using Kotlin coroutines. Now it’s just a matter to extend the code to persist the scheduled tasks and serve them robustly. Moreover, like optimization, it would be nice to provide the saga manager with ways to locate the task and cancel it in case of successful completion

Testing the Saga

Usually, we can get how much code is good from how tests look like. It’s important to have a simple way to test the entire saga, defining all the possible paths the workflow can follow and asserting that the outcome is the expected. After working with Axon Framework, I’ve felt in love with their fluent, behavior-driven testing tools. Basically here we want to test (also) the SagaManager, and the test should be like:

given(SomePriorEvents)
when(CertainEventTriggered)
then(DispatchedThisCommand)

Of course, asserting against CommandDispatcher helps a lot. Moreover, to keep the state across the step invocations, we can use a fake SagaRepository which holds an actual instance of the saga. This can increase test confidence concerning a solution where we use stubbing to test individual steps

What’s missing?

We have only implemented the happy path, but what happens when either hotel or flight is not confirmed?
We are not really managing internal failures. How to overcome exceptions due to concurrent writes? What should happen if the message broker fails to distribute tasks to Hotel/Flight services?
A flight (or a hotel as well) can be canceled even after the confirmation, and this can trigger refund policies for the customer. How can we extend this saga to fulfill this part of the process?

Conclusion

In this article, we have implemented a small application that choreographs a distributed transaction thanks to the Saga Pattern. The use case was fairly simple and the implementation quite far from being production-grade. Nonetheless, it was fun for me to develop it and also gave some little challenges. Hoping that it was the same for the reader, I’ll conclude the story with a hint: if you want to know something better, just code it from scratch!