Is there a better way to tell a Serverless Saga?
Setting the Stage
My previous post was about Consistent Modeling of Serverless long-running background tasks. There, the following Process Model diagram of a social network crawler was presented:
In such an architecture, the whole process of collecting data about a particular MakirOto system user is orchestrated by an AWS Step Function. This type of functionality goes by many names, such as distributed transaction, workflow, Saga or Process Manager, to name a few.
Yan Cui published a very good article about using AWS Step Functions and AWS Lambda for implementing the Saga Pattern. It is highly recommended reading for understanding how these three concepts fit together.
Here is a Process Model of the MakirOto Data Collector component:
This diagram is a good example of the fact that the expressive power of any consistent graphical notation is severely limited: the more clearly we want to say something, the less we could/should say.
As in all previous examples, this diagram depicts very accurately reference pointers and access permissions between computation processes and resources. It does not explain, however, what really happens here and, in particular, which computations run in parallel and which are sequential. Since the latter is the primary goal of the Process Model, we have a problem.
Let’s try to understand where the problem comes from and what we could do about it.
The main problem is rooted in the way humans interpret graphical representations. By looking at the following diagram, for example, almost everybody will automatically assume that all Lambda Functions are running in parallel:
Therefore, in addition to access pointers and permissions we get a pretty good intuitive understanding of the system concurrency — every API endpoint request will be served in parallel.
This, in turn, means that all Lambda Functions will need some form of synchronization, presumably provided by the Neptune Graph DB.
This diagram says nothing about internal concurrency of individual Lambda Functions. Since we do know that Lambda Functions follow the “instance per request” policy, we could assume that internal threads do not exist at all.
Even more, if some Lambda Function calls another Lambda Function, we will get a good understanding of where sequential processing is. So far, so good.
It’s important to stress it one more time: such intuitive interpretation is deeply rooted in how human brains process visuals and cannot be easily avoided.
Now, if we take a look, again, at a number of Lambda Functions orchestrated by a Step Function, we have a problem:
Here, we cannot automatically say what runs in parallel and what runs sequentially.
In fact, all Lambda Functions do run in parallel, but in a different sense than we would assume intuitively.
In the current notation, we have a clear separation between a service (e.g. AWS Step Functions Service) and an individual resource managed by this service (e.g. a specific Step Function). We still do not distinguish, however, between a Resource Class and an individual Resource Instance. A similar ambiguity exists between UML Class and Object diagrams.
We could assert that, as far as all instances of the same Step Function are concerned, all Lambda Functions invoked by these Step Function instances will be running in parallel. What happens within an individual instance of the Step Function, however, we cannot tell by looking at the Process Model diagram above. In other words, our understanding of real concurrency within the system is incomplete.
One could rightfully argue that the internal concurrency of an individual Step Function is directly reflected in its diagram. For example:
This is not completely wrong, but appealing to the AWS Step Functions diagram opens a new can of worms, full of semantic consistency problems.
Let’s start with a simple one — terminology. What AWS Step Functions service calls “state machine” is not a state machine at all, it’s an activity flow, usually modeled by a UML Activity Diagram.
To add insult to injury, the AWS Step Functions service calls activity steps “states”, and the only true state it has is called, guess what … right, “activity”.
Why AWS Step Functions architects did not bother to take a look at the UML specification and decided to introduce their own terminology thus messing up the whole business, is beyond my comprehension.
With an Adequate Modeling Tool It Gets Clearer
So, in order to understand internal concurrency of a Step Function, we need to model it with a UML Activity Diagram. Here you go:
In this case, all branches follow the same pattern:
Now, we could safely assert that Facebook Profile, Facebook Photo and LinkedIn Profile are processed in parallel, while inside every branch the steps are conducted sequentially.
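To make the parallel/sequential split concrete, here is a minimal sketch of how such a flow could be written in the Amazon States Language, built as a Python dictionary. The three branch names come from the text above; the step names (`Fetch`, `Parse`, `Store`), the state name `CollectUserData` and the placeholder ARNs are purely my assumptions for illustration, not the actual MakirOto definition.

```python
import json

# Hypothetical steps repeated in every branch (an assumption for illustration).
STEPS = ["Fetch", "Parse", "Store"]


def branch(source: str) -> dict:
    """Build one branch whose Task states run strictly one after another."""
    states = {}
    for i, step in enumerate(STEPS):
        state = {
            "Type": "Task",
            # Placeholder ARN, not a real Lambda Function.
            "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{step}{source}",
        }
        if i + 1 < len(STEPS):
            state["Next"] = f"{STEPS[i + 1]}{source}"  # sequential hand-over
        else:
            state["End"] = True  # last step closes the branch
        states[f"{step}{source}"] = state
    return {"StartAt": f"{STEPS[0]}{source}", "States": states}


SOURCES = ("FacebookProfile", "FacebookPhoto", "LinkedInProfile")

definition = {
    "StartAt": "CollectUserData",
    "States": {
        "CollectUserData": {
            "Type": "Parallel",  # branches run concurrently
            "Branches": [branch(source) for source in SOURCES],
            "End": True,
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

The `Parallel` state carries the concurrency, while the `Next` chain inside each branch carries the sequencing, which is exactly the information the Process Model diagram cannot show.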
Putting aside the question of whether this is the optimal solution and what an alternative would be, let’s notice that the UML Activity Diagram belongs to the Logical Model rather than to the Process Model.
Therefore, in order to truly understand the system concurrency, we need to look at two models and reconcile them in our heads. With a little bit of training this is not very hard to do. The main problem is a lack of awareness among practitioners.
Are Step Functions Good for Sagas?
Back to implementing distributed transactions with AWS Step Functions. As I argued in my recent presentation at PyCon IL 2018, there is a certain advantage in a clear separation between individual computations, implemented as Lambda Functions, and an orchestration layer dealing with concurrency and error handling, implemented as a Step Function.
As a part of my research I took a fairly extreme approach, denying any compound behavior whatsoever to individual Lambda Functions:
The argument goes that selecting a particular form of concurrency should be done by infrastructure rather than by developers. And maybe, one day, we will get to this Threadless Paradise.
The question I want to explore at the end of this post is whether AWS Step Functions, and the UML Activity Diagram for that matter, are proper tools for modeling distributed transactions (Sagas).
The answer is “unfortunately, not at large scale”. The main problem is that while AWS Step Functions may run multiple branches in parallel, individual Lambda Functions are invoked synchronously, unless one wants to mess with “activities”. True Sagas, on the other hand, are implemented asynchronously and modeled with UML Statecharts. Why? Because of scalability.
Within limits, the AWS Lambda Service is elastically scalable. This is not true, in the general case, for managed services.
For example, Amazon DynamoDB sets limits on read and write activity. When a limit is crossed, there is a penalty to be paid, either in the form of rejected requests or increased cost. Fully limitless computing is still more a dream than reality.
DynamoDB I/O capacity, of course, could be properly measured, tuned and reserved. All this, however, brings us back to the old IT world, just in fancy modern clothing. In particular, this approach does not deal well with temporary spikes.
On the other hand, if Lambda Functions were invoked asynchronously these spikes could be leveled off by the Lambda Function inbox internal queue. This, in turn, would allow pushing the reserved capacity levels down. But then, in order to proceed, the Saga needs to get an event back when a particular operation was completed. In other words, it has to behave as a true state machine.
For example, the famous travel booking Saga could be modeled as follows:
This is, of course, a criminally oversimplified model, produced just to illustrate the point. The real state transition model would be significantly more complex. But even at this oversimplified level we could identify some important differences:
- clear separation between operations performed locally and messages to external services, sent asynchronously
- states and transitions, triggered by events, as basic building blocks
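The two points above can be sketched as a small event-driven state machine. This is only an illustration under my own assumptions: the state names, event names and message names are hypothetical, and `send` stands in for whatever mechanism delivers a message asynchronously to an external service.

```python
from enum import Enum, auto


class State(Enum):
    BOOKING_HOTEL = auto()
    BOOKING_FLIGHT = auto()
    CANCELLING_HOTEL = auto()
    COMPLETED = auto()
    FAILED = auto()


# (current state, incoming event) -> (next state, message to send or None).
# All names here are hypothetical and chosen for illustration only.
TRANSITIONS = {
    (State.BOOKING_HOTEL, "hotel_booked"): (State.BOOKING_FLIGHT, "book_flight"),
    (State.BOOKING_HOTEL, "hotel_failed"): (State.FAILED, None),
    (State.BOOKING_FLIGHT, "flight_booked"): (State.COMPLETED, None),
    (State.BOOKING_FLIGHT, "flight_failed"): (State.CANCELLING_HOTEL, "cancel_hotel"),
    (State.CANCELLING_HOTEL, "hotel_cancelled"): (State.FAILED, None),
}


class TravelBookingSaga:
    def __init__(self, send):
        self.send = send  # callable that sends a message asynchronously
        self.state = State.BOOKING_HOTEL
        self.send("book_hotel")

    def on_event(self, event: str) -> None:
        """Advance the Saga; events that do not apply to the current state are ignored."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            return
        self.state, message = TRANSITIONS[key]
        if message is not None:
            self.send(message)
```

Note that the Saga never waits for a reply: it sends a message, records its state and returns. Progress happens only when an event comes back, which is what makes it a state machine rather than an activity flow.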
Ironically, while Serverless architecture proclaims itself as Event-Driven, there is currently no good infrastructural support for event-driven state-oriented programming. Here are some options available:
- Continue to use AWS Step Functions and model each state with an “activity”.
- Asynchronous communication between Lambda Functions, simulating the actor computing style.
- Asynchronous communication between Lambda Functions plus handling events coming from affected resources.
- Using a full functioning Event Store, as advocated by Event Sourcing proponents, and designing the whole system around it accordingly.
Let’s briefly analyse the last three options.
Asynchronous Saga Implementation with Actor-like Computing
This option might be implemented as follows:
Here, the Process Manager Lambda Function, responsible for implementing some particular Saga, keeps process states in persistent storage, say a DynamoDB table, and invokes relevant services asynchronously, passing its ARN and process identification as parameters.
Another group of Lambda Functions is responsible for implementing particular services, say Hotel Booking, Flight Booking, etc. Usually, but not always, this involves encapsulating some managed service resource, such as another DynamoDB table. When a Service Lambda Function finishes handling the service request, it sends a success/failure message back to the Process Manager using the function ARN and process ID obtained as parameters. This is conceptually very similar to how typical actor-computing systems, such as Erlang or Akka, work.
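The message exchange described above can be sketched as follows. This is a simplified illustration, not a production implementation: a plain dictionary stands in for the DynamoDB table, the payload field names (`reply_to`, `process_id`, `request`, `success`) are my assumptions, and `invoke_async` is an injected callable. In a real deployment it could, for instance, wrap the boto3 Lambda client's `invoke` call with `InvocationType="Event"`.

```python
import json
import uuid


class ProcessManager:
    """Keeps Saga process states and talks to services asynchronously."""

    def __init__(self, own_arn, invoke_async, store=None):
        self.own_arn = own_arn
        self.invoke_async = invoke_async  # callable(arn, json_payload)
        # A dict stands in for a persistent DynamoDB table (assumption).
        self.store = {} if store is None else store

    def start(self, service_arn, request):
        process_id = str(uuid.uuid4())
        self.store[process_id] = "PENDING"
        payload = {
            "reply_to": self.own_arn,  # where the service should answer
            "process_id": process_id,  # which process the answer belongs to
            "request": request,
        }
        self.invoke_async(service_arn, json.dumps(payload))
        return process_id

    def handle_reply(self, event):
        status = "SUCCEEDED" if event["success"] else "FAILED"
        self.store[event["process_id"]] = status


def service_handler(event, invoke_async):
    """Service Lambda Function: handle the request, then report back."""
    # Stand-in for a real booking: succeed when at least one night is requested.
    success = event["request"].get("nights", 0) > 0
    reply = {"process_id": event["process_id"], "success": success}
    invoke_async(event["reply_to"], json.dumps(reply))
```

The service knows nothing about the Saga except the return address and the process ID it was handed, which is what makes the style actor-like.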
This approach is not bad. It will scale fairly well, even though it introduces some overhead that affects performance. The latter is not a big problem, since Sagas are seldom required to perform at near-real-time speed. There are, however, two more serious problems.
First, managing access permissions is not trivial. By definition, Service Lambda Functions are unaware of the calling Sagas, so the corresponding policy cannot be established in advance. Using naming conventions and wildcards will open yet another security hole. It could be done automatically, when a new Saga is deployed, but it’s a bit tricky.
The second problem is that the Service Lambda Function will send a success/failure message when it thinks the data update is done, not when it has really happened. Increasing consistency requirements in the DynamoDB call will hurt throughput. This leaves a small probability that the resource update will still not be completed, while the Process Manager will think otherwise.
Asynchronous Saga with Resource Event Handling
The second option is depicted below:
In this case, the Process Manager still sends service requests asynchronously, but the Service Lambda Function does not send anything back. It just handles the request, which results in the corresponding resource update (e.g. putting a new record in a DynamoDB table).
In addition, the Process Manager Lambda Function listens to the resource update streams and reflects the updates in its own state accordingly. The biggest advantage of this approach is that it updates the Saga state when resource updates really happen. It does have, however, its own limitations.
First, it is no longer trivial to associate updates to a particular resource with the process. Indeed, how will I know which Saga is waiting for that DynamoDB record to be created or updated? While some conventions could be applied, the result will always be ad hoc and patchy.
The second problem is that the Process Manager now needs to be aware of the Service internals. It suddenly needs to know that a DynamoDB table is in play, and which schema it has.
The third problem is that for purely computational services the first mechanism will still be required.
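To illustrate the first two limitations, here is a rough sketch of what the Process Manager side could look like when fed by a DynamoDB Streams batch. The record layout (`Records`, `eventName`, `dynamodb.NewImage` with typed attribute values) follows the DynamoDB Streams event format and assumes the stream is configured to include new images. Everything else is an assumption of mine: the `saga_id` attribute is exactly the kind of ad hoc correlation convention mentioned above, and the state names are hypothetical.

```python
def extract_saga_updates(stream_event):
    """Return (saga_id, event_name) pairs found in a DynamoDB Streams batch."""
    updates = []
    for record in stream_event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # deletions are not relevant for this sketch
        image = record.get("dynamodb", {}).get("NewImage", {})
        # Correlation by convention: the service must store a saga_id
        # string attribute on every record it writes (assumption).
        saga_attr = image.get("saga_id")
        if saga_attr is None or "S" not in saga_attr:
            continue  # record does not belong to any Saga we can identify
        updates.append((saga_attr["S"], record["eventName"]))
    return updates


def process_manager_handler(stream_event, saga_states):
    """Advance every Saga that was waiting for the observed update."""
    for saga_id, _event_name in extract_saga_updates(stream_event):
        if saga_states.get(saga_id) == "WAITING_FOR_HOTEL":
            saga_states[saga_id] = "HOTEL_BOOKED"
    return saga_states
```

Both problems are visible in the code: the handler works only if the service cooperates by writing `saga_id`, and it has to know the storage format of a table it does not own.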
Asynchronous Saga with Event Store
At the moment, there is no managed Event Store service on the AWS cloud. However, there are a number of open-source services and frameworks which, in principle, could be packaged this way. Without getting into the details of which one to choose, let’s assume we have one and look at what an asynchronous Saga implementation could look like then:
This design exhibits a number of subtle yet essential differences.
First, the Service Lambda does not update a state resource anymore, but rather posts events into a particular Event Stream.
Second, this Event Stream has a well-defined and stable interface. There is no problem for the Process Manager to be aware of it.
Third, the Process Manager could subscribe to this Event Stream upfront. This would probably lead to the cleanest possible solution.
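The three differences can be shown with a toy, in-memory stand-in for an Event Store. This is a sketch under heavy assumptions: the class, the stream name `hotel-bookings`, the event type `HotelBooked` and the `saga_id` field are all invented for illustration, and a real Event Store would add persistence, ordering guarantees and delivery semantics that are ignored here.

```python
from collections import defaultdict


class InMemoryEventStore:
    """Toy Event Store: append-only streams plus push-style subscriptions."""

    def __init__(self):
        self._streams = defaultdict(list)
        self._subscribers = defaultdict(list)

    def subscribe(self, stream, callback):
        self._subscribers[stream].append(callback)

    def append(self, stream, event):
        self._streams[stream].append(event)
        for callback in list(self._subscribers[stream]):
            callback(event)

    def read(self, stream):
        return list(self._streams[stream])


class EventSourcedProcessManager:
    def __init__(self, store):
        self.booked = set()
        # Subscribe upfront: only the stream name and event shape are
        # needed, not the internals of the service.
        store.subscribe("hotel-bookings", self.on_event)

    def on_event(self, event):
        if event["type"] == "HotelBooked":
            self.booked.add(event["saga_id"])


def hotel_service(store, saga_id):
    """Service Lambda: posts an event instead of updating a state resource."""
    store.append("hotel-bookings", {"type": "HotelBooked", "saga_id": saga_id})
```

Compared with the resource-stream variant, the contract between the service and the Process Manager is now the event itself rather than a table schema.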
Some problems still remain. There is no Serverless-native Event Store solution on AWS yet. All available frameworks will incur certain compromises.
In addition, Event Sourcing and Command Query Responsibility Segregation (CQRS) are opinionated design patterns. Not everybody accepts them. Finally, only a few people really understand these patterns and are able to apply them correctly.
In any case, having a 100% Event Sourced system is traditionally considered over-engineering. But, perhaps, really scalable Sagas could not be implemented otherwise.
All we can say for sure now is that Serverless-native architecture is still emerging, and that some experimentation and exploration in this area are required.