This post is about an open source tool I created to solve a number of challenges in coordinating distributed microservices.
This tool has already powered more than a million workflows (and counting) and has proved to be a valuable solution for architecting, deploying and monitoring business flows at my work.
Here's a rundown of the challenges and how I approached solving them.
Microservices are the backbone of modern cloud native applications. They allow an application to be deployed as a set of containers, which are dynamically orchestrated to optimise resource utilisation.
- They allow different parts of the application to be built with different considerations: the programming language, the database used, and so on.
- They also allow different parts of the application to be deployed, scaled and maintained differently. For example, some microservices are stateful by nature, so they are deployed as containers and kept warm at all times, rather than as serverless functions.
Microservices provide the flexibility to architect, deploy and maintain different parts of an application differently.
But this flexibility brings in two major concerns (among many):
From a business perspective, one does not care how well you have crafted your microservices.
Consider the example of fulfilling a food order.
You may have crafted the application as a number of complex microservices interacting with each other under the hood, coupled with a handful of human interactions.
Regardless of the complexity of these services under the hood, the singular business objective in this case is to deliver the food on time.
Secondly, how do you ensure that all these microservices talk to each other in the right way and fulfil this business objective?
You cannot simply make one microservice call another: that adds concerns beyond a microservice's responsibilities, and it violates the principle of loosely coupled services.
What do you do when one microservice fails or is unavailable for a short while? Who is responsible for re-routing and/or retrying?
Who takes care of maintaining the overall state of the business flow in question?
These challenges, compounded by ever-changing business requirements, will end up making your microservices architecture look like spaghetti if the microservices talk to each other directly.
How do we address these concerns without losing the superpowers of micro-services?
Enter the Scheduler Agent Supervisor pattern, which does the following:
Coordinate a set of distributed actions as a single operation. If any of the actions fail, try to handle the failures transparently, or else undo the work that was performed, so the entire operation succeeds or fails as a whole. This can add resiliency to a distributed system, by enabling it to recover and retry actions that fail due to transient exceptions, long-lasting faults, and process failures.
Let’s look at how this pattern addresses the aforementioned challenges:
- Central responsibility for flow control: In this pattern, there is a central scheduler (a.k.a. the workflow manager) that coordinates a set of distributed actions as a single operation. This relieves microservices of any overall flow-control duties, putting them in the zone where they are at their best. Do business logic well. That's it.
- Maintaining application state with compensating transactions: Did some microservice fail while executing the flow? No problem: the central workflow manager records this situation, so that a retry manager can either retry the service or apply a compensating transaction to restore the correct state.
- Event-based communication (Dumb Pipes and Smart Endpoints): The central workflow manager and the microservices talk to each other through events over a durable message queue. The queue just acts as a store that passes on the events, hence the "dumb" tag. Both the workflow manager and the microservices are smart enough to understand each event and act accordingly.
Managing the overall application flow with the Scheduler Agent Supervisor pattern allows all the components (workflow manager / scheduler, retry manager and microservices) to play their parts efficiently.
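The "dumb pipes, smart endpoints" idea above can be sketched in a few lines. This is a minimal, hypothetical illustration: a plain in-memory queue stands in for a durable broker such as Amazon SQS, and the function names (`scheduler_dispatch`, `microservice_worker`) are made up for this sketch, not part of any real API.

```python
import json
import queue

# In-memory queues stand in for a durable broker (e.g. SQS).
# The queue is "dumb": it only ferries JSON events, no routing logic.
task_queue = queue.Queue()
result_queue = queue.Queue()

def scheduler_dispatch(workflow_id, step):
    """The workflow manager emits an event describing the next step."""
    task_queue.put(json.dumps({"workflow_id": workflow_id, "step": step}))

def microservice_worker():
    """A 'smart endpoint': interprets the event and runs its business logic."""
    event = json.loads(task_queue.get())
    # ... real business logic would run here ...
    result_queue.put(json.dumps({**event, "status": "COMPLETED"}))

scheduler_dispatch("order-42", "PREPARE_FOOD")
microservice_worker()
print(json.loads(result_queue.get())["status"])  # -> COMPLETED
```

Note that neither side calls the other directly: the scheduler and the worker only agree on the shape of the event, which is what keeps them loosely coupled.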
OK, this is nice. Now we have a central orchestrator that takes care of the singular business objective.
But what if the workflow manager itself goes down? Who will drive the business flow forward?
How do we ensure that the workflow manager is resilient enough to survive intermittent failures?
We can attack this concern with a two-fold solution:
- Make the workflow manager persist the current state of the business flow into a fault-tolerant database.
- Spawn multiple workflow manager instances.
With this, even if one workflow manager instance dies, another responds to the events and keeps the flow moving.
In addition to resiliency, persisting the current workflow state into a database also allows for observability of the overall flow.
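One way to let multiple workflow manager instances share persisted state safely is an optimistic, versioned update: whichever instance applies a state transition first wins, and a duplicate attempt sees a stale version and backs off. This is a runnable sketch of that idea under assumptions of my own; the post's tool uses MongoDB Atlas, but sqlite stands in here, and the table and function names are hypothetical.

```python
import sqlite3

# Hypothetical state store; any fault-tolerant database works.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE workflow_state (id TEXT PRIMARY KEY, step TEXT, version INTEGER)"
)
db.execute("INSERT INTO workflow_state VALUES ('order-42', 'PREPARE_FOOD', 1)")

def advance_step(workflow_id, expected_version, next_step):
    """Optimistic update: only succeeds if no other instance got there first.

    A second workflow manager instance replaying the same event sees a
    stale version, updates zero rows, and backs off instead of
    corrupting the flow state.
    """
    cur = db.execute(
        "UPDATE workflow_state SET step = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (next_step, workflow_id, expected_version),
    )
    return cur.rowcount == 1  # True if this instance applied the transition

print(advance_step("order-42", 1, "DELIVER_FOOD"))  # first instance: True
print(advance_step("order-42", 1, "DELIVER_FOOD"))  # duplicate event: False
```

Because every transition lands in the database, the same table doubles as the observability surface: querying it shows exactly which step each workflow instance is on.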
Microservices offer unmatched flexibility & agility when it comes to architecting a cloud native application.
With the Scheduler Agent Supervisor pattern, one can implement a central, resilient scheduler that takes care of overall flow orchestration and coordinates multiple services as a single operation.
In addition, persisting the current workflow state allows a separate retry manager to either retry the failed operation or apply compensating transactions to maintain state integrity. It also makes the entire business flow observable.
The Serverless Workflow Manager project is an open source implementation of the Scheduler Agent Supervisor pattern. It uses AWS Lambda, Amazon SQS and MongoDB Atlas to provide a truly cloud native, fault-tolerant way of coordinating distributed microservices.
Serverless Workflow Manager employs a configurable JSON DSL to author workflow definitions. With this:
- You can reuse repetitive tasks with ease by sharing the same Task configuration
- You can pack additional fields into a Task configuration, which the microservice can use to handle extra business logic
- The workflow instances can be monitored and visualised in a front-end!
The solution also persists the current state of each workflow instance into the database, enabling resiliency, retries and observability of business workflows.
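To make the task-reuse idea concrete, here is a small sketch of what a JSON workflow definition with a shared task configuration could look like, and how steps might resolve it. The field names (`tasks`, `steps`, `service`, and so on) are invented for illustration; the actual DSL of Serverless Workflow Manager may differ.

```python
import json

# Hypothetical workflow definition: three steps reuse one "notify" task
# configuration instead of repeating it per step.
definition = json.loads("""
{
  "workflow": "fulfil_food_order",
  "tasks": {
    "notify": {"service": "notification-svc", "channel": "sms"}
  },
  "steps": [
    {"name": "ACCEPT_ORDER",  "task": "notify"},
    {"name": "PREPARE_FOOD",  "task": "notify"},
    {"name": "DELIVER_FOOD",  "task": "notify"}
  ]
}
""")

# Each step resolves its shared task configuration by name.
resolved = [
    {"step": s["name"], **definition["tasks"][s["task"]]}
    for s in definition["steps"]
]
print(resolved[0]["service"])  # -> notification-svc
print(len(resolved))           # -> 3
```

Keeping task configuration separate from step ordering is what makes the extra-fields bullet above cheap: a new field added to one Task configuration becomes available to every step that references it.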
You can check out the project here.