Choreography and Orchestration in microservices

Arye
Israeli Tech Radar
Published in
6 min readSep 11, 2023

Running a complex process that involves multiple services running specific tasks on multiple nodes can be achieved by using a few communication and collaboration patterns.
In this post I list possible options, describe them, provide specific examples and discuss their advantages.

Let us try to imagine how multiple services running on multiple nodes can cooperate.

A few patterns emerge.

Communication styles

Let us first describe possible communication styles.

Synchronous blocking communication: Commands

In the example pictured, we can imagine that novice fishermen use direct and explicit commands such as voice or gesture to cooperate.

Asynchonous non-blocking communication: Events and Commands

The behavior of the fish swarm however, is determined by other kinds of events such as a loud noise for example. The fishermen try to trick their prey with misleading events as well.
We can imagine as well, that an experienced team of fishermen cooperate using this pattern. By responding to changes in the environment, each member of the team knows how to react without having to be told explicitly.

Collaboration styles

Collaboration styles are orthogonal to communication styles but specific communication styles are most commonly used with specific collaboration styles.

Direct calls

Microservices use synchronous blocking communication.

This is the most common approach. It is quite appealing because of it’s simplicity. In order not to create a tightly coupled distributed monolith though, some care must be taken to make the coupling as loose as possible by implementing some sort of graceful degradation at least with retries.
Indeed, individual members of the fishing team may make decisions and try to instruct others without success.

Choreography (Event-Driven)

Microservices use asynchronous non-blocking communication.

A fish swarm instinctively reacts to events in order to escape. The fishermen make a team decision to go fishing but individuals try their best to take the most appropriate action at a given time in order to catch as much fish as possible.
The complex process of fishing, whether it is a success or a failure is implicitly defined by how each service responds to events. There can be similar fishing experiences but the execution is very often different every time.

Orchestration (Command-Driven)

If we use our fishing analogy again to describe orchestration, we can imagine that one expert fisherman is part of the team and is responsible to precisely instruct each member on what to do and when (centralized orchestration). Depending of the size of the fishing team this responsibility can be also shared by a few more experienced people responsible for small teams. Changing tactics affecting the whole fishing process (decentralized orchestration) would also be their responsibility.
The fish however are unlikely to obey any commands unless they are assimilated.

The Borg Queen — most efficient orchestrator ?

The Saga Pattern

Before we discuss implementation options let us leave fishing and consider the atomicity of a process involving many services. Typically when planning a trip for example, a hotel reservation has to be cancelled if the flight happens to be canceled. The answer to implementing transactions that span services is the Saga Pattern:

A saga is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga. If a local transaction fails because it violates a business rule then the saga executes a series of compensating transactions that undo the changes that were made by the preceding local transactions.

The execution of the happy path tasks with their local DB commits in individual services and possibly their rollback in case of failure can be implemented by using either orchestration or choreography.

Implementation options

Handling retries for long running tasks imply to persist the state of the failed tasks and add a scheduler to retry at a later time. The state of the tasks include such metadata as the status (Failed, Succeeded), the execution time, the input parameters… Although you can roll your own state management and scheduler, it might be of interest not to reinvent the wheel.

There are so many tools available in this domain that one may feel overwhelmed. To only name a few here are some possible candidates:

Netflix Conductor

Conductor is a platform created by Netflix to orchestrate workflows that span across microservices.

Apache Airflow

Airflow™ is a platform created by the community to programmatically author, schedule and monitor workflows.

Argo Workflows

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.

Celery (Canvas)

Canvas: Designing Work-flows

AWS Step Functions

Visual workflows for distributed applications

Camunda Zeebe

Cloud Native Workflow and Decision Engine
Automate processes at scale with unprecedented performance and resilience

Control M from BMC Software (Proprietary)

Control-M simplifies application and data workflow orchestration on premises or as a service. It makes it easy to build, define, schedule, manage, and monitor production workflows, ensuring visibility and reliability and improving service level agreements (SLAs).

Temporal

Temporal is a microservice orchestration platform which enables developers to build scalable applications without sacrificing productivity or reliability. Temporal server executes units of application logic, Workflows, in a resilient manner that automatically handles intermittent failures, and retries failed operations.

Help me choose!

With the knowledge of the communication and collaboration styles described above, let us try to categorize some of them.

Orchestration

Most centralized orchestrators such as Conductor provide great observability through a UI allowing to:

  • view the execution status of the whole workflow or of individual tasks
  • view a timeline graph that allows to spot tasks slowing down the whole workflow.

On the performance side, the communication style (synchronous blocking) especially when used to trigger the workers using polling as in Netflix conductor causes a significant overhead.

Choreography

Here the choice is about implementation rather than selecting a specific tool. As elegant and simple to implement as choreography may be, better observability can only be achieved by using extra implementation specific or distributed tracing solutions.

Typical workflow definitions

We have already described workflows at length without even giving one code example. Let us remedy that and address the developer experience.

Typically a workflow is represented internally as a DAG and can be described as JSON file, as code or in a specific DSL.

DSL Example with apache Airflow of tasks executed in sequence

download_launches >> get_pictures >> notify

JSON example of an image processing pipeline with Netflix Conductor

This workflow contains 2 tasks that run in parallel (FORK)


{
"name": "image_convert_resize_multipleformat",
"description": "Image Processing Workflow",
"version": 1,
"tasks": [
{
"name": "image_convert_resize_multipleformat_fork",
"taskReferenceName": "image_convert_resize_multipleformat_ref",
"type": "FORK_JOIN",
"forkTasks":[
[
<task 1 workflow>
],
[
<task 2 workflow>
]
]
},
{
"name": "image_convert_resize_multipleformat_join",
"taskReferenceName": "image_convert_resize_multipleformat_join_ref",
"type": "JOIN",
"joinOn": [
"upload_toS3_jpg_ref",
"upload_toS3_webp_ref"
]
}

],
"outputParameters": {
"fileLocationJpg": "${upload_toS3_jpg_ref.output.fileLocation}",
"fileLocationWebp": "${upload_toS3_webp_ref.output.fileLocation}"
},
"schemaVersion": 2,
"restartable": true,
"workflowStatusListenerEnabled": true,
"ownerEmail": "devrel@orkes.io",
"timeoutPolicy": "ALERT_ONLY",
"timeoutSeconds": 0,
"variables": {},
"inputTemplate": {}
}

Choreography example

When defining a choreography, specific syntax can be used to spawn other tasks instead of using client code to publish events to a topic. Example of spawning tasks in parallel with Celery Canvas:

res = group(add.s(i, i) for i in range(10))()

Architecture considerations and final thoughts

As mentioned earlier one should take advantage of existing orchestration engines but the selected tooling is not a silver bullet. On the contrary some tools may introduce performance issues when used in environments with poor network bandwidth and latency.
As seen in our fishing analogy, a process may need both centralized orchestration for each service and choreography.

The long tail of applications that do need centralized process automation have many tools available. For the choreography part, the missing observability can be filled by off the shelf distributed tracing tools.

--

--