Microservices to Workflows: The Evolution of Jet’s Order Management System

  • Order Initialization and Verifications
  • Charging / Credits / Money Management
  • Order Fulfillment Integrations
  • Order History
  • Concessions (Refunds, Returns, etc.)
incoming-streams
|> decode (DomainEvent -> Input option)
|> handle (Input -> Async<Output>)
|> interpret (Output -> Async<unit>)
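A minimal F# sketch of this three-stage pipeline, using hypothetical event and message types (the real definitions are internal to Jet’s OMS):

```fsharp
// Hypothetical types illustrating the decode/handle/interpret pipeline
// above; the actual definitions are internal to Jet's OMS.
type DomainEvent = { EventId : string; Payload : string }
type Input  = OrderPlaced of orderId : string
type Output = ChargeCustomer of orderId : string * amount : decimal

// decode: translate raw domain events into workflow inputs,
// returning None for events this workflow doesn't care about.
let decode (event : DomainEvent) : Input option =
    if event.Payload.StartsWith "order" then Some (OrderPlaced event.EventId)
    else None

// handle: run the business logic for an input, producing an output
// that *describes* what should happen next.
let handle (input : Input) : Async<Output> =
    async {
        match input with
        | OrderPlaced orderId -> return ChargeCustomer (orderId, 99.00m)
    }

// interpret: actually execute the side effects described by the output.
let interpret (output : Output) : Async<unit> =
    async {
        match output with
        | ChargeCustomer (orderId, amount) ->
            printfn "Charging %M for order %s" amount orderId
    }
```

Keeping `handle` free of side effects and pushing all effect execution into `interpret` is what lets the platform retry, deduplicate, and journal each stage independently.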
  • Idempotence — The ability to de-duplicate events (triggers) based on a unique identifier.
  • Consistency — We support multiple backing stores as our state-management layer, which means we also provide configurable consistency models based on the backing store. The state-management layer itself always mimics a strong consistency model, since we need to be able to read our own writes.
  • Deep Workflows — Workflows are represented as DAGs (Directed Acyclic Graphs) but can be nested as deeply as you need.
  • EventSourcing — All state changes are fully event sourced into the journal, without requiring developers to understand event-sourcing semantics. The system is designed to encourage thinking in terms of “events”. Note: event sourcing has always been a key component of OMS, but in the new system it is baked directly into the platform.
  • Simple Implementation — Business functionality is implemented as a workflow definition plus the corresponding step implementations. This enforces modularization and pushes developers to think through the business flow up front, before jumping into the implementation.
  • Reusability — Designed with reusability in mind but leaves enough flexibility to the developers so they can design flows as they see fit. When new workflows need to be introduced into existing flows, developers can either create new flows from existing steps or reuse at the workflow level. For example, in order to handle digital SKUs (Apple Care) within Jet, we added new workflows to handle warranty, while allowing the rest of the order workflow to remain unchanged. This allows us to iterate rapidly and deliver new functionality into the system at great pace.
  • Verification/Regression — Tooling-based verification of step behavior allows regression tests to be run quickly. We take full advantage of this capability at Jet: our workflow implementations have over 80% code coverage without dedicated hand-written tests, and new tests can be generated with the click of a button.
  • Idempotency — Provides idempotency guarantees for a workflow. Idempotency is achieved through a combination of unique identifiers configured from the workflow DSL.
  • System Scalability — The system is elastic in nature. The system can be easily scaled to improve the throughput of business flows. In our current usage, the scale of the system is limited by the number of partitioned consumer channels used to communicate between the core services.
  • Workflow Versioning — The system maintains the workflow version throughout a workflow’s execution; that is, the workflow definition doesn’t change for an instance once it has started. This lets us deploy changes to workflows without being concerned about executions in flight. Since workflows are a representation of business flows, this allows us to iterate on business flows independently.
  • Low-Level Concerns are Handled Once — Scalability, Performance, Idempotency, Retries, Error-Handling, etc. are all handled at the platform level rather than in each microservice.
  • Metrics/Monitoring — System-level metrics are easily collected, and business-flow execution is easily tracked, traced, and monitored.
  • State Management — A single source-of-truth for all state changes known as the Journal. The Journal acts as a standard log that powers the system. It also acts as a debugging tool to fully understand the history of a given workflow. Having traceability over state changes is a big difference compared to the event sourced microservices we have historically used.
  • Support for Deferred Workflows — The ability to configure a workflow definition so that only one workflow instance executes at a time. This is very challenging to achieve in a microservice-based architecture.
  • Manual Review — Workflows that are either blacklisted or hit an unexpected failure are written out for manual review, and can be inspected and resubmitted through the Visualizer (the front end for the workflow system; more on this in subsequent posts).
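To make the idempotency guarantees above concrete, here is a minimal sketch of deduplicating events on a unique identifier; the key construction shown is hypothetical and much simpler than what the real platform derives from the workflow DSL:

```fsharp
open System.Collections.Concurrent

// Minimal sketch of event deduplication by unique identifier.
// The (workflowId, eventId) key shape here is an assumption for
// illustration, not Jet's actual scheme.
let processed = ConcurrentDictionary<string, bool> ()

/// Runs the handler only the first time a given (workflowId, eventId)
/// pair is seen; redelivered duplicates become no-ops.
let deduplicated (workflowId : string) (eventId : string) (handler : unit -> unit) =
    let key = sprintf "%s:%s" workflowId eventId
    if processed.TryAdd (key, true) then handler ()

deduplicated "wf-1" "evt-42" (fun () -> printfn "handled once")
deduplicated "wf-1" "evt-42" (fun () -> printfn "this never prints")
```

In production the seen-key set would live in the journal’s backing store rather than in memory, so deduplication survives restarts and works across instances.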

Architectural Overview

The new workflow system was largely inspired by Life Beyond Distributed Transactions by Pat Helland and was designed as a two-layer architecture:

  • Infrastructure Layer — handles concerns such as scalability, idempotency & correctness, error-handling/retries, logging, metrics, etc. The goal was to solve these concerns once and not for every service or use-case.
  • Workflow Layer — Scale-agnostic; deals with the actual business implementation: the workflow DSL and the corresponding step implementations.
The system is composed of three core services:
  • Workflow Triggers (decode)
  • Workflow Executor (handle)
  • Side Effect Executor (interpret)
Figure 1: Architecture Diagram

Workflow Definition

Workflows are defined using an F# DSL (Domain Specific Language). However, the system is not strictly limited to F#; the DSL has also been demonstrated to work with other languages such as JavaScript. The workflow DSL defines a series of steps that need to be executed. A workflow definition is made up of three parts:

  1. Trigger: How should we trigger this workflow, e.g. a Kafka message, a service bus message, an EventStore stream, REST, etc.
  2. Metadata: This is what controls meta-parameters of the workflow like retries, concurrency locking, aggregates, etc.
  3. Steps: What are the critical behaviors of the workflow, i.e. what should the steps be? In what order should we execute them, should there be conditional steps, can we execute more than one step in parallel, etc.
Within a workflow, each step works with the following:
  • State: The current state that is passed between steps
  • Input/Output: A step can produce an output, which is supplied to the following step as its input.
  • Side-Effects: A list of side-effects that need to be executed by the step.
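As an illustration of the three parts above, here is a hypothetical F# data model for a workflow definition; the names and shapes are illustrative only, not Jet’s actual DSL:

```fsharp
// Hypothetical data model for a workflow definition (illustrative,
// not Jet's actual DSL).
type Trigger =
    | Kafka of topic : string
    | ServiceBus of queue : string

type Metadata = { MaxRetries : int; ConcurrencyKey : string }

type Step =
    | Step of name : string
    | Parallel of Step list                      // steps run concurrently
    | Conditional of predicate : string * Step   // step runs only if predicate holds

type WorkflowDefinition =
    { Trigger  : Trigger
      Metadata : Metadata
      Steps    : Step list }

// A sketch of an order workflow expressed in this model.
let orderWorkflow =
    { Trigger  = Kafka "order-events"                            // 1. Trigger
      Metadata = { MaxRetries = 3; ConcurrencyKey = "orderId" }  // 2. Metadata
      Steps    =                                                 // 3. Steps
        [ Step "ValidateOrder"
          Step "ChargeCustomer"
          Parallel [ Step "ReserveInventory"; Step "NotifyFulfillment" ] ] }
```

Because the definition is plain data, the platform can version it, render it in the Visualizer, and execute it without the step implementations knowing anything about ordering or parallelism.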

Visualizer

The workflow system also includes a tool called the Visualizer, which lets users get insight into workflows. The Visualizer can show details for both in-flight and historical workflows by reading the workflow journal. Shown below is an example of how the Visualizer supports inspecting any single execution of a workflow.

  • Verification & Regression Tests: Ability to verify one or more steps from a previous execution of a workflow.
  • Manual Review: Ability to inspect and resubmit failed workflows or side-effects
  • Self-Documenting: Ability to review state, input, and output for every workflow and every step at any point in the execution of the workflow.

Some Stats

Below are some of the stats from the production instance we have been running for a little over a year now. Most of these workflows are used in the standard order processing flows:

Future Extensions

The following features have been discussed as logical extensions of the workflow engine.

  • Support for Lambdas (or similar serverless functions). These functions would act as step implementations for the workflow and would be invoked by the Orchestrator.
  • A PaaS model for OMS 2.0, where the core platform is deployed once but new workflows can be uploaded into the existing running deployment. The use of .NET was the limiting factor here; however, the recently added JavaScript support makes this possible.
  • Support for .NET Core and Linux containers

Alternatives

We are aware that there are alternatives for workflow orchestration and design. We chose to build our own system because we needed:

  • Ability to maintain a separate data store to hold workflow events to recover from failures
  • State tracking and management, with the ability to replay or visualize the state at any point in the execution
  • UI for visualization of flows and in process workflows
  • Integration with our existing infrastructure, and integration with our existing technology stack (Microsoft Azure + F#)
  • Extensibility in adding new features or functionality depending on business needs
  • Scalability: the ability to scale workflow execution across multiple VMs in several regions

Conclusion

The migration from a distributed microservice-based architecture to a workflow-based one has had dramatic effects on our development, support, and design overhead. The ability to design and rationalize complex business flows as a DSL, and then implement single-responsibility steps, has had profound effects on our ability to innovate and build complex systems. The other benefits the workflow engine provides, such as tooling, troubleshooting, and scalability, are topics of future posts.
