Event Storming applied to distributed task management

Published in Linagora Engineering, Mar 11, 2019

For months, a huge amount of work has gone into making James distributed.

The last remaining part was the WebAdmin, which lets administrators manage long-running tasks (creating, stopping, watching), such as reindexing mailboxes.

It is currently represented by our TaskManager interface.

We currently have one implementation (MemoryTaskManager), which uses an ExecutorService as its back-end. This is the first issue we have to face: MemoryTaskManager has two responsibilities:

Managing Tasks
Running Tasks
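To make the two responsibilities concrete, here is a minimal sketch of an in-memory task manager backed by an ExecutorService. Every name and signature in it (SketchTaskManager, InMemoryTaskManager, the use of UUID and Runnable) is an assumption made for illustration and is not taken from the James code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical interface: names and signatures are assumptions for illustration.
interface SketchTaskManager {
    UUID submit(Runnable task);   // start a Task
    List<UUID> list();            // list known Tasks
    boolean isDone(UUID id);      // query a Task
    void cancel(UUID id);         // cancel a Task
    void await(UUID id);          // wait for a Task to terminate
}

final class InMemoryTaskManager implements SketchTaskManager {
    // Responsibility 1: managing Tasks (bookkeeping of what was submitted).
    private final Map<UUID, Future<?>> tasks = new ConcurrentHashMap<>();
    // Responsibility 2: running Tasks (the actual execution).
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    @Override
    public UUID submit(Runnable task) {
        UUID id = UUID.randomUUID();
        tasks.put(id, executor.submit(task));
        return id;
    }

    @Override
    public List<UUID> list() {
        return new ArrayList<>(tasks.keySet());
    }

    @Override
    public boolean isDone(UUID id) {
        return tasks.get(id).isDone();
    }

    @Override
    public void cancel(UUID id) {
        tasks.get(id).cancel(true);
    }

    @Override
    public void await(UUID id) {
        try {
            tasks.get(id).get();
        } catch (ExecutionException | CancellationException e) {
            // The Task failed or was cancelled: it has terminated either way.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

Because the bookkeeping map and the executor live in the same object, both only exist inside a single JVM, which is what makes this kind of design hard to distribute as is.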

Making TaskManager distributed implies being able to:

Start a Task from any node of a cluster
List every Task of the cluster
Query any Task of the cluster
Cancel any Task of the cluster
Await the termination of any Task of the cluster

Our first idea was to persist all the Tasks and designate one node to run them. But designating that node means getting the whole cluster to agree on it, and anyone who has ever read about distributed systems knows that relying on distributed consensus is generally a bad idea.

We took a step back and listed what James currently relies on:

Cassandra
RabbitMQ
A home-grown event system

We have three main concerns:

Persisting Tasks (so they can be listed): this will be Cassandra's responsibility
Communicating between nodes (to be able to pick a Task only once, and to cancel or wait for a Task): this will be done by RabbitMQ
Keeping track of a Task's state: this is a perfect fit for our event system

Sketching the system thanks to Event Storming

Event Storming is a meeting where all the concerned actors (developers, product owners, business analysts, domain experts, etc.) are present. The goal of this meeting is for every attendee to reach the same understanding of a bounded system. Event Storming usually begins by listing all the Events of a given domain, which are expressed in the past tense.

We start by representing all the Events that happen to a Task in the most basic scenario:

Created: the Event which initializes the aggregate (the consistency unit which will receive Events)
Started: the Task has been taken by a Worker
Completed: the Task has finished without errors

Then we introduce the less usual cases:

Failed: the Task has finished with errors
Cancelled: the Task has been cancelled
Died: the Task was running on a Node which has stopped
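The Events listed so far can be written down as a small data type. The sketch below is illustrative only: the names TaskEvent and TaskHistory are assumptions, and treating Cancelled as an Event that ends a Task is a simplification that only holds at this stage of the modeling.

```java
import java.util.List;

// Illustrative sketch: type and method names are assumptions, not James code.
enum TaskEvent { CREATED, STARTED, COMPLETED, FAILED, CANCELLED, DIED }

final class TaskHistory {
    // In this simple model, the state of a Task is the last Event it received.
    static TaskEvent lastEvent(List<TaskEvent> history) {
        if (history.isEmpty()) {
            throw new IllegalArgumentException("a Task history starts with CREATED");
        }
        return history.get(history.size() - 1);
    }

    // Assumption: these Events mean that no further work is expected on the Task.
    static boolean isTerminal(TaskEvent event) {
        return event == TaskEvent.COMPLETED
            || event == TaskEvent.FAILED
            || event == TaskEvent.CANCELLED
            || event == TaskEvent.DIED;
    }
}
```

Naming the Events in the past tense keeps the type a plain record of facts: nothing in it can be refused or retried.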

At this point we try to group Events by aggregate. In our case, the domain is very narrow, so we have a single aggregate: Task.

The next step of an Event Storming is to write down the Commands (in the imperative mood):

Create will create Created
Start will create Started
Complete will create Completed
Fail will create Failed
Cancel will create Cancelled

A Command, unlike an Event, can fail: it is a wish. For example, if you send a Complete Command to a Task which has not received Started, it will fail.
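The difference between a Command and an Event can be sketched as a decision function: given the last Event of a Task, a Command either produces a new Event or is rejected. The names below (Command, Event, TaskDecider, CommandRejected) are hypothetical. Apart from the rule that Complete requires Started, the allowed transitions are assumptions chosen for the example.

```java
// Illustrative sketch: names and most transition rules are assumptions.
enum Command { CREATE, START, COMPLETE, FAIL, CANCEL }

enum Event { CREATED, STARTED, COMPLETED, FAILED, CANCELLED }

final class CommandRejected extends RuntimeException {
    CommandRejected(String message) {
        super(message);
    }
}

final class TaskDecider {
    // Returns the Event produced by the Command, given the last Event of the Task
    // (null when the Task does not exist yet). Throws when the wish cannot be granted.
    static Event decide(Event last, Command command) {
        switch (command) {
            case CREATE:
                if (last == null) return Event.CREATED;
                break;
            case START:
                if (last == Event.CREATED) return Event.STARTED;
                break;
            case COMPLETE:
                if (last == Event.STARTED) return Event.COMPLETED;
                break;
            case FAIL:
                if (last == Event.STARTED) return Event.FAILED;
                break;
            case CANCEL:
                if (last == Event.CREATED || last == Event.STARTED) return Event.CANCELLED;
                break;
        }
        throw new CommandRejected(command + " is not allowed after " + last);
    }
}
```

For instance, `TaskDecider.decide(Event.CREATED, Command.COMPLETE)` throws CommandRejected, because the Task has not received Started yet.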

Matching the emerged design to reality

Now that we have reached a shared understanding, it’s time to misuse our work: as stated earlier, an Event Storming is a communication tool, not a design tool.

Anyway, we started implementing it as is:

We keep our TaskManager as the front-end, but instead of directly performing actions, it just generates Commands
When a Command is generated, the associated CommandHandler contacts Cassandra to retrieve the existing Events, rebuilds the aggregate and tries to apply the Command. On success, a new Event is generated and persisted into Cassandra
Finally, the Event is broadcast over RabbitMQ to ensure that all concerned systems update their state
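The three steps above can be sketched end to end with in-memory stand-ins. In this sketch, EventStore plays the role of Cassandra and EventBus the role of RabbitMQ. Both are hypothetical, single-threaded types written for the example and do not reflect any James, Cassandra driver or RabbitMQ client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for Cassandra: an append-only list of Events per Task.
final class EventStore {
    private final Map<String, List<String>> eventsByTask = new HashMap<>();

    List<String> load(String taskId) {
        return new ArrayList<>(eventsByTask.getOrDefault(taskId, new ArrayList<>()));
    }

    void append(String taskId, String event) {
        eventsByTask.computeIfAbsent(taskId, id -> new ArrayList<>()).add(event);
    }
}

// Hypothetical stand-in for RabbitMQ: every subscriber receives every Event.
final class EventBus {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    void broadcast(String event) {
        subscribers.forEach(subscriber -> subscriber.accept(event));
    }
}

// Handles the Start Command for one Task.
final class StartCommandHandler {
    private final EventStore store;
    private final EventBus bus;

    StartCommandHandler(EventStore store, EventBus bus) {
        this.store = store;
        this.bus = bus;
    }

    void handle(String taskId) {
        // 1. Retrieve the existing Events and rebuild the aggregate
        //    (here reduced to "what was the last Event?").
        List<String> history = store.load(taskId);
        String last = history.isEmpty() ? null : history.get(history.size() - 1);
        // 2. Try to apply the Command: it is only a wish and may be rejected.
        if (!"Created".equals(last)) {
            throw new IllegalStateException("Start rejected, last Event: " + last);
        }
        // 3. Persist the new Event, then broadcast it to all interested systems.
        store.append(taskId, "Started");
        bus.broadcast(taskId + ":Started");
    }
}
```

The handler never mutates a "current state" field: the state is always recomputed from the stored Events, and other nodes learn about the change only through the broadcast.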

We also drew some sequence diagrams to explain our system better.

They highlighted a problem: a Cancelled Event can be overridden by Completed or Failed.

Let’s imagine the following scenario:

A Client fires a Cancel Command, and the CommandHandler retrieves all the Events from Cassandra
The last Event is Started, so the CommandHandler emits Cancelled, which is persisted in Cassandra and sent to RabbitMQ
At the same time, the Worker tries to persist the Completed Event: it fails, tries again and succeeds (there is no reason to prevent a Started Task from completing)

Now you have an issue: your Task is both Cancelled and Completed.

Do you know Schrödinger’s cat, Sheldon?

This led us to add one last Event: EffectivelyCancelled.
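One possible reading of this extra Event, sketched below, is that Cancelled only records that a cancellation was requested, while EffectivelyCancelled records that the Task really stopped. This interpretation, and the names TaskEvent and TaskOutcome, are assumptions for illustration. The sketch is not the James implementation.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch under the assumption described above.
enum TaskEvent { CREATED, STARTED, COMPLETED, FAILED, CANCELLED, EFFECTIVELY_CANCELLED }

final class TaskOutcome {
    // The outcome of a Task is the first Event that actually ends it.
    // CANCELLED is skipped on purpose: it is a request, not an ending.
    static Optional<TaskEvent> outcome(List<TaskEvent> history) {
        for (TaskEvent event : history) {
            if (event == TaskEvent.COMPLETED
                || event == TaskEvent.FAILED
                || event == TaskEvent.EFFECTIVELY_CANCELLED) {
                return Optional.of(event);
            }
        }
        return Optional.empty(); // the Task has not ended yet
    }
}
```

With this reading, a history of Created, Started, Cancelled, Completed has a single outcome, Completed, so the Task is no longer both cancelled and completed.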

Conclusion and thoughts

Even though Event Storming is a great tool for reaching consensus on a shared vocabulary, it is a very poor design tool. It gives you no hints about how a system will work; it is miles away from a technical point of view.

The only use of Event Storming is to give you names to use when developing or architecting a system.

Note: Most of the code examples come from the James code base and are de facto licensed under the Apache License, Version 2.0.

Written by Gautier DI FOLCO