Event Storming applied to distributed task management
For months, James has been undergoing major work to make it distributed.
The last remaining part was the WebAdmin, which allows managing long-running tasks (creating, stopping, watching), such as reindexing mailboxes.
It is currently represented via our TaskManager interface:
We currently have one implementation (MemoryTaskManager) which uses ExecutorService as back-end. This is the first issue we have to face: MemoryTaskManager has two responsibilities:
- Managing Tasks
- Running Tasks
Making TaskManager distributed implies being able to:
- Start a Task from any node of a cluster
- List all the Tasks of the cluster
- Query any Task of the cluster
- Cancel any Task of the cluster
- Await the termination of Tasks of the cluster
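To make these five capabilities concrete, here is a minimal sketch of what such a contract could look like. Every name in it (ClusterTaskManager, TaskStatus, the method names and signatures) is an assumption made for illustration; it is not the actual James TaskManager interface.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.TimeoutException;

// Hypothetical unit of work; the real James Task type is richer.
interface Task {
    void run() throws InterruptedException;
}

// Hypothetical statuses, chosen only for this sketch.
enum TaskStatus { WAITING, IN_PROGRESS, COMPLETED, FAILED, CANCELLED }

// One method per capability listed above. Nothing here says on which node
// a Task runs: the caller only talks to "the cluster".
interface ClusterTaskManager {
    // Start a Task from any node of the cluster.
    UUID submit(Task task);

    // List all the Tasks of the cluster.
    List<UUID> list();

    // Query any Task of the cluster (empty when the id is unknown).
    Optional<TaskStatus> status(UUID taskId);

    // Cancel any Task of the cluster.
    void cancel(UUID taskId);

    // Await the termination of a Task, wherever it is running.
    TaskStatus await(UUID taskId, Duration timeout)
            throws InterruptedException, TimeoutException;
}
```

The point of the sketch is that managing Tasks and running Tasks can be separated: an implementation of this contract does not have to own an ExecutorService.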
Our first idea was to persist all the Tasks, and designate one node to run them. Anyone who has ever read about distributed systems knows that relying on distributed consensus, such as agreeing on a single designated node, is generally a bad idea.
We took a step back and listed what James currently relies on to work:
We have three main concerns:
- Persisting Tasks (to be listed), which will be Cassandra's responsibility
- Communicating between nodes (to be able to pick a Task only once, cancel or wait for a Task), which will be done by RabbitMQ
- Keeping track of a Task's state, which is a perfect fit for our event system
Sketching the system thanks to Event Storming
Event Storming is a meeting where all the concerned actors (developers, product owners, business analysts, domain experts, etc.) are present. The goal of this meeting is for all attendees to reach the same understanding of a bounded system. Event Storming usually begins by listing all the Events of a given domain, which are expressed in the past tense.
We try to represent all the Events happening to a Task in the most basic scenario:
- Created: the Event which initializes the aggregate (the consistency unit which will receive Events)
- Started: the Task has been taken by a Worker
- Completed: the Task has finished without errors
Then we introduce the less usual cases:
- Failed: the Task has finished with errors
- Cancelled: the Task has been cancelled
- Died: the Task was running on a node which has stopped
At this point we try to group Events by aggregate. In our case, the domain is very narrow, so we have a single aggregate: Task.
The next step of an Event Storming is to write down the Commands (in the imperative):
- Create will create Created
- Start will create Started
- Complete will create Completed
- Fail will create Failed
- Cancel will create Cancelled
A Command, contrary to an Event, can fail: it is a wish. For example, if you try to send a Complete Command to a Task which has not received Started, it will fail.
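The difference between a Command and an Event can be sketched in a few lines. The Event and Command names below come from the lists above; the classes (TaskAggregate, CommandRejected) and the exact transition rules are assumptions made for this example, the only rule stated in the text being that Complete fails before Started.

```java
import java.util.ArrayList;
import java.util.List;

// Events are facts, named in the past tense.
// DIED is listed for completeness: no Command above produces it.
enum Event { CREATED, STARTED, COMPLETED, FAILED, CANCELLED, DIED }

// Commands are wishes, named in the imperative.
enum Command { CREATE, START, COMPLETE, FAIL, CANCEL }

// Raised when the aggregate refuses a Command.
class CommandRejected extends RuntimeException {
    CommandRejected(Command command, Event lastEvent) {
        super(command + " is not allowed after " + lastEvent);
    }
}

class TaskAggregate {
    private final List<Event> history = new ArrayList<>();

    // Rebuild the aggregate from Events that already happened.
    static TaskAggregate replay(List<Event> past) {
        TaskAggregate aggregate = new TaskAggregate();
        aggregate.history.addAll(past);
        return aggregate;
    }

    List<Event> history() {
        return new ArrayList<>(history);
    }

    // Turn a Command into an Event, or reject it (assumed rules).
    Event handle(Command command) {
        Event last = history.isEmpty() ? null : history.get(history.size() - 1);
        Event produced = null;
        switch (command) {
            case CREATE:
                if (last == null) produced = Event.CREATED;
                break;
            case START:
                if (last == Event.CREATED) produced = Event.STARTED;
                break;
            case COMPLETE:
                if (last == Event.STARTED) produced = Event.COMPLETED;
                break;
            case FAIL:
                if (last == Event.STARTED) produced = Event.FAILED;
                break;
            case CANCEL:
                if (last == Event.CREATED || last == Event.STARTED) produced = Event.CANCELLED;
                break;
        }
        if (produced == null) {
            throw new CommandRejected(command, last);
        }
        history.add(produced);
        return produced;
    }
}
```

Sending COMPLETE to a freshly created aggregate throws CommandRejected, while the Events already in the history are never questioned: they happened.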
Matching the emerged design to reality
Now that we have reached a shared understanding, it's time to misuse our work: as stated earlier, an Event Storming is a communication tool, not a design tool.
Anyway, we began implementing it as is:
- We keep our TaskManager as front-end, but instead of directly performing actions, it just generates Commands
- When a Command is generated, the associated CommandHandler contacts Cassandra to retrieve the associated existing Events, rebuilds the aggregate and tries to apply the Command. In case of success, a new Event is generated and persisted into Cassandra
- Finally, the Event is sent over RabbitMQ in broadcast to ensure that all concerned systems update their state
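The three steps above can be sketched as follows. This is a simplified illustration, not the James implementation: InMemoryEventStore stands in for Cassandra, InMemoryEventBus stands in for the RabbitMQ broadcast, Events are plain strings, and all class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Stand-in for the Cassandra-backed store: Events are kept per Task id, in order.
class InMemoryEventStore {
    private final Map<String, List<String>> eventsByTask = new HashMap<>();

    List<String> load(String taskId) {
        return new ArrayList<>(eventsByTask.getOrDefault(taskId, new ArrayList<>()));
    }

    void append(String taskId, String event) {
        eventsByTask.computeIfAbsent(taskId, id -> new ArrayList<>()).add(event);
    }
}

// Stand-in for the RabbitMQ broadcast: every subscriber sees every Event.
class InMemoryEventBus {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    void broadcast(String message) {
        subscribers.forEach(subscriber -> subscriber.accept(message));
    }
}

// Handles the Start Command only, to keep the sketch short.
class StartCommandHandler {
    private final InMemoryEventStore store;
    private final InMemoryEventBus bus;

    StartCommandHandler(InMemoryEventStore store, InMemoryEventBus bus) {
        this.store = store;
        this.bus = bus;
    }

    // Returns true when the Command was accepted.
    boolean handle(String taskId) {
        // 1. Retrieve the existing Events.
        List<String> history = store.load(taskId);
        // 2. Rebuild the aggregate; here its state is just the last Event.
        String last = history.isEmpty() ? "" : history.get(history.size() - 1);
        // 3. Try to apply the Command: Start only makes sense right after Created.
        if (!last.equals("Created")) {
            return false;
        }
        // 4. Persist the new Event, then broadcast it to every node.
        store.append(taskId, "Started");
        bus.broadcast(taskId + ":Started");
        return true;
    }
}
```

The TaskManager front-end would only build the Command and hand it to such a handler; every node subscribed to the bus updates its own view when the Event arrives.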
We have also used some sequence diagrams in order to explain our system better:
It highlighted a problem: a Cancelled Event can be overridden by Completed or Failed.
Let’s imagine the following use-case:
- A Client fires a Cancel Command, and the CommandHandler retrieves all the Events from Cassandra
- The last Event is Started, so the CommandHandler fires Cancelled, persists it in Cassandra and contacts RabbitMQ
- At the same time, the Worker tries to persist the Completed Event; it fails, tries again and succeeds (there is no reason to prevent a Started Task from completing)
Now you have an issue: your Task is both Cancelled and Completed.
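The race can be reproduced with a small sketch. It assumes that the Worker's first write fails because the history changed under it (an optimistic, version-checked append); the text only says that the write fails and is retried, so this mechanism, like the VersionedHistory and Worker classes, is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only history with a version check:
// an append is refused when the writer's view of the history is stale.
class VersionedHistory {
    private final List<String> events = new ArrayList<>();

    synchronized List<String> read() {
        return new ArrayList<>(events);
    }

    synchronized boolean append(int expectedVersion, String event) {
        if (events.size() != expectedVersion) {
            return false; // someone else wrote in between
        }
        events.add(event);
        return true;
    }
}

class Worker {
    // Naive rule from the scenario: a Task that was Started may complete.
    static void complete(VersionedHistory history, int knownVersion) {
        int version = knownVersion;
        while (!history.append(version, "Completed")) {
            // The write failed: reload, check the naive rule, and try again.
            List<String> current = history.read();
            if (!current.contains("Started")) {
                return; // nothing to complete
            }
            version = current.size();
        }
    }
}
```

If the history is Created, Started when the Worker last read it, and a Cancelled Event is appended before the Worker writes, the retry still succeeds and the history ends with Cancelled followed by Completed.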
Do you know Schrödinger’s Cat?
Which leads us to add one last Event: EffectivelyCancelled.
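The text does not detail how EffectivelyCancelled is used, so the following is only one possible reading: Cancelled records that a cancellation was asked for, and the Task only counts as cancelled once EffectivelyCancelled confirms it. The Status values, the StatusProjection class and the rule that the first terminal Event wins are all assumptions of this sketch.

```java
import java.util.List;

// Hypothetical statuses derived from the Event history.
enum Status { WAITING, IN_PROGRESS, CANCEL_REQUESTED, COMPLETED, FAILED, CANCELLED }

class StatusProjection {
    // Fold the history into a single status; the first terminal Event wins.
    static Status of(List<String> history) {
        Status status = Status.WAITING;
        for (String event : history) {
            if (isTerminal(status)) {
                break;
            }
            switch (event) {
                case "Started":
                    if (status == Status.WAITING) status = Status.IN_PROGRESS;
                    break;
                case "Cancelled":
                    // Only a request: the Worker may still finish first.
                    status = Status.CANCEL_REQUESTED;
                    break;
                case "Completed":
                    status = Status.COMPLETED;
                    break;
                case "Failed":
                    status = Status.FAILED;
                    break;
                case "EffectivelyCancelled":
                    status = Status.CANCELLED;
                    break;
                default:
                    break; // "Created" leaves the Task WAITING
            }
        }
        return status;
    }

    private static boolean isTerminal(Status status) {
        return status == Status.COMPLETED
                || status == Status.FAILED
                || status == Status.CANCELLED;
    }
}
```

Under this reading, a history ending with Cancelled then Completed is simply a Task that completed before the cancellation took effect, and the ambiguity disappears.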
Conclusion and thoughts
Even if Event Storming is a great tool to reach consensus on a shared vocabulary, it is a very poor design tool. It will not give you hints on how a system will work; it is miles away from a technical point of view.
The only use of Event Storming is to give you names to use when developing or architecting a system.
Note: Most of the code examples come from the James code base and are de facto licensed under the Apache License, Version 2.0.