Introducing Horde — a distributed Supervisor in Elixir
For the last few weeks I’ve been toiling away at Horde. Horde is a distributed supervisor and a distributed registry, built using the magic of delta-CRDTs.
Horde was heavily inspired by Swarm and built to address some perceived shortcomings of Swarm’s design.
You should use Horde when you want a global supervisor (or global registry, or some combination of the two) that supports automatic fail-over, dynamic cluster membership, and graceful node shutdown.
Supervisor / Registry API
Horde mirrors the API of Elixir’s Supervisor and Registry as closely as possible. Under the hood, it runs a regular Supervisor on each node and distributes processes among the cluster’s nodes using a simple hash function (à la Swarm).
Aside from some additional code to glue together supervisors into a distributed supervisor, Horde should be a drop-in replacement for Elixir’s Supervisor or Registry.
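To make the hash-based distribution mentioned above concrete, here is a minimal sketch using only the Erlang standard library. The module name HashSketch and the function choose_node/2 are hypothetical and are not part of Horde's API; Horde's actual placement logic may differ.

```elixir
defmodule HashSketch do
  # Hypothetical helper (not Horde's code): maps a child id to one node.
  # Sorting first means every node computes the same answer, regardless of
  # the order in which it learned about the other members.
  def choose_node(child_id, nodes) when nodes != [] do
    sorted = Enum.sort(nodes)
    # :erlang.phash2/2 returns an integer in 0..(length(sorted) - 1)
    index = :erlang.phash2(child_id, length(sorted))
    Enum.at(sorted, index)
  end
end

nodes = [:"a@host", :"b@host", :"c@host"]
chosen = HashSketch.choose_node(:worker_1, nodes)
```

Because the choice depends only on the child id and the member list, any node can work out where a process belongs without asking a coordinator.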
Inspired by Swarm
While Swarm’s global process registry blurs the line between a registry and a supervisor (for example, using register_name/5, Swarm will start and restart a process for you, but not otherwise supervise your process), Horde maintains a strict separation of supervisor from registry.
This is the biggest difference between Swarm and Horde and resolves some problems stemming from Swarm’s blurring of these concepts.
Thus, Horde provides both Horde.Supervisor and Horde.Registry, and it’s up to you as the developer to decide how you want to mix and match them, just like a regular supervisor and registry. This is a big advantage if, for example, you want to run a whole supervision tree underneath Horde.Supervisor and not just individual processes.
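As a reference point, this is what mixing and matching looks like with Elixir's standard Supervisor and Registry on a single node. The sketch deliberately uses only the standard library; Counter and DemoRegistry are hypothetical names, and the exact options Horde.Supervisor and Horde.Registry accept are not shown here and should be checked against Horde's documentation.

```elixir
defmodule Counter do
  use Agent

  # The process registers itself under `name` via the registry,
  # while the supervisor only cares about starting and restarting it.
  def start_link(name) do
    Agent.start_link(fn -> 0 end, name: via(name))
  end

  def increment(name), do: Agent.update(via(name), &(&1 + 1))
  def value(name), do: Agent.get(via(name), & &1)

  defp via(name), do: {:via, Registry, {DemoRegistry, name}}
end

children = [
  # The registry is started first so that Counter can register with it.
  {Registry, keys: :unique, name: DemoRegistry},
  Supervisor.child_spec({Counter, :hits}, id: :hits)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

Counter.increment(:hits)
count = Counter.value(:hits)
```

Supervision and naming stay separate concerns here: the supervisor never needs to know how the process is looked up, and the registry never restarts anything.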
Node shutdown: process draining
Swarm’s use of distributed Erlang to determine which nodes are in the swarm is a limiting factor when implementing graceful shutdown.
This is solved in Horde by not relying on distributed Erlang in this way. Nodes still need to be connected with distributed Erlang, but Horde cluster membership is registered separately, enabling one to remove a node from the horde while leaving it connected to the Erlang cluster for a time. Processes will then be drained from the removed node and restarted on another node in the cluster.
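The draining idea can be sketched as a pure function over a membership list that is tracked separately from Erlang node connectivity. Everything below (DrainSketch, drain/3, the assignments map) is a hypothetical illustration under that assumption, not Horde's implementation.

```elixir
defmodule DrainSketch do
  # `assignments` maps child id => node it currently runs on.
  # `members` is the horde membership, independent of Node.list/0.
  # Children on the leaving node are reassigned; all others stay put.
  def drain(assignments, members, leaving) do
    # Fails with a match error if no member would remain to take over.
    remaining = [_ | _] = members |> List.delete(leaving) |> Enum.sort()

    Map.new(assignments, fn
      {child_id, ^leaving} ->
        index = :erlang.phash2(child_id, length(remaining))
        {child_id, Enum.at(remaining, index)}

      unchanged ->
        unchanged
    end)
  end
end

members = [:"a@host", :"b@host", :"c@host"]
assignments = %{worker_1: :"a@host", worker_2: :"b@host", worker_3: :"b@host"}

drained = DrainSketch.drain(assignments, members, :"b@host")
```

In this sketch the removed node can stay connected to the Erlang cluster while its processes are restarted elsewhere, because membership is just data rather than a side effect of the network connection.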
Horde is built on delta-CRDTs. CRDTs (conflict-free replicated data types) are guaranteed to converge eventually, and Horde communicates aggressively to keep divergence to a minimum. Guaranteed convergence is a very handy property to have when building distributed systems.
Being able to say with certainty (barring any bugs in the implementation, of course) that the distributed state will *always* converge does give some peace of mind.
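To illustrate why convergence is guaranteed, here is a minimal state-based grow-only counter. It is a much simpler relative of the delta-CRDTs Horde uses, written only for illustration; the GCounter module is hypothetical and is not taken from Horde.

```elixir
defmodule GCounter do
  # State is a map of node => number of increments observed on that node.
  def new, do: %{}

  def increment(counter, node), do: Map.update(counter, node, 1, &(&1 + 1))

  # Merging takes the per-node maximum. This operation is commutative,
  # associative and idempotent, so replicas that exchange state in any
  # order, any number of times, end up with the same value.
  def merge(a, b), do: Map.merge(a, b, fn _node, x, y -> max(x, y) end)

  def value(counter), do: counter |> Map.values() |> Enum.sum()
end

# Two replicas diverge by counting locally...
a = GCounter.new() |> GCounter.increment(:a) |> GCounter.increment(:a)
b = GCounter.new() |> GCounter.increment(:b)

# ...and converge once they merge, regardless of direction.
merged_ab = GCounter.merge(a, b)
merged_ba = GCounter.merge(b, a)
```

Since merging never loses information and the order of merges does not matter, there is no conflict to resolve, which is exactly the property that gives the peace of mind described above.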
Maturity / Feedback
So how mature is Horde’s code? At this point in time I would say it’s at alpha, bordering on beta. Horde needs people to test it out and report their findings. It’s still early days, but we will be testing it intensively in the coming weeks to make sure that it’s production-ready and as bug-free as possible.