How to test an Elixir cluster of nodes using slaves

Lorenzo Sinisi
Published in elixir-bytes
Jan 17, 2018

The beauty of Elixir/Erlang is that it ships with facilities to connect multiple nodes into a cluster that can really take advantage of message passing, giving you a distributed environment out of the box. It doesn’t matter whether the recipient process is on the same node or on another one: the VM will deliver the message in both cases.

There are different ways to connect Elixir nodes in production. The one I will focus on is the following:

- you create a release
- start multiple nodes (maybe 3)
- connect the nodes together (maybe using the function Node.connect/1)

Everything looks good and works, right?

But what if you have some background job processing that has to start ‘on startup’ and should be executed on one and only one node at a time? What happens between the moment you start the 3 nodes and the moment they get connected to each other? What happens to globally registered processes that are started automatically?

Well, what happens is that the conflicting processes will be killed and, depending on the restart strategy used, some or just one of them will be restarted.

I had the problem of having a sort of cron job running in a cluster of Elixir nodes that should be executed only when the current node is connected to a cluster. In other words, in the time between starting the node and the moment it connects to another one, a specific function should not be executed.

Why?

This is what can happen otherwise:
1. you create a release
2. you start multiple nodes (let’s say ‘a’, ‘b’ and ‘c’)
3. each node automatically starts a process named something like :background_work
4. at this point a, b and c will each have started their own globally registered process called :background_work
5. when the nodes are connected into a cluster, the processes called :background_work will be killed because of the naming conflict, and only one of them will be restarted
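To make step 3 concrete, here is a minimal sketch of what such a globally registered worker might look like. The module name and the :background_work name are hypothetical, taken from the example above:

```elixir
# Hypothetical sketch of the worker started automatically on each node.
defmodule BackgroundWork do
  use GenServer

  # Registering under {:global, :background_work} makes the name
  # cluster-wide: while the nodes are disconnected, each one happily
  # registers its own copy; once they connect, :global detects the
  # conflict and kills all but one of the processes.
  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :ok, name: {:global, :background_work})
  end

  @impl true
  def init(:ok), do: {:ok, %{}}
end
```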

The problem here is what happens between points 4 and 5.

If the :background_work process starts immediately and begins calling some function, we could end up with the same function operating on the same row of the same database at the same time, for example. While this is not a problem for Erlang itself, it becomes a problem when you have to deal with the side effects (e.g. updating records in the DB).

In order to prevent the background process from starting to process jobs before making sure that the current node is in a cluster, we could write some dirty code like the following:

number_of_nodes = Enum.count([node() | Node.list()])
if number_of_nodes > 1, do: background_processing()

What do we gain from this code?

If the number of nodes is greater than one, the current node is not isolated: it is indeed part of a cluster, so it is allowed to run the distributed task.
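If you need this guard in more than one place, you could wrap it in a small helper. This is just a sketch; the ClusterGuard name and run_if_clustered function are hypothetical, not part of any library:

```elixir
# Hypothetical helper wrapping the cluster check shown above.
defmodule ClusterGuard do
  # Runs the given zero-arity function only when this node is part of
  # a cluster; returns nil when the node is isolated.
  def run_if_clustered(fun) when is_function(fun, 0) do
    if Enum.count([node() | Node.list()]) > 1, do: fun.()
  end
end
```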

Now, how do we test this code? How do we test that a function runs only in a cluster and not in one isolated node?

We could use slaves. Thankfully this was not the first time somebody had a problem like this, so I was able to take inspiration and write a small Elixir module to run a cluster of Elixir nodes from an iex shell.

Please note that you should never, never, ever, never, never, never run the code below in production. And remember that you should always, always, always remember to kill the slaves once done with your tests.

# open an iex shell and type the following:
:ok = :net_kernel.monitor_nodes(true)
_ = :os.cmd('epmd -daemon')
{:ok, _master} = Node.start(:master@localhost, :shortnames)
setup_slaves = fn(limit) ->
  Enum.each(1..limit, fn(index) ->
    :slave.start_link(:localhost, 'slave_#{index}')
  end)
  [node() | Node.list()]
end
# setup_slaves.(5) # call this function to create a cluster of 6 nodes

Please note that you should never, never, ever, never, never, never run the above code in production. And remember that you should always, always, always remember to kill the slaves once done with your tests.
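As for killing the slaves: Erlang’s :slave module provides :slave.stop/1 for exactly this. A minimal cleanup sketch, assuming the slaves were started from the current node as shown above:

```elixir
# Stop every slave node connected to this one.
# Node.list/0 returns the currently connected nodes;
# :slave.stop/1 shuts each slave down.
Enum.each(Node.list(), fn slave -> :slave.stop(slave) end)
```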

Let’s wrap this up with a practical example.

We have a function in a module that, for some reason, should not run on multiple nodes at the same time (not even when they are disconnected from each other); this function should run only when we are in a cluster of nodes, so running it in isolation is not allowed. Imagine, again, a scheduled job.

We could update this module, which I created by running ‘mix new multinode’:

# lib/multinode.ex
defmodule Multinode do
  @moduledoc """
  Documentation for Multinode.
  """

  def hello do
    list_of_nodes = [node() | Node.list()]
    number_of_nodes = list_of_nodes |> Enum.count()

    if number_of_nodes > 1, do: "Yey! I am running in a cluster :D"
  end
end

How do we test this?

One idea could be writing these two test cases:

defmodule MultinodeTest do
  use ExUnit.Case
  doctest Multinode

  test "does not greet the world in isolation"
  test "greets the world in a cluster"
end

We can add a helper to test/test_helper.exs, creating the module TestCluster:

defmodule TestCluster do
  def start_slaves(number) do
    :ok = :net_kernel.monitor_nodes(true)
    _ = :os.cmd('epmd -daemon')
    Node.start(:master@localhost, :shortnames)

    Enum.each(1..number, fn(index) ->
      :slave.start_link(:localhost, 'slave_#{index}')
    end)

    [node() | Node.list()]
  end

  def disconnect(list) do
    Enum.map(list, &Node.disconnect(&1))
  end
end

ExUnit.start()

At this point we can update our tests and they should pass 📗

defmodule MultinodeTest do
  use ExUnit.Case
  doctest Multinode

  test "does not greet the world in isolation" do
    assert Multinode.hello() == nil
  end

  test "greets the world in a cluster" do
    slaves = TestCluster.start_slaves(3)
    assert Multinode.hello() == "Yey! I am running in a cluster :D"
    TestCluster.disconnect(slaves) # cleanup the processes
  end
end

This is how easy it is to write tests for code that should run only in a cluster of nodes 🌴 Let me know if you have any questions and whether I should wrap this helper into a library 🙏
