How to overcome distributed testing pitfalls with Elixir

Vinicius Teles Bezerra
Published in grandcentrix
Apr 9, 2019 · 5 min read

Testing distributed systems can be really hard. A developer who wants to write a test of this magnitude first needs to understand, or at least grasp, some complex concepts like concurrency, parallel programming, nodes, availability and eventual consistency. These concepts are really important in the context of IoT: having a clear view of the state of entities and being fault tolerant are key to building a fantastic IoT solution.

Photo by Hans-Peter Gauster on Unsplash

But what does that mean? If we want to communicate with a device using a low-latency protocol like MQTT, for example, Elixir will be responsible for dealing with a lot of messages at the same time. Elixir can take a lot of hits and it really doesn’t care. Actually, the Erlang VM was designed for exactly that. Can you guess what’s about to happen if a message breaks the application because the application was not designed to handle it? If you think that the entire application will break and shut down, or that the error will flood the entire stack so that no more requests are accepted, you are totally wrong.

Elixir is fault tolerant by design and can handle unexpected errors in isolated processes, which means that other components of the application won’t be affected. Of course, programmers still have to design their own applications accordingly, but the necessary tools are there. We just need to use them.

Elixir is a relatively new programming language, so there is no single agreed-upon library to reach for when you start writing distributed tests. Of course, being an open-source-oriented language helps here, with different libraries and solutions to choose from.

Obviously, you could try to write distributed tests with just ExUnit but this might lead to the following problems:

  1. The process might not be available.
  2. Testing connections across the cluster of nodes is not possible out of the box.
  3. Testing node availability is not possible.
  4. State changes in different nodes are non-deterministic.
  5. How would you even start testing the monitoring of nodes?

So how can we test these kinds of scenarios?

Well, I think you guessed right: we need to spin up nodes for that. We can just spin up a local and a remote node. After this, you can rely on what Erlang has built over the years, using a retry strategy or waiting for some time in non-deterministic scenarios.
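A retry strategy for non-deterministic scenarios can be sketched in a few lines. This is only a minimal illustration, not the approach of any particular library: the module name RetrySketch, the function eventually/3 and its default values are hypothetical, and it assumes the check returns {:ok, value} once the expected state has been reached.

```elixir
defmodule RetrySketch do
  # Hypothetical helper: calls `fun` until it returns {:ok, value},
  # sleeping `delay_ms` between attempts, for at most `attempts` tries.
  def eventually(fun, attempts \\ 10, delay_ms \\ 100)

  # No attempts left: give up.
  def eventually(_fun, 0, _delay_ms), do: {:error, :timeout}

  def eventually(fun, attempts, delay_ms) do
    case fun.() do
      {:ok, value} ->
        {:ok, value}

      _not_ready ->
        Process.sleep(delay_ms)
        eventually(fun, attempts - 1, delay_ms)
    end
  end
end
```

In a test, the check could be something like asking whether a remote node already shows up in Node.list(), wrapped so that it returns {:ok, node} once it does.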

If you are dealing with logic that is encapsulated in processes, there is no problem at all. In your tests you can use assert_receive or assert_received. Both are macros provided by ExUnit: assert_receive waits up to a timeout for a matching message, while assert_received only checks the messages that are already in the mailbox. Which one to use depends on your problem.
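The difference between the two macros is easiest to see side by side. The following is a small sketch under assumptions: the module name MailboxSketch, the {:done, 42} message and the timeouts are made up for illustration, and the macros are imported directly from ExUnit.Assertions so the snippet can run outside a test case. In a real suite they would be used inside an ExUnit test.

```elixir
defmodule MailboxSketch do
  import ExUnit.Assertions

  def run do
    parent = self()

    # A worker that replies asynchronously after a short delay.
    spawn(fn ->
      Process.sleep(50)
      send(parent, {:done, 42})
    end)

    # assert_receive waits (here up to 500 ms) for a matching message
    # and binds the variables used in the pattern.
    assert_receive {:done, value}, 500

    # assert_received does not wait: the message must already be in the mailbox.
    send(self(), :ping)
    assert_received :ping

    value
  end
end
```

When the reply comes from another process or node, assert_receive with an explicit timeout is usually the safer choice, since the message may not have arrived yet when the assertion runs.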

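So far, so good, but how do you build the testing suite for distributed systems? First of all, let’s see how to spin up a node. If we want to use iex, we can start it with the --sname option followed by a node name of our choice, for example iex --sname foo (the name foo is only an illustration).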

IEx is Elixir’s interactive shell: if we just run iex, we get a process waiting for commands to execute. The --sname option assigns a short name to the node and starts it as a distributed node.

By using this command, we can spin up a node on the fly, but we actually want to build a set of tools to do this using the Elixir built-in functions. By investigating the Node module in more depth, you can find out that it offers a function called start:
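A call to Node.start might look like the sketch below. The module name DistributionSketch, the function ensure_distributed/1 and the node name passed to it are illustrative assumptions. The sketch also assumes that epmd, the Erlang port mapper daemon, is already running (for example started with epmd -daemon); without it, Node.start returns an error tuple.

```elixir
defmodule DistributionSketch do
  # Turns the current VM into a distributed node with a short name,
  # or reports why that was not possible.
  def ensure_distributed(name) do
    if Node.alive?() do
      # Already distributed, nothing to do.
      {:ok, Node.self()}
    else
      case Node.start(name, :shortnames) do
        {:ok, _pid} -> {:ok, Node.self()}
        {:error, reason} -> {:error, reason}
      end
    end
  end
end
```

Calling DistributionSketch.ensure_distributed(:primary) returns either {:ok, node_name} or {:error, reason}, so a test helper can decide whether to continue or fail early.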

It looks like we got something working here, but if you take a closer look at the function documentation, you can see that it is meant for turning the current non-distributed node into a distributed one. So, what about spinning up further nodes? Maybe we should just do this with some bash magic, right? That might be feasible, but what if we could integrate this command into our application code instead? For that, we can use Exexec. This library provides an interface to control and execute OS processes from Elixir.

Below we can see a version of Exexec in action solving this problem for us:
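As a rough sketch of the idea, and not necessarily the exact code used in the project, the snippet below builds an elixir command that starts a named node and hands it to Exexec.run/2. The module name RemoteNodeSketch, the node name and the Process.sleep(:infinity) expression that keeps the node alive are assumptions. The Exexec call only happens when the library is actually available as a dependency.

```elixir
defmodule RemoteNodeSketch do
  # Builds the OS command that boots a second, named node.
  # -e evaluates the given expression; sleeping forever keeps the node up.
  def command(name) do
    "elixir --sname #{name} -e 'Process.sleep(:infinity)'"
  end

  # Starts the node through Exexec when the library is loaded.
  # On success, Exexec.run/2 is expected to return {:ok, pid, os_pid}.
  def spawn_node(name) do
    if Code.ensure_loaded?(Exexec) do
      apply(Exexec, :run, [command(name), []])
    else
      {:error, :exexec_not_available}
    end
  end
end
```

In a project that lists Exexec in its dependencies, the apply/3 indirection is unnecessary and Exexec.run(command, []) can be called directly; it is only used here so the sketch compiles without the dependency.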

The -e flag tells elixir to evaluate the given expression and, of course, Exexec is responsible for running the command for us. We even get back the pid inside the VM as well as the pid of the OS process.

If you try to run the command on the fly, without any further setup, you should get an error.

The error shows a failed gen_server call to a named process, :exec, which is not running yet. To fix it, we just need to make sure that Exexec has its process running. Therefore, we just need to add these two lines:
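One plausible way to do this, offered here as an assumption rather than as the exact two lines, is to start the application and everything it depends on with Application.ensure_all_started/1, for instance in test_helper.exs before ExUnit.start(). The module name AppBootSketch below is hypothetical.

```elixir
defmodule AppBootSketch do
  # Starts `app` and all applications it depends on.
  # Returns {:ok, started_apps} or {:error, reason}.
  def ensure_started(app) do
    case Application.ensure_all_started(app) do
      {:ok, started} -> {:ok, started}
      {:error, reason} -> {:error, reason}
    end
  end
end

# In test_helper.exs this could look like (assuming :exexec is a dependency):
#   {:ok, _} = AppBootSketch.ensure_started(:exexec)
#   ExUnit.start()
```

If the application is already running, the returned list of started applications is simply empty, so the call is safe to repeat.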

Explicitly starting the applications you depend on is a common practice and a really elegant way to ensure that the apps you need are up and running.

However, this is just the initial state of our testing suite for distributed systems and we still have a lot to do. For this reason, in a later post we are going to shed light on the following subjects:

  1. How to enhance our testing suite for distributed systems.
  2. How to use retry strategies for non-deterministic scenarios.
  3. How to communicate with different nodes in the cluster.
  4. How to be sure that a process is alive and kicking.

In this post we could only address the tip of the iceberg. As there are more subjects we have to cover in detail, we will take a closer look at further solutions to these kinds of problems in a follow-up post.

Photo by Tanner Vines on Unsplash

Testing a distributed system can be hard, but Erlang is a battle-tested technology. One of its goals is to solve the distribution problem, so we should use this to our advantage and build a testing suite with what the language has to offer.
