split_brain.ex

The challenges of building a small distributed Elixir Application

iacobson
Qixxit Development
6 min read · Mar 26, 2019

When building a small application, it is quite easy to overlook the aspects related to (horizontal) scalability, especially in the early phases. You want it ready and deployed in production. You can build on top of it later on. And I'm not saying that is a bad approach. But keeping in mind that one day your application may run on more than one server can save you some trouble.

This is especially true with Elixir if your application relies on GenServers, Agents or, more generally, state-holding processes. Let’s see why by building a small demo application.

The Context

The demo application is called … MyApp 😊 and it holds and changes the state of an Entity. To keep things simple, the Entity has just two attributes: id and state. The application exposes 2 functions: one to set the Entity state and the other to get the Entity.

The main requirements of the app are:

  • each Entity has an associated process (GenServer)
  • the function that gets the Entity should return it from the GenServer state. Let’s assume that reading it from the database is a very expensive operation and we don’t want to query the database every time we need to return the Entity.

The Entity state has some possible values: :initialized | :working | :finalized

The Entity model may look like this:
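A minimal sketch of such a model, assuming a plain struct named MyApp.Entity (the name matches the iex output later in the article; the typespecs are an assumption added for illustration):

```elixir
defmodule MyApp.Entity do
  @moduledoc "An Entity with an id and one of three possible states."

  # The three states listed above.
  @type state :: :initialized | :working | :finalized
  @type t :: %__MODULE__{id: String.t(), state: state()}

  defstruct [:id, :state]
end
```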

Assumptions

  • the app is part of a bigger, more complex application.
  • there is no real database. We hold the id and state for 2 entities in 2 files: entity_1 and entity_2. Their id values are the same as the file names.
  • we do not handle the workers stopping. For a real application, this will probably create memory issues over time.
  • we’ll handle mostly the success responses. Dealing with errors is not in the scope of this exercise.

Building the app

We start with a supervised Elixir app: mix new my_app --sup. The application supervisor starts a dynamic supervisor that handles the workers (one for every Entity). On state change, the worker triggers the database update.

Persistence

This is the persistence layer of the application.

We won’t go into much detail here. The Database module knows how to get an Entity from a persisted file and how to set its state there. You can check the full code on GitHub: https://github.com/iacobson/blog_state_holder.
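For readers who don't want to open the repository, here is a rough sketch of what a file-backed Database module could look like. It is not the code from the repo. The file layout (one file per Entity, named after its id, containing the state as text), the "db" default directory and the optional dir argument are all assumptions made for illustration. The Entity struct is repeated so the snippet runs on its own.

```elixir
defmodule MyApp.Entity do
  defstruct [:id, :state]
end

defmodule MyApp.Database do
  # Assumption: every Entity lives in a file named after its id,
  # and the file content is the state name as plain text.
  @default_dir "db"
  @states %{
    "initialized" => :initialized,
    "working" => :working,
    "finalized" => :finalized
  }

  # Reads the file and builds the Entity, or returns an error tuple.
  def get_entity(id, dir \\ @default_dir) do
    with {:ok, content} <- File.read(Path.join(dir, id)),
         {:ok, state} <- Map.fetch(@states, String.trim(content)) do
      {:ok, %MyApp.Entity{id: id, state: state}}
    else
      :error -> {:error, :invalid_state}
      {:error, reason} -> {:error, reason}
    end
  end

  # Overwrites the file with the new state.
  def set_entity_state(id, state, dir \\ @default_dir)
      when state in [:initialized, :working, :finalized] do
    File.write(Path.join(dir, id), Atom.to_string(state))
  end
end
```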

The Entity Worker

A GenServer is responsible for the Entity processing. For the demo app, it will handle just 2 actions:

  • get an Entity from the server state
  • set the state for an Entity

The worker starts as a named process. The Entity id will be the unique server name. On the worker start, its state is set by getting the Entity from the database. All subsequent requests to get_entity/1 will return the Entity from the worker state. The state will be updated only by set_entity_state/2.
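The behaviour described above can be sketched as follows. The module name MyApp.Worker is hypothetical, and the tiny MyApp.Database stand-in (returning plain maps) only exists so the snippet runs on its own. Only get_entity/1 and set_entity_state/2 come from the article.

```elixir
defmodule MyApp.Database do
  # Stand-in for the persistence layer (assumption, not the real module).
  def get_entity(id), do: {:ok, %{id: id, state: :initialized}}
  def set_entity_state(_id, _state), do: :ok
end

defmodule MyApp.Worker do
  use GenServer

  # The Entity id becomes the locally registered process name.
  def start_link(entity_id) do
    GenServer.start_link(__MODULE__, entity_id, name: name(entity_id))
  end

  def get_entity(entity_id), do: GenServer.call(name(entity_id), :get_entity)

  def set_entity_state(entity_id, state) do
    GenServer.call(name(entity_id), {:set_entity_state, state})
  end

  @impl true
  def init(entity_id) do
    # The database is read only once, when the worker starts.
    case MyApp.Database.get_entity(entity_id) do
      {:ok, entity} -> {:ok, entity}
      {:error, reason} -> {:stop, reason}
    end
  end

  @impl true
  def handle_call(:get_entity, _from, entity) do
    # Served from the process state, no database access.
    {:reply, {:ok, entity}, entity}
  end

  def handle_call({:set_entity_state, state}, _from, entity) do
    # Persist first, then update the in-memory copy.
    :ok = MyApp.Database.set_entity_state(entity.id, state)
    {:reply, :ok, %{entity | state: state}}
  end

  # Note: String.to_atom/1 on arbitrary input can exhaust the atom table;
  # acceptable here because the demo only has two known ids.
  defp name(entity_id), do: String.to_atom(entity_id)
end
```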

The Worker Supervisor

This is just a regular dynamic supervisor. The only thing worth mentioning is the get_or_start_worker/1 function. It takes an Entity id as its argument and tries to start a worker child. If that process already exists, it does not return an error but the pid of the running worker. This way, any request starts the worker if it is not already running.

Why not rely on the supervisor alone? If an Entity worker process stops unexpectedly, the supervisor takes care of it and restarts it. But if the whole app gets restarted, no worker exists until something starts it, and without get_or_start_worker/1 future calls concerning that specific Entity would fail.
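A possible shape for this supervisor, as a sketch. The module names MyApp.WorkerSupervisor and MyApp.Worker are hypothetical, and the worker here is a minimal Agent-based stand-in so the snippet runs on its own. The part that matters is how get_or_start_worker/1 treats the :already_started error as a success.

```elixir
defmodule MyApp.Worker do
  # Stand-in worker (assumption): a named Agent instead of the real GenServer.
  use Agent

  def start_link(entity_id) do
    Agent.start_link(fn -> entity_id end, name: String.to_atom(entity_id))
  end
end

defmodule MyApp.WorkerSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts), do: DynamicSupervisor.init(strategy: :one_for_one)

  # Starts the worker for the given Entity id, or returns the pid of the
  # one that is already running under the same name.
  def get_or_start_worker(entity_id) do
    case DynamicSupervisor.start_child(__MODULE__, {MyApp.Worker, entity_id}) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
      {:error, reason} -> {:error, reason}
    end
  end
end
```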

The public functions

The MyApp module holds the functions exposed to the rest of our application, or to other applications (e.g. in an umbrella project).

First, it gets or starts the Entity worker process. Once we know the worker is up, we send the request to get or set the state of an Entity.
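The two-step flow (get or start the worker, then send the request) could look roughly like this. MyApp.get_entity/1 and MyApp.set_entity_state/2 are the functions used in the iex sessions below. The MyApp.WorkerSupervisor and MyApp.Worker modules are hypothetical names, replaced here by small Agent-based stand-ins so the snippet runs on its own.

```elixir
defmodule MyApp.WorkerSupervisor do
  # Stand-in (assumption): starts a named Agent or reuses the running one.
  def get_or_start_worker(entity_id) do
    case Agent.start(fn -> :initialized end, name: String.to_atom(entity_id)) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end
end

defmodule MyApp.Worker do
  # Stand-in (assumption): reads and writes the Agent state.
  def get_entity(entity_id) do
    {:ok, %{id: entity_id, state: Agent.get(String.to_atom(entity_id), & &1)}}
  end

  def set_entity_state(entity_id, state) do
    Agent.update(String.to_atom(entity_id), fn _old -> state end)
  end
end

defmodule MyApp do
  alias MyApp.{Worker, WorkerSupervisor}

  # Step 1: make sure the worker is up. Step 2: ask it for the Entity.
  def get_entity(entity_id) do
    with {:ok, _pid} <- WorkerSupervisor.get_or_start_worker(entity_id) do
      Worker.get_entity(entity_id)
    end
  end

  # Same two steps for the state update.
  def set_entity_state(entity_id, state) do
    with {:ok, _pid} <- WorkerSupervisor.get_or_start_worker(entity_id) do
      Worker.set_entity_state(entity_id, state)
    end
  end
end
```

If starting the worker fails, the with expression returns the error tuple unchanged, so callers see the failure instead of a crash.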

Time to play

Start the server: `iex -S mix`

iex(1)> MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :initialized}}
iex(2)> MyApp.set_entity_state("entity_1", :working)
:ok
iex(4)> MyApp.set_entity_state("entity_2", :finalized)
:ok
iex(5)> MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :working}}

Everything works as expected. We can get Entities and set their state in a consistent manner.

Time to grow

The app does its job properly. But the traffic on our platform keeps increasing, and we decide it’s time to spin up a new server. Initially, everything looks good. The CPU and memory look fine again on both servers and it seems we made the right decision.

But soon we start to find inconsistencies with the Entity state.

Testing the app on two servers

Start 2 application servers and let’s see what happens.

server_1: MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :initialized}}
server_2: MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :initialized}}

Both servers will get their initial state from the same database, which is good.

But what happens if state-updating calls for the same Entity are handled by both servers?

server_1: MyApp.set_entity_state("entity_1", :working)
:ok
server_2: MyApp.set_entity_state("entity_1", :finalized)
:ok

Let’s get the Entity on server_2:

server_2: MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :finalized}}

Good! Now the same request on server_1, which, of course, should return the same state for the same Entity. Surprise!

server_1: MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :working}}

For the same Entity, server_1 returns the :working state, while server_2 returns :finalized. That’s bad! Many things can go wrong from here. We quickly realize that the optimization that returns the Entity from the worker's state is not reliable anymore.

What can we do?

Probably the quickest fix would be to get the Entities directly from the database, at the cost of performance.

But is there a way to keep the previous performance and still be able to run the app on multiple servers? It’s an Elixir application, so yes!

Connecting the nodes

From the Elixir docs:

Elixir ships with facilities to connect nodes and exchange information between them. In fact, we use the same concepts of processes, message passing and receiving messages when working in a distributed environment because Elixir processes are location transparent.

We first need to start the app with a unique name. Instead of server_1 and server_2 I will use the classic foo and bar. Open two terminal tabs and start the apps:

iex --name foo@127.0.0.1 -S mix
iex(foo@127.0.0.1)1>
iex --name bar@127.0.0.1 -S mix
iex(bar@127.0.0.1)1>

And connect the two nodes:

iex(foo@127.0.0.1)1> Node.connect(:"bar@127.0.0.1")
true

Running the same example as above, we end up with the same inconsistencies. But that is normal. Even if the nodes are connected, the worker names are still registered locally, so each node is not aware of the processes started on the other nodes.

Enter :global

The GenServer name can be registered globally and all connected nodes will be aware of it. Under the hood, this functionality makes use of the Erlang :global module.

In terms of code, the changes are minor:

The only update in the code is wrapping the GenServer name in a tuple with :global as the first element.
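To make the change concrete, here is a stripped-down worker that registers itself through :global. The module name MyApp.GlobalWorker and the in-memory initial state are assumptions for illustration. The relevant part is the {:global, name} tuple, used both when starting the process and when calling it.

```elixir
defmodule MyApp.GlobalWorker do
  use GenServer

  def start_link(entity_id) do
    # {:global, term} registers the name through Erlang's :global module,
    # making it visible on all connected nodes.
    GenServer.start_link(__MODULE__, entity_id, name: global_name(entity_id))
  end

  # The same tuple is used to address the process, wherever it lives.
  def get_entity(entity_id) do
    GenServer.call(global_name(entity_id), :get_entity)
  end

  @impl true
  def init(entity_id) do
    # Simplification: the real worker would load the Entity from the database.
    {:ok, %{id: entity_id, state: :initialized}}
  end

  @impl true
  def handle_call(:get_entity, _from, entity) do
    {:reply, {:ok, entity}, entity}
  end

  defp global_name(entity_id), do: {:global, String.to_atom(entity_id)}
end
```

A second start_link/1 call for the same id, from any connected node, returns {:error, {:already_started, pid}}, which is the case get_or_start_worker/1 already handles.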

That’s it! Restart and connect the nodes and let’s try it:

iex(bar@127.0.0.1)2> MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :initialized}}
iex(foo@127.0.0.1)4> MyApp.set_entity_state("entity_1", :working)
:ok
iex(bar@127.0.0.1)3> MyApp.get_entity("entity_1")
{:ok, %MyApp.Entity{id: "entity_1", state: :working}}

We updated the state on the foo node and we retrieve the correct Entity state on bar. That’s because we now have a single {:global, :entity_1} process living on one of the nodes and being accessible to all other connected nodes.

Not quite there!

We made good progress, but there’s one last important point: the moment when we connect the nodes.

Why is that important? Imagine the following. Just before the nodes are connected, each of them receives a request to update the state. At this point, even though the GenServer registers globally, there’s no connection between the nodes. So each node spawns its own process and we end up in an inconsistent state again.

Even worse, when connecting the nodes, the one we are connecting to detects a name conflict and terminates the entity_1 process.

global: Name conflict terminating {:entity_1, #PID<15181.143.0>}

This way we may end up with a single globally registered process, which has a wrong state. Again, this is bad!

The solution to this problem is to connect the nodes as early as possible in the application startup. We do this for our demo app that runs locally. But how you do it in a production environment depends a lot on the infrastructure, deployment, etc. so it is out of the scope of this article.

In our case we know that we have 2 servers and we know their names, making things simpler.

  • we add the list of nodes in the config.exs:
    use Mix.Config
    config :my_app, nodes: [:"foo@127.0.0.1", :"bar@127.0.0.1"]
  • update the application.ex file to connect the nodes before starting any process:
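A sketch of what that application.ex could look like, assuming the :nodes config key shown above. The supervisor names MyApp.Supervisor and MyApp.WorkerSupervisor are hypothetical, and the dynamic supervisor is started inline here to keep the snippet self-contained.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # Connect first, so no worker can register a name before the
    # nodes know about each other.
    connect_nodes()

    children = [
      {DynamicSupervisor, strategy: :one_for_one, name: MyApp.WorkerSupervisor}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end

  # Tries to connect to every configured node except the current one.
  # Node.connect/1 returns true, false or :ignored; the result is not
  # checked here because error handling is out of scope for the demo.
  defp connect_nodes do
    :my_app
    |> Application.get_env(:nodes, [])
    |> Enum.reject(&(&1 == Node.self()))
    |> Enum.each(&Node.connect/1)
  end
end
```

A node that is still down when the other one boots simply fails to connect at this point. It connects on its own startup, since both nodes run the same code.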

This ensures that the nodes are connected before any worker can be started. Now we truly have a unique Entity worker process for all nodes that holds a correct state.

Conclusion

Horizontally scaling Elixir apps may be tricky if you are not prepared for it and the application was designed for a single server. Luckily, Elixir offers out-of-the-box tools to run distributed apps and register their processes globally.

Those are powerful tools to achieve concurrency and keep consistency at the same time.

PS: thanks to Tobias Kräntzer for introducing me to Erlang :global concepts.
