Elixir — Supervisors: A Conceptual Understanding
Software written in Elixir / Erlang is known for its fault-tolerance. Fault-tolerance is simply the idea that the system can continue to function (even if only partially) in the face of errors that the developer has not anticipated (and should not have to anticipate).
One of the key concepts supporting fault-tolerance is the idea of supervisors and supervision trees. A supervisor is simply a process whose sole job is to monitor other processes — a.k.a., its children — such that when those other processes die, the supervisor is informed and can take appropriate action.
To properly utilize the power of Elixir, and to build fault-tolerant applications, it is necessary to understand supervisors and how they function. In this post, we’ll go over how supervisors work, not how to use the Supervisor module. In other words, this post is about what is going on underneath the (great) abstractions provided by OTP.
Supervisors Are GenServer Processes
A supervisor is a process just like any other. In fact, not only is a supervisor just a process, but it is a process that, underneath the hood, implements the GenServer behavior. This implies that we can use our understanding of GenServers to analyze how a supervisor works! In this post, we’ll leverage that understanding to look at the difference between a regular ol’ GenServer and a Supervisor.
Starting up a Supervisor
Let me re-state that a supervisor is a process whose sole purpose for existing is to monitor other processes and restart them when necessary. This implies a few things about a supervisor:
- A Supervisor process must be started, since it is a process.
- A Supervisor process must be linked to the processes it will supervise, so that it can receive messages when those processes terminate.
- A Supervisor must trap exits, so that it doesn’t go down when one of the processes that it supervises goes down.
- A Supervisor process must keep track of how the processes it monitors were created, so that it can restart them when needed.
Starting a Supervisor and Trapping Exits
Remember, this post is about how supervisors work, not about how to use the Supervisor module. So, let’s get to work on implementing our Supervisor.
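Here’s a minimal sketch of what that can look like (the post’s original listing isn’t reproduced here, so details like the empty initial state are assumptions):

```elixir
defmodule GenServerSupervisor do
  use GenServer

  # Requirement 1: the supervisor is started like any other GenServer process.
  def start_link do
    GenServer.start_link(__MODULE__, [])
  end

  def init(_args) do
    # Requirement 3: trap exits, turning this into a system process so that
    # exit signals from linked processes arrive as messages in our mailbox.
    Process.flag(:trap_exit, true)
    {:ok, []}
  end
end
```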
In the simple code above, we have actually managed to knock out two of our four requirements. Specifically, by calling GenServerSupervisor.start_link/0 we are starting our process, and that process will call GenServerSupervisor.init/1, which makes it trap exits. This code will look very familiar if you have worked with GenServers, but I do want to point out the call to Process.flag(:trap_exit, true). That call turns a normal process into a system process. The difference is that exit signals will now be delivered to the system process’s mailbox as regular messages. If we did not do this, then our supervisor process would also terminate when one of the supervised processes terminates.
Linking Supervised Processes
Another requirement of a Supervisor is that the processes it is meant to supervise must be linked to it, so that the supervisor can receive those exit signals in its mailbox.
The way this is achieved in Elixir is by passing a list of child specifications to the supervisor. The supervisor will then start each child process, and in so doing it will link that child process to itself.
A child specification is a very simple concept. It is a three-element tuple: the first element is a module, the second is a function within that module, and the third is the list of arguments that the function takes. Our GenServerSupervisor.init/1 function will iterate over the child specification list. Let’s modify our code to show that:
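Here’s a sketch of the modified supervisor. Again, this is a reconstruction rather than the original listing, so the exact shape of the helpers (and the handling of both {:ok, pid} and bare-pid return values) are assumptions:

```elixir
defmodule GenServerSupervisor do
  use GenServer

  # We now pass in a list of {module, function, args} child specifications.
  def start_link(child_specs) when is_list(child_specs) do
    GenServer.start_link(__MODULE__, child_specs)
  end

  def init(child_specs) do
    Process.flag(:trap_exit, true)
    # Start (and link to) every child. We don't use the result yet.
    start_children(child_specs)
    {:ok, []}
  end

  # Returns a list of {pid, child_spec} tuples.
  defp start_children(child_specs) do
    Enum.map(child_specs, fn spec -> {start_child(spec), spec} end)
  end

  defp start_child({module, function, args}) do
    # The started process is expected to link itself to us, e.g. by using
    # GenServer.start_link/3 or spawn_link/3 internally.
    case apply(module, function, args) do
      {:ok, pid} -> pid
      pid when is_pid(pid) -> pid
    end
  end
end
```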
We’ve added a lot of code here, so let’s step through it little by little. First, notice that the way we start our Supervisor process has changed. Instead of calling GenServerSupervisor.start_link/0, we now call GenServerSupervisor.start_link/1 and pass in an argument. That argument is a list of child specifications, as we already covered.
Notice, too, that the GenServerSupervisor.init/1 function calls start_children/1 before returning. Remember that our start_link/1 function is executed in the calling process, but the init/1 function is executed in the new process created by start_link/1. This is important because we are trying to link the supervisor process with the processes it will supervise, so the children must be started from within init/1, which runs in the supervisor process itself.
Now, let’s take a look at our helper functions, start_children/1 and start_child/1. Notice that start_children/1 calls start_child/1 on each of the child specifications. There are two main items to point out here:
The start_child/1 function calls apply/3, which takes a module, the function you want to call, and the list of arguments you want to call that function with. Normally, what is happening here is that we are creating another process, generally a GenServer worker. (No need to get into supervision trees just yet.) Therefore, the tuple that start_child/1 receives will likely look like this:
{GenServerWorker, :start_link, []}
There is one crucial piece that is implied and hidden from us in this code: whatever process we start via apply/3 must be created with a linking function such as spawn_link/3 (or GenServer.start_link/3, which links for us). Otherwise, the supervisor and the created process will not be linked, and therefore it won’t be supervised.
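For illustration, a worker matching that child spec might look something like the sketch below. Its internals are made up for the example; the relevant point is that GenServer.start_link/3 links the new worker to the process that calls it, which in our case is the supervisor:

```elixir
defmodule GenServerWorker do
  use GenServer

  # GenServer.start_link/3 spawns the worker and links it to the caller
  # (our supervisor), so the supervisor receives an exit signal if we die.
  def start_link do
    GenServer.start_link(__MODULE__, :ok)
  end

  def init(:ok), do: {:ok, nil}
end
```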
Finally, notice that our start_children/1 function returns a list of tuples, each consisting of a pid and the child specification that was used to start that process. We are not doing anything with this yet, so let’s get on that.
Managing the Supervised Processes
So, at this point, we’ve managed to start the supervisor, trap exits, start the supervised processes, and link the supervisor with the supervised processes.
All that’s left to do is make the supervisor restart processes when they terminate, based on logic we implement.
This requires that our supervisor keep track of which child specification was used to create each supervised process. Why? Because that process will send an exit signal to our supervisor, in the form of:
{:EXIT, pid, reason}
Since the pid is included in that message, our supervisor can look up how the terminated process was started and restart it the same way.
Let’s take a look at some code:
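Here’s a sketch of the result. As before, this is a reconstruction; in particular, building the state map directly inside start_children/1 and restarting on every exit reason are simplifications:

```elixir
defmodule GenServerSupervisor do
  use GenServer

  def start_link(child_specs) when is_list(child_specs) do
    GenServer.start_link(__MODULE__, child_specs)
  end

  def init(child_specs) do
    Process.flag(:trap_exit, true)
    # The state is now a map of pid => child_spec.
    {:ok, start_children(child_specs)}
  end

  # A supervised process died: look up how it was started, start a
  # replacement, and swap the old pid for the new one in our state.
  def handle_info({:EXIT, pid, _reason}, state) do
    case Map.pop(state, pid) do
      {nil, state} ->
        {:noreply, state}

      {child_spec, state} ->
        new_pid = start_child(child_spec)
        {:noreply, Map.put(state, new_pid, child_spec)}
    end
  end

  defp start_children(child_specs) do
    Map.new(child_specs, fn spec -> {start_child(spec), spec} end)
  end

  defp start_child({module, function, args}) do
    case apply(module, function, args) do
      {:ok, pid} -> pid
      pid when is_pid(pid) -> pid
    end
  end
end
```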
First, we are now capturing the state properly for our supervisor inside init/1. This state is simply a map, where each key is a pid and each value is the child spec that was used to start that process.
Secondly, we have implemented a handle_info/2 callback. This function is invoked whenever our supervisor receives an exit signal message. It looks up the child spec for the pid that terminated, starts a replacement process the same way the original was started, and stores the new pid in its state.
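To see the restart behaviour end to end, a quick (hypothetical) IEx-style session could look like the following; :sys.get_state/1 is used purely to peek inside the supervisor for the demo:

```elixir
{:ok, sup} = GenServerSupervisor.start_link([{GenServerWorker, :start_link, []}])

# Peek at the supervisor's state to find the worker's pid.
[{worker_pid, _spec}] = Map.to_list(:sys.get_state(sup))

# Kill the worker; the supervisor traps the resulting exit signal...
Process.exit(worker_pid, :kill)
Process.sleep(50) # give the supervisor a moment to process the exit

# ...and has already started a replacement under the same child spec.
[{new_pid, _spec}] = Map.to_list(:sys.get_state(sup))
new_pid != worker_pid
#=> true
```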
Done
Finito! That’s a minimal implementation of a supervisor. The point was not to build a production-ready implementation. Instead, it was meant to remove the magic from what a supervisor is, and to understand what is going on conceptually when we use the proper Supervisor implementation that OTP provides.