Supervisors and workers in 10 minutes

Comfortable learning curve for Elixir — part 5

Gaspar Chilingarov
Learn Elixir
8 min read · Jul 17, 2017

TL;DR: A Supervisor gives you a standard, declarative way to specify how processes should be started, monitored for failures, and restarted. Do the exercises in the article to get a real feel for how process linking, monitoring, and crashing work. ❤❤❤ :)

Supervision is another usage pattern in Erlang and Elixir. Like the ‘let it crash’ ideology, it has its roots in Erlang’s telecom past. It is less risky to let one misbehaving process crash and be properly restarted than to take the whole system into an unknown, potentially wrong state.

Linked processes and crashes

Earlier I said that processes can be linked at start, and that one crashing process takes down all processes linked to it. On its own this is of little use, because it stops the whole application. That is not the desired outcome, so there is a way to catch crashes and act on them.

But let’s start slowly and add complexity gradually. Below are several examples of how linked processes work.

Save this example in an .exs file and execute it with the elixir command.

IO.puts "before"
spawn_link fn() -> :ok end
Process.sleep 100
IO.puts "after"

You will see it printing

before
after

because the linked process, although it exited, exited normally rather than crashing.

Let’s run the following example with crashing code and see what happens.

IO.puts "before"
spawn_link fn() -> 1 = 2 end
Process.sleep 100
IO.puts "after"

The output contains only before, because the main process, which runs the command IO.puts "after", was automatically terminated as well.

before
** (EXIT from #PID<0.74.0>) an exception was raised:
** (MatchError) no match of right hand side value: 2
01_linked_crash.exs:3: anonymous fn/0 in :elixir_compiler_0.__FILE__/1
12:16:29.067 [error] Process #PID<0.77.0> raised an exception
** (MatchError) no match of right hand side value: 2
01_linked_crash.exs:3: anonymous fn/0 in :elixir_compiler_0.__FILE__/1

The Process.sleep 100 call has a purpose here. If you remove it, the second script can sometimes print after as well, because spawn_link returns immediately and the failing match in the new process may happen only after the second IO.puts has already run.

So if all linked processes terminate when just one process in the chain terminates with an error, how do we recover from errors?

Wait, there is hope!

And it’s called :trap_exit. Actually, that’s why there was a site called trapexit.org :)

Erlang/Elixir processes have flags that control their behaviour, and one of them is :trap_exit. When it is set to true, the crash of a linked process does not terminate the process; instead, the process receives a message.

Let’s try it out.

Process.flag :trap_exit, true
IO.puts "before"
spawn_link fn() -> :ok end
Process.sleep 100
IO.puts "after"
receive do
  msg -> IO.inspect msg, label: "received message"
end

Run it and it will output

before
after
received message: {:EXIT, #PID<0.77.0>, :normal}

So the parent process received a message telling it that child #PID<0.77.0> exited with reason :normal.

Now modify the code to make the child crash (replace :ok with 1 = 2, as in the earlier example) and run it again. The output will look like this:

before
12:28:03.706 [error] Process #PID<0.77.0> raised an exception
** (MatchError) no match of right hand side value: 2
02_linked_crash_trap_exit.exs:5: anonymous fn/0 in :elixir_compiler_0.__FILE__/1
after
received message: {:EXIT, #PID<0.77.0>,
 {{:badmatch, 2},
  [{:elixir_compiler_0, :"-__FILE__/1-fun-0-", 0,
    [file: '02_linked_crash_trap_exit.exs', line: 5]}]}}

You now see a detailed message about the crash and, as in the previous example, also after. It means the parent process survived the child’s crash.
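The two outcomes above can be told apart by pattern matching on the exit reason. The following is a minimal sketch rather than part of the original scripts; the module name ExitDemo and the function run_and_wait are made up for illustration, and the crash is simulated with exit(:boom) instead of a failing match to keep the log output quiet.

```elixir
defmodule ExitDemo do
  # Hypothetical helper: runs `fun` in a linked process and reports how it ended.
  def run_and_wait(fun) do
    # Trap exits so that a crash arrives as a message instead of killing us.
    previous = Process.flag(:trap_exit, true)
    pid = spawn_link(fun)

    result =
      receive do
        {:EXIT, ^pid, :normal} -> :finished
        {:EXIT, ^pid, reason} -> {:crashed, reason}
      end

    # Restore the flag to whatever it was before the call.
    Process.flag(:trap_exit, previous)
    result
  end
end

IO.inspect ExitDemo.run_and_wait(fn -> :ok end)
IO.inspect ExitDemo.run_and_wait(fn -> exit(:boom) end)
```

The first call should report :finished and the second {:crashed, :boom}. Pinning the pid with ^pid makes sure the helper reacts only to the process it started itself.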

Parenting made easy

Simple Parent → Children relationship

Let’s make a more complex example and write code which runs one parent and 3 children. Each child process counts down from its limit, one step every 500 ms, and exits when it reaches zero. Eventually, when a child exits, the parent will restart that specific child.

Here is the code for a first script that just runs the child processes and shows their exit messages.

defmodule Parent do
  def spawn_link(limits) do
    spawn_link(__MODULE__, :init, [limits])
  end

  def init(limits) do
    Process.flag :trap_exit, true

    Enum.each(limits, fn(limit_num) ->
      spawn_link(Child, :init, [limit_num])
    end)

    loop()
  end

  def loop() do
    receive do
      msg ->
        IO.puts "Parent got message: #{inspect msg}"
        loop()
    end
  end
end


defmodule Child do
  def init(limit) do
    loop(limit)
  end

  def loop(0), do: :ok
  def loop(n) when n > 0 do
    IO.puts "Process #{inspect self()} counter #{n}"
    Process.sleep 500
    loop(n-1)
  end
end


Parent.init([2,3,5])

Process.sleep 2_000

Run it and you will see that once the processes count down to 0 they exit, and the parent receives an :EXIT message for each of them.

Manually implement children restart

Now let’s add supervision and child restarts. We need to keep track of which child PID corresponds to which limit, so that we can restart it correctly later. This means passing information from Parent.init to Parent.loop. For ease of lookup we will store this information in a map.

defmodule Parent do
  def spawn_link(limits) do
    spawn_link(__MODULE__, :init, [limits])
  end

  def init(limits) do
    Process.flag :trap_exit, true

    children_pids = Enum.map(limits, fn(limit_num) ->
      pid = run_child(limit_num)
      {pid, limit_num}
    end) |> Enum.into(%{})

    loop(children_pids)
  end

  def loop(children_pids) do
    receive do
      {:EXIT, pid, _} = msg ->
        IO.puts "Parent got message: #{inspect msg}"

        {limit, children_pids} = pop_in children_pids[pid]
        new_pid = run_child(limit)

        children_pids = put_in children_pids[new_pid], limit

        IO.puts "Restart children #{inspect pid}(limit #{limit}) with new pid #{inspect new_pid}"

        loop(children_pids)
    end
  end

  def run_child(limit) do
    spawn_link(Child, :init, [limit])
  end
end

defmodule Child do
  def init(limit) do
    IO.puts "Start child with limit #{limit} pid #{inspect self()}"
    loop(limit)
  end

  def loop(0), do: :ok
  def loop(n) when n > 0 do
    IO.puts "Process #{inspect self()} counter #{n}"
    Process.sleep 500
    loop(n-1)
  end
end

Parent.init([2,3,5])

Process.sleep 10_000

The most interesting part here is Parent.loop: it removes the old mapping from children_pids and then adds a new one under the newly spawned pid.
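To see the bookkeeping in isolation, here is a small sketch of how pop_in and put_in behave on a plain map. The keys are hypothetical atoms standing in for pids, so the snippet can run without spawning anything.

```elixir
# Hypothetical data: atoms stand in for the pids used as map keys.
children = %{child_a: 2, child_b: 3}

# pop_in returns the removed value together with the updated map.
{limit, children} = pop_in(children[:child_a])
IO.inspect {limit, children}

# put_in stores the same limit under the key of the replacement process.
children = put_in(children[:child_c], limit)
IO.inspect children

# A key that is not in the map yields nil instead of raising.
{missing, _unchanged} = pop_in(children[:unknown])
IO.inspect missing
```

The last step hints at a weak spot of the manual approach: an :EXIT message from a pid that is not in the map would produce a nil limit, which a more careful implementation would have to handle.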

This was a simplified example of how an OTP supervisor works. The actual logic is much more complex and easy to get wrong, so it is better to use battle-proven code instead.

Common problems and why you should not reinvent the wheel

Try to change Parent.init([2,3,5]) to Parent.init([-2,3,5]) and run it.

You will immediately recognize the problem: one of the children crashes constantly, and the endless restarts create CPU load as well.

The correct behaviour would be to restart the problematic child a few times and, if the problem does not go away, do something about it rather than restart it again and again.

This is where the OTP Supervisor comes in handy.
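One way to express "a few times, then give up" is to remember when recent restarts happened and refuse once there are too many inside a time window. The sketch below is a simplification under that assumption, not the OTP implementation; the module RestartLimiter, the function record, and its default numbers are hypothetical.

```elixir
defmodule RestartLimiter do
  # Hypothetical helper, not an OTP API: decides whether one more restart is allowed.
  # `history` is a list of earlier restart timestamps in milliseconds.
  # Returns {:ok, new_history} or :give_up.
  def record(history, now_ms, max_restarts \\ 3, window_ms \\ 5_000) do
    # Keep only the restarts that happened inside the time window.
    recent = Enum.filter(history, fn t -> now_ms - t < window_ms end)

    if length(recent) >= max_restarts do
      :give_up
    else
      {:ok, [now_ms | recent]}
    end
  end
end

{:ok, h1} = RestartLimiter.record([], 0)
{:ok, h2} = RestartLimiter.record(h1, 100)
{:ok, h3} = RestartLimiter.record(h2, 200)

# A fourth restart inside the window is refused.
IO.inspect RestartLimiter.record(h3, 300)

# Much later the old restarts have expired, so restarting is allowed again.
IO.inspect RestartLimiter.record(h3, 10_000)
```

In the manual Parent, such a check could run before each run_child call, with the parent exiting on :give_up instead of looping forever.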

Make Supervisor do heavy-lifting

OTP’s Supervisor module lets you define much more powerful logic in a declarative way. You describe which children to run, how to start them, and which strategy to use for restarting them. It does the rest by itself.

In this example I modified the Child module to match the Supervisor’s expectations: it should export a start_link function that returns {:ok, pid} on success.

Run this example and you will see that it uses exactly the same restart strategy as our manually written code.

defmodule Parent do
  use Supervisor

  def start_link(limits) do
    Supervisor.start_link(__MODULE__, limits)
  end

  def init(limits) do
    children = Enum.map(limits, fn(limit_num) ->
      worker(Child, [limit_num], [id: limit_num, restart: :permanent])
    end)

    supervise(children, strategy: :one_for_one)
  end
end

defmodule Child do
  def start_link(limit) do
    pid = spawn_link(__MODULE__, :init, [limit])
    {:ok, pid}
  end

  def init(limit) do
    IO.puts "Start child with limit #{limit} pid #{inspect self()}"
    loop(limit)
  end

  def loop(0), do: :ok
  def loop(n) when n > 0 do
    IO.puts "Process #{inspect self()} counter #{n}"
    Process.sleep 500
    loop(n-1)
  end
end


Parent.start_link([2,3,5])

Process.sleep 10_000

worker() creates correct worker specifications and supervise() wraps them, together with the restart strategy, into the value that init returns to the Supervisor.

When you call worker(), you just fill in a data structure that declares what the state of the supervisor should be. No code is run yet, and no pids are known yet.

One more time: no code is executed on the worker() call and no pids are known. This is usually the source of major confusion.

Only once init has returned the result of supervise() are the child processes created, one by one.
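The point that a specification is only data may be easier to see with the map-based child specifications available in Elixir 1.5 and later, where the Supervisor.Spec helpers such as worker() are marked as deprecated. This is a sketch under that assumption; the Counter module and the {:counter, limit} ids are hypothetical stand-ins for the article's Child.

```elixir
defmodule Counter do
  # Hypothetical worker: counts down from `limit`, one step per second,
  # and then exits normally.
  def start_link(limit) do
    {:ok, spawn_link(fn -> loop(limit) end)}
  end

  defp loop(0), do: :ok
  defp loop(n) when n > 0 do
    Process.sleep(1_000)
    loop(n - 1)
  end
end

# A child specification is plain data: building it starts nothing.
children =
  for limit <- [2, 3, 5] do
    %{id: {:counter, limit}, start: {Counter, :start_link, [limit]}, restart: :permanent}
  end

IO.inspect children, label: "specs (no pids yet)"

# Only now are the children started, in list order.
{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

IO.inspect Supervisor.count_children(sup), label: "counts"
```

Inspecting the list before Supervisor.start_link shows only maps and tuples; pids exist only after the supervisor has started the children.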

Experiments with the code

  • Try to replace Parent.start_link([2,3,5]) with Parent.start_link([-2,3,5]) and run it. You will see that instead of looping forever, the Supervisor tries to run the code 4 times and then exits, effectively shutting down the script as well.
    That is what a Supervisor is good for: if it concludes that the problem cannot be fixed by restarting, it just exits and hopes that the supervisor one level above it can restart components to get the whole system into the right state again. I’ll show the beauty of putting supervisors under supervisors, and of supervision trees, in some later posts.
  • Try to replace supervise(children, strategy: :one_for_one) with supervise(children, strategy: :one_for_all). You will observe that as soon as the first process, the one that counts down from 2, exits, all counter processes are restarted.
  • Try to replace supervise(children, strategy: :one_for_one) with supervise(children, strategy: :rest_for_one) and also change Parent.start_link([2,3,5]) to Parent.start_link([5,2,3]).
    You will observe that the moment the process with limit 2 exits, it also forces a restart of the process with limit 3. And when the first process, with limit 5, exits, it also restarts the processes with limits 2 and 3.
    This is useful if processes later in the children specification depend on the correct state of earlier ones: if an earlier process exits, all processes after it in the children specification should be restarted.
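The "4 times" in the first experiment follows from the Supervisor's default restart intensity, which allows at most 3 restarts within 5 seconds. The sketch below makes that limit explicit through the max_restarts and max_seconds options and watches the supervisor give up. The Crasher module is hypothetical, and the child specification uses the map form from Elixir 1.5 and later rather than worker().

```elixir
defmodule Crasher do
  # Hypothetical worker that exits abnormally right after it starts.
  def start_link do
    {:ok, spawn_link(fn -> exit(:boom) end)}
  end
end

# Trap exits so the supervisor's shutdown arrives as a message
# instead of terminating this script.
Process.flag(:trap_exit, true)

children = [%{id: :crasher, start: {Crasher, :start_link, []}}]

# Allow at most 2 restarts within 5 seconds.
{:ok, sup} =
  Supervisor.start_link(children, strategy: :one_for_one, max_restarts: 2, max_seconds: 5)

receive do
  {:EXIT, ^sup, reason} -> IO.inspect reason, label: "supervisor exited"
end
```

Expect some error reports in the log while the child keeps crashing. Raising max_restarts or shortening max_seconds makes the supervisor more tolerant.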

Make Supervisor and GenServers play nice together

Re-creating the Child module as a GenServer behaviour implementation is left as an exercise for the reader. It should take 1–2 hours at most.

Hints:

You cannot block for long in GenServer handlers or in init, so you need to do the waiting asynchronously.

In init, use :timer.send_after (or the more efficient :erlang.send_after) to send the message :count_down to self() after one second. Then you’ll need to add handle_info(:count_down, state) to handle that message.

Depending on the state (in which you should hold the counter value) you may respond differently in handle_info: {:noreply, new_state} to continue operation, or {:stop, :normal, new_state} to stop the GenServer. Do not forget to call :timer.send_after again if the counter is still above zero and you need more :count_down messages to reach zero.
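The timer pattern from the hints can be sketched on its own, without touching Child or the Supervisor, so the exercise itself stays open. The Ticker module, the :tick message, and the interval argument are hypothetical, and the sketch uses Process.send_after, the Elixir counterpart of :erlang.send_after. It assumes count is a positive integer.

```elixir
defmodule Ticker do
  use GenServer

  # Hypothetical example module: sends itself a :tick every `interval` ms
  # and stops normally after `count` ticks (count must be positive).
  def start_link(count, interval \\ 100) do
    GenServer.start_link(__MODULE__, {count, interval})
  end

  @impl true
  def init({count, interval}) do
    # Schedule the first message and return at once, so init never blocks.
    Process.send_after(self(), :tick, interval)
    {:ok, {count, interval}}
  end

  @impl true
  def handle_info(:tick, {1, interval}) do
    IO.puts "last tick"
    {:stop, :normal, {0, interval}}
  end

  def handle_info(:tick, {count, interval}) when count > 1 do
    IO.puts "tick, #{count - 1} left"
    # Ask for the next message; the handler itself returns immediately.
    Process.send_after(self(), :tick, interval)
    {:noreply, {count - 1, interval}}
  end
end

{:ok, pid} = Ticker.start_link(3, 50)
ref = Process.monitor(pid)

# Wait until the GenServer has stopped.
receive do
  {:DOWN, ^ref, :process, ^pid, reason} -> IO.inspect reason, label: "ticker stopped"
end
```

Each handler does a tiny amount of work and schedules the next message, so the process stays responsive to other calls between ticks.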

Bonus exercise

  • Return to the example in “Simple Parent → Children relationship”, add Process.flag :trap_exit, true in the child processes too, and make Parent exit after starting the Child processes. For example, make it crash with a wrong match.

    Check whether there are messages in the Child processes’ mailboxes:

    receive do
      msg -> IO.inspect msg
    after 0 ->
      :ok
    end

    This will retrieve a message if there is one; if not, it will immediately continue execution without blocking the process.

    You should learn that linking actually works in both directions.

  • Rewrite the “Simple Parent → Children relationship” example using spawn instead of spawn_link, together with Process.monitor. A monitor allows you to keep an eye on a pid even from processes which did not start it. Process.monitor messages are different, so you’ll need to figure them out by yourself.
  • In the example “Make Supervisor do heavy-lifting”, change the last line to Parent.start_link([2,2,2]) and figure out what the source of the problem is. Hint: you may want to use Enum.reduce instead of Enum.map :)
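The non-blocking receive from the first bonus exercise can be wrapped in a small helper that empties the mailbox. This is only a sketch; the Mailbox module and its drain function are hypothetical names.

```elixir
defmodule Mailbox do
  # Hypothetical helper: returns all messages currently in the caller's
  # mailbox, oldest first, without ever blocking.
  def drain(acc \\ []) do
    receive do
      msg -> drain([msg | acc])
    after
      # A timeout of 0 means: do not wait if the mailbox is empty.
      0 -> Enum.reverse(acc)
    end
  end
end

send(self(), :first)
send(self(), :second)

IO.inspect Mailbox.drain()
IO.inspect Mailbox.drain()
```

The first call should return both messages in the order they were sent; the second finds the mailbox empty and returns an empty list right away.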

What’s next?

In the next part I will explain how to convert the stack machine from part 2 into a GenServer and Supervisor. Stay tuned and subscribe to get updates when it’s out.

Also check out the previous parts: 1, 2, 3, 4.

Have questions? Write responses to this article 😺

About me

I’m Gaspar Chilingarov. I facilitate DevOps transition, help move legacy applications to the cloud, and write high-performance Elixir apps.

Need help with your Elixir app or want to prototype your next microservice in Elixir? DM me on Twitter or GitHub.

You can connect with me on Twitter, Facebook, LinkedIn and GitHub.
