Supervisors and workers in 10 minutes
Comfortable learning curve for Elixir — part 5
TL;DR: `Supervisor` provides a standard way to declaratively specify how processes should be started, monitored for failures and restarted. Do the exercises in the article to get a real feeling for how process linking, monitoring and crashing work. ❤❤❤ :)
Supervision is another usage pattern in Erlang and Elixir which takes its roots from Erlang's telecom past: the supervisor and the 'let it crash' ideology. It is less risky to have one misbehaving process crash and be properly restarted than to let it take the whole system into an unknown, potentially wrong state.
All code samples can be found in my github repo.
Linked processes and crashes
Earlier I mentioned that processes can be linked at start, and that one crashing process will take down all linked processes as well. On its own this has little use, because it would simply stop the application. That is not the desired outcome, so there is a way to catch crashes and act on them.
But let's start slow and gradually add complexity. Below I want to show several examples of how linked processes work.
Save this example in a .exs file and execute it with the elixir command.
```elixir
IO.puts "before"
spawn_link fn() -> :ok end
Process.sleep 100
IO.puts "after"
```
You will see it print

```
before
after
```

because even though the linked process exited, it exited normally, without a crash.
Let's run the following example with crashing code and see what happens:

```elixir
IO.puts "before"
spawn_link fn() -> 1 = 2 end
Process.sleep 100
IO.puts "after"
```
The output will contain only `before`, because the main process with the command `IO.puts "after"` was also automatically terminated.

```
before
** (EXIT from #PID<0.74.0>) an exception was raised:
    ** (MatchError) no match of right hand side value: 2
        01_linked_crash.exs:3: anonymous fn/0 in :elixir_compiler_0.__FILE__/1

12:16:29.067 [error] Process #PID<0.77.0> raised an exception
** (MatchError) no match of right hand side value: 2
    01_linked_crash.exs:3: anonymous fn/0 in :elixir_compiler_0.__FILE__/1
```
There is a purpose to `Process.sleep 100` in this code. If you remove it, the second example can sometimes also print `after`, because `spawn_link` may be slow enough that the failing match occurs only after the second `IO.puts` has executed.
So if all linked processes terminate when just one process in the chain terminates with an error, how do we recover from errors at all?
Wait, there is hope!
And it's called `:trap_exit`. That is actually why there was a site called trapexit.org :)
Erlang/Elixir processes have flags which control their behaviour, and one of them is `:trap_exit`. When it is set to `true`, a crash of a linked process does not terminate the process; instead the process receives a message.
Let’s try it out.
```elixir
Process.flag :trap_exit, true
IO.puts "before"
spawn_link fn() -> :ok end
Process.sleep 100
IO.puts "after"

receive do
  msg -> IO.inspect msg, label: "received message"
end
```
Run it and it will output:

```
before
after
received message: {:EXIT, #PID<0.77.0>, :normal}
```

So the parent process received a message telling it that child `#PID<0.77.0>` exited with status `:normal`.
Now modify the code to make it crash and run it again. The output will look like this:
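For reference, a minimal crashing variant could look like this (the failing match `1 = 2` on line 5 matches the stack trace in the output):

```elixir
Process.flag :trap_exit, true
IO.puts "before"

# This match fails and crashes the spawned process.
spawn_link fn() -> 1 = 2 end

Process.sleep 100
IO.puts "after"

receive do
  msg -> IO.inspect msg, label: "received message"
end
```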
```
before

12:28:03.706 [error] Process #PID<0.77.0> raised an exception
** (MatchError) no match of right hand side value: 2
    02_linked_crash_trap_exit.exs:5: anonymous fn/0 in :elixir_compiler_0.__FILE__/1

after
received message: {:EXIT, #PID<0.77.0>,
 {{:badmatch, 2},
  [{:elixir_compiler_0, :"-__FILE__/1-fun-0-", 0,
    [file: '02_linked_crash_trap_exit.exs', line: 5]}]}}
```
You now see the detailed crash message and, as in the previous example, also `after`. It means the parent process survived the child's crash.
Parenting made easy
Simple Parent → Children relationship
Let's make a more complex example and write code which runs one parent and 3 children. Each child process will count its counter down by 1 every half second and exit when it reaches zero. When a child exits, the parent will restart that specific child.
Here is the code for a script that will just run the child processes and show their exit messages:
```elixir
defmodule Parent do
  def spawn_link(limits) do
    spawn_link(__MODULE__, :init, [limits])
  end

  def init(limits) do
    Process.flag :trap_exit, true
    Enum.each(limits, fn(limit_num) ->
      spawn_link(Child, :init, [limit_num])
    end)
    loop()
  end

  def loop() do
    receive do
      msg ->
        IO.puts "Parent got message: #{inspect msg}"
        loop()
    end
  end
end

defmodule Child do
  def init(limit) do
    loop(limit)
  end

  def loop(0), do: :ok
  def loop(n) when n > 0 do
    IO.puts "Process #{inspect self()} counter #{n}"
    Process.sleep 500
    loop(n-1)
  end
end

Parent.spawn_link([2,3,5])
# Give the slowest child (limit 5 × 500 ms) time to finish.
Process.sleep 3_000
```
Run it and you will see that after the processes reach `counter == 0` they exit and the parent receives an exit message for each of them.
Manually implement children restart
Now let's add supervision and child restarts. We should keep track of which child PID corresponds to which limit, so that we can later restart it correctly. This means passing information from `Parent.init` to `Parent.loop`. For ease of lookup we will store this information in a map.
```elixir
defmodule Parent do
  def spawn_link(limits) do
    spawn_link(__MODULE__, :init, [limits])
  end

  def init(limits) do
    Process.flag :trap_exit, true
    children_pids = Enum.map(limits, fn(limit_num) ->
      pid = run_child(limit_num)
      {pid, limit_num}
    end) |> Enum.into(%{})
    loop(children_pids)
  end

  def loop(children_pids) do
    receive do
      {:EXIT, pid, _} = msg ->
        IO.puts "Parent got message: #{inspect msg}"
        {limit, children_pids} = pop_in children_pids[pid]
        new_pid = run_child(limit)
        children_pids = put_in children_pids[new_pid], limit
        IO.puts "Restarted child #{inspect pid} (limit #{limit}) with new pid #{inspect new_pid}"
        loop(children_pids)
    end
  end

  def run_child(limit) do
    spawn_link(Child, :init, [limit])
  end
end

defmodule Child do
  def init(limit) do
    IO.puts "Start child with limit #{limit} pid #{inspect self()}"
    loop(limit)
  end

  def loop(0), do: :ok
  def loop(n) when n > 0 do
    IO.puts "Process #{inspect self()} counter #{n}"
    Process.sleep 500
    loop(n-1)
  end
end

Parent.spawn_link([2,3,5])
Process.sleep 10_000
```
The most interesting part here is `Parent.loop`: it removes the old mapping from `children_pids` and then re-adds a new one under the newly spawned pid.
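If `pop_in`/`put_in` are new to you, here is a minimal standalone illustration of what `Parent.loop` does with the map (atom keys instead of pids, purely for readability; `:old_pid` and `:new_pid` are made-up names):

```elixir
children = %{old_pid: 2, other: 3}

# Remove the exited child's entry and get its limit back.
{limit, children} = pop_in children[:old_pid]

# Re-add the same limit under the key of the freshly spawned child.
children = put_in children[:new_pid], limit

IO.inspect children    # the map now holds new_pid: 2 and other: 3
```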
This was a simplified example of how the OTP supervisor works. The actual logic is much more complex and easy to get wrong, so it is better to use battle-proven code instead.
Common problems, and why you should not reinvent the wheel
Try to change the limits in the last line from `[2,3,5]` to `[-2,3,5]` and run the script again.
You will immediately recognize the problem: one of the children crashes constantly and creates CPU load as well.
The correct behaviour would be to restart the problematic child a few times and, if the problem does not go away, do something about it rather than restarting it again and again.
This is when the OTP Supervisor comes in handy.
Make Supervisor do the heavy lifting
OTP's `Supervisor` module allows you to define much more powerful logic in a declarative way. You describe which children to run, how to start them and what the restart strategy will be, and it does the rest by itself.
In this example I modified the `Child` module to match `Supervisor`'s expectations: it should export a `start_link` function which returns `{:ok, pid}` on success.
Run this example and you will see that it is exactly the same restart strategy as we had in the manually written code.
```elixir
defmodule Parent do
  use Supervisor

  def start_link(limits) do
    Supervisor.start_link(__MODULE__, limits)
  end

  def init(limits) do
    children = Enum.map(limits, fn(limit_num) ->
      worker(Child, [limit_num], [id: limit_num, restart: :permanent])
    end)
    supervise(children, strategy: :one_for_one)
  end
end

defmodule Child do
  def start_link(limit) do
    pid = spawn_link(__MODULE__, :init, [limit])
    {:ok, pid}
  end

  def init(limit) do
    IO.puts "Start child with limit #{limit} pid #{inspect self()}"
    loop(limit)
  end

  def loop(0), do: :ok
  def loop(n) when n > 0 do
    IO.puts "Process #{inspect self()} counter #{n}"
    Process.sleep 500
    loop(n-1)
  end
end

Parent.start_link([2,3,5])
Process.sleep 10_000
```
`worker()` builds the correct worker specification, and `supervise()` assembles those specifications into a supervisor specification.
When you call `worker()` you just fill a data structure and declare what the state of the supervisor should be; no code is run yet and no pids are known yet.
One more time: no code is executed on the `worker()` call and no pids are known. This is usually a source of major confusion.
Only when `Supervisor.start_link` processes the specification returned from `init` are the child processes created, one by one.
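Note that `worker/3` and `supervise/2` come from the old `Supervisor.Spec` API, which was deprecated in Elixir 1.5. On newer Elixir versions you would return child spec maps from `init/1` instead, but the idea is identical: the specification is plain data. A sketch of the equivalent modern spec (assuming the same `Child` module):

```elixir
# The same declaration as worker(Child, [2], id: 2, restart: :permanent),
# expressed as a modern child spec map. Building it runs no code and
# spawns no process; only the supervisor that consumes it does.
spec = %{
  id: 2,
  start: {Child, :start_link, [2]},
  restart: :permanent
}
```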
Experiments with the code
- Try to replace `Parent.start_link([2,3,5])` with `Parent.start_link([-2,3,5])` and run. You will see that instead of an infinite loop, `Supervisor` tries to run the code 4 times and then exits, effectively shutting down the script as well. That is what `Supervisor` is good for: if it understands that the problem cannot be fixed by restarting, it just exits and hopes that the supervisor one level above it can restart components to get the whole system into a correct state again. I'll show the beauty of putting supervisors under supervisors, and of supervision trees, in some later posts.
- Try to replace `supervise(children, strategy: :one_for_one)` with `supervise(children, strategy: :one_for_all)`. You will observe that as soon as the first process, the one which counts down from 2, exits, all counter processes are restarted.
- Try to replace `supervise(children, strategy: :one_for_one)` with `supervise(children, strategy: :rest_for_one)` and also change the start line to `Parent.start_link([5,2,3])`. You will observe that the moment the process with limit 2 exits, it also forces a restart of the process with limit 3. And when the first process, with limit 5, exits, it restarts the processes with limits 2 and 3 as well.

This is useful if processes later in the children specification depend on the correct state of the previous ones: if a previous process exited, all following processes in the children specification should be restarted.
Make Supervisor and GenServers play nice together
This is an exercise left for the reader: re-create the `Child` module as a `GenServer` behaviour implementation. It should take 1–2 hours at maximum.
Hints:
You cannot block for long in `GenServer` handlers or in `init`, so you need to do the counting asynchronously.
In `init`, use `:timer.send_after` (or the more efficient `:erlang.send_after`) to send the message `:count_down` to `self()` after one second. Then you'll need to add a `handle_info(:count_down, state)` clause to handle that message.
Depending on the state (in which you should hold the counter value) you may respond differently in `handle_info`: `{:noreply, new_state}` to continue operation or `{:stop, :normal, new_state}` to stop the `GenServer`. Do not forget to call `:timer.send_after` again if the limit is still over zero and you need more `:count_down` messages to reach zero.
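One possible sketch following these hints (try the exercise yourself before peeking; the names and timings here are my choices, not the only valid ones):

```elixir
defmodule Child do
  use GenServer

  def start_link(limit) do
    GenServer.start_link(__MODULE__, limit)
  end

  # init must not block, so we only schedule the first :count_down message.
  def init(limit) do
    :erlang.send_after(1_000, self(), :count_down)
    {:ok, limit}
  end

  # The counter value lives in the GenServer state.
  def handle_info(:count_down, n) when n > 1 do
    IO.puts "Process #{inspect self()} counter #{n}"
    :erlang.send_after(1_000, self(), :count_down)
    {:noreply, n - 1}
  end

  def handle_info(:count_down, n) do
    IO.puts "Process #{inspect self()} counter #{n}"
    # Limit reached: stop normally and let the supervisor's
    # restart policy decide what happens next.
    {:stop, :normal, 0}
  end
end
```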
Bonus exercise
- Return to the example in “Simple Parent → Children relationship”, add `Process.flag :trap_exit, true` in the child processes too, and make `Parent` exit after starting the `Child` processes, for example by crashing it with a wrong match. Then check if there are messages in the `Child` processes' mailboxes:
```elixir
receive do
  msg -> IO.inspect msg
after
  0 -> :ok
end
```
This will try to retrieve a message if there is one and, if there is none, will immediately continue execution without blocking the process.
You should learn from this that linking actually works in both directions.
- Rewrite the “Simple Parent → Children relationship” example using `spawn` instead of `spawn_link` together with `Process.monitor`. A monitor allows you to keep an eye on a PID even from processes which did not start it. `Process.monitor` messages are different from exit signals, so you'll need to figure them out by yourself.
- In the example “Make Supervisor do the heavy lifting”, change the start line to `Parent.start_link([2,2,2])` and figure out what the source of the problem is. Hint: you may want to use `Enum.reduce` instead of `Enum.map` :)
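If you get stuck on the `Process.monitor` exercise, here is a minimal standalone demo of how monitoring differs from linking (a hint, not the full rewrite):

```elixir
# The watcher is NOT linked to the watched process, so it survives in
# any case; instead of an exit signal it receives a :DOWN message.
pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, reason} ->
    IO.puts "monitored #{inspect pid} went down: #{inspect reason}"
end
```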
What’s next?
In the next part I will explain how to convert the stack machine from part 2 to a `GenServer` under a `Supervisor`. Stay tuned and subscribe to get updates when it's out.
Also check out the previous parts: 1 2 3 4.
Have questions? Write responses to this article 😺
About me
I'm Gaspar Chilingarov. I facilitate DevOps transitions, help move legacy applications to the cloud and write high-performance Elixir apps.
Need help with your Elixir app, or want to prototype your next microservice in Elixir? DM me on Twitter or GitHub.
You can connect with me on Twitter, Facebook, LinkedIn and GitHub.
Found this post useful? Kindly tap the ❤ button below! :) Let's spread the word about Elixir.