The Extinction of the Dodos (OTP style)

Brujo Benavides
Erlang Battleground
6 min read · Nov 1, 2016

This story shows a super weird OTP edge case detected by Marc Sugiyama in worker_pool. It took me a while to come up with a simplified version of it. I tried to make it as funny as I could… I failed. Keep reading anyway: Marc’s discovery is really cool.

Dodos — Ice Age (2002) — Do you feel old already?

The Setup

To experience the issue found by Marc in WPool, you need a full-blown OTP application with a proper supervision tree. So, I’ve built one just for this blog post. It represents an extremely simple world with just two animal species: dodos and cockroaches.

All functionality is provided through the module called world. Let’s start by creating the world (i.e. starting our app):

1> world:start().
=PROGRESS REPORT==== 29-Oct-2016::00:31:18 ===
application: world
started_at: nonode@nohost
2>

Tip: To get progress reports, I just added sasl as a dependency of my app.
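
In case you want to reproduce that, a minimal app resource file along these lines is enough (the module name world_app, the description and the version are placeholders, not necessarily what the actual repo uses):

%% world.app.src (sketch): listing sasl under applications is what gets
%% it started together with our app, and with it the progress reports.
{application, world,
 [{description, "A world of dodos and cockroaches"},
  {vsn, "0.1.0"},
  {registered, []},
  {mod, {world_app, []}},
  {applications, [kernel, stdlib, sasl]}]}.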

Since the world starts with no dodos and no cockroaches, you can add new ones this way:

2> world:add_dodos(10).
ok
3> world:add_cockroaches(7).
ok

Each animal is represented by a process, so if you ask for the list of all dodos, you’ll get a list of 10 pids:

5> world:dodos().
[<0.148.0>,<0.149.0>,<0.150.0>,<0.151.0>,<0.152.0>,
<0.153.0>,<0.154.0>,<0.155.0>,<0.156.0>,<0.157.0>]
6>
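
For reference, here is a sketch of what the world module could look like: basically a thin wrapper around the application and supervisor APIs. The helper names all/1 and add_animals/2 do show up in stack traces later in this post, but their bodies here are just a plausible reconstruction:

-module(world).
-export([start/0, dodos/0, cockroaches/0, add_dodos/1, add_cockroaches/1]).

%% Start the whole application (sasl included).
start() -> application:ensure_all_started(world).

%% Each species supervisor is registered under the species name,
%% so listing the animals is just asking it for its children.
dodos() -> all(dodos).
cockroaches() -> all(cockroaches).

add_dodos(N) -> add_animals(dodos, N).
add_cockroaches(N) -> add_animals(cockroaches, N).

all(Species) ->
  [Pid || {_Id, Pid, _Type, _Mods} <- supervisor:which_children(Species)].

add_animals(Species, N) ->
  lists:foreach(
    fun(_) -> {ok, _} = supervisor:start_child(Species, []) end,
    lists:seq(1, N)).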

As you probably imagine, those processes are supervised. If you kill one, another one will take its place:

6> exit(hd(world:dodos()), kill).
true
7>
=SUPERVISOR REPORT==== 29-Oct-2016::00:41:41 ===
Supervisor: {local,dodos}
Context: child_terminated
7> world:dodos().
[<0.149.0>,<0.198.0>,<0.150.0>,<0.151.0>,<0.152.0>,
<0.153.0>,<0.154.0>,<0.155.0>,<0.156.0>,<0.157.0>]
8>
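
The animal module itself is not that interesting; the only thing that matters for this story is its start_link(Species) signature, which you can see in the supervisor reports later on. A minimal placeholder version, just so the example is complete:

-module(animal).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Each animal is just a gen_server that knows which species it belongs to.
start_link(Species) -> gen_server:start_link(?MODULE, Species, []).

init(Species) -> {ok, Species}.

handle_call(_Request, _From, Species) -> {reply, ok, Species}.

handle_cast(_Msg, Species) -> {noreply, Species}.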

Let’s use observer to see what our supervision tree looks like…

The tree of life!

As you can see, all dodos are supervised by the same supervisor. All cockroaches are supervised by the same supervisor as well. In fact, if you check the code for world_sup, you can see that both supervisors are actually instances of the same module: animals_sup.

Now, world also comes with two deadly functions: kill_dodos and kill_cockroaches. These functions let us test what happens when you eradicate the entire population of a single species. Let’s try first with the cockroaches:

9> world:cockroaches().
[<0.160.0>,<0.161.0>,<0.162.0>,<0.163.0>,<0.164.0>,
<0.165.0>,<0.159.0>]
10> world:kill_cockroaches().
[true,true,true,true,true,true,true]
=SUPERVISOR REPORT==== 29-Oct-2016::00:53:33 ===
Supervisor: {local,cockroaches}
Context: child_terminated
Reason: killed
11> world:cockroaches().
[]
12> world:add_cockroaches(3).
ok
13> world:cockroaches().
[<0.270.0>,<0.271.0>,<0.272.0>]

As you can see, we killed them all. But the cockroaches were not gone for good: we could always add more later, which is expected. Let’s see if that’s true for the dodos as well:

15> world:dodos().
[<0.149.0>,<0.198.0>,<0.150.0>,<0.151.0>,<0.152.0>,
<0.153.0>,<0.154.0>,<0.155.0>,<0.156.0>,<0.157.0>]
16> world:kill_dodos().
[true,true,true,true,true,true,true,true,true,true]
=SUPERVISOR REPORT==== 29-Oct-2016::00:58:04 ===
Supervisor: {local,dodos}
Context: child_terminated
Reason: killed
...
17> world:dodos().
** exception exit: {noproc,{gen_server,call,[dodos,which_children,infinity]}}
in function gen_server:call/3 (gen_server.erl, line 212)
in call from world:all/1

18> world:add_dodos(1).
** exception exit: {noproc,{gen_server,call,[dodos,{start_child,[]},infinity]}}
in function gen_server:call/3 (gen_server.erl, line 212)
in call from world:add_animals/2

Once the last dodo is killed… no more dodos can be created. Look at how our world ended up in observer:

Cockroaches… Cockroaches everywhere!

What’s going on here?

I will repeat my usual message one more time, because I still think it’s useful: don’t read below this paragraph without first trying to understand the issue on your own. Go and try it, I’ll be right here…

OK, now that you tried on your own, let’s see if you found the same thing that Marc discovered…

Restart Intensity

Let’s talk a little bit about supervisor configuration first. In particular, I would like to show you this paragraph from the Erlang docs:

To prevent a supervisor from getting into an infinite loop of child process terminations and restarts, a maximum restart intensity is defined using two integer values specified with keys intensity and period in the sup_flags() map. Assuming the values MaxR for intensity and MaxT for period, then, if more than MaxR restarts occur within MaxT seconds, the supervisor terminates all child processes and then itself. intensity defaults to 1 and period defaults to 5.

Let’s check our animals_sup module:
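
Here is a sketch of it; the supervisor flags and the child spec are the parts that matter (whether it actually uses simple_one_for_one is an assumption of mine and not important for this story):

-module(animals_sup).
-behaviour(supervisor).

-export([start_link/1]).
-export([init/1]).

%% One instance per species, registered under the species name.
start_link(Species) ->
  supervisor:start_link({local, Species}, ?MODULE, Species).

init(Species) ->
  SupFlags = #{strategy => simple_one_for_one,
               intensity => 5,
               period => 1},
  ChildSpecs = [#{id => animal,
                  start => {animal, start_link, [Species]},
                  restart => permanent,
                  shutdown => 5000,
                  type => worker}],
  {ok, {SupFlags, ChildSpecs}}.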

As you can see, we used 5 for intensity and 1 for period. That means that if more than 5 restarts happen within a single second, the supervisor itself will die. We had 10 dodos and we killed them all; that clearly demands more than 5 restarts in a single second.

So, that’s why the dodos supervisor died! Problem solved! Right?

Not so fast… we had 7 cockroaches, too, and we also killed them all. That also required more than 5 restarts within a second. So why was this supervisor restarted while the one for the dodos was not?

Restart Strategy

We determined that both supervisors (cockroaches and dodos) died, but only one was restarted. The process in charge of restarting them is world_sup. Let’s check it out:
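
A sketch of it (again, only the restart values really matter; the rest is a plausible reconstruction):

-module(world_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
  supervisor:start_link({local, world_sup}, ?MODULE, []).

init([]) ->
  %% flags other than the strategy are left at their defaults (an assumption)
  SupFlags = #{strategy => one_for_one},
  ChildSpecs = [#{id => dodos_sup,
                  start => {animals_sup, start_link, [dodos]},
                  restart => transient,      %% <- dodos
                  shutdown => 5000,
                  type => supervisor},
                #{id => cockroaches_sup,
                  start => {animals_sup, start_link, [cockroaches]},
                  restart => permanent,      %% <- cockroaches
                  shutdown => 5000,
                  type => supervisor}],
  {ok, {SupFlags, ChildSpecs}}.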

Can you spot the difference?

Both supervisors are defined in the ChildSpecs part of the supervisor definition returned by init/1. The difference between them is in the restart field. While dodos are transient, cockroaches are permanent.

But, what does that mean? Let’s check the Erlang docs again:

restart defines when a terminated child process must be restarted. A permanent child process is always restarted. A temporary child process is never restarted (even when the supervisor’s restart strategy is rest_for_one or one_for_all and a sibling’s death causes the temporary process to be terminated). A transient child process is restarted only if it terminates abnormally, that is, with another exit reason than normal, shutdown, or {shutdown,Term}.

Ok, then, since cockroaches is a permanent supervisor, it will always be restarted. That makes sense.

Now, what about the dodos? Well, that supervisor is terminated because it reaches its maximum restart intensity. That looks a lot like an abnormal termination. Nevertheless, it’s not restarted. Why?

SASL to the Rescue!

Remember how I told you I added sasl as a dependency of my app? Well, sasl not only prints out progress reports, it also displays crash and supervisor reports when processes are terminated. I intentionally removed them from the console snippets above, but let’s see what they have to say now that we know what we’re looking for…

1> world:start().
...
=PROGRESS REPORT==== 29-Oct-2016::01:30:30 ===
application: world
started_at: nonode@nohost
2> world:add_dodos(6).
ok
3> world:kill_dodos().
=SUPERVISOR REPORT==== 29-Oct-2016::01:32:19 ===
Supervisor: {local,dodos}
Context: child_terminated
Reason: killed
...[5 similar child_terminated messages]...
[true,true,true,true,true,true]
=SUPERVISOR REPORT==== 29-Oct-2016::01:32:19 ===
Supervisor: {local,dodos}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.127.0>},
{id,animal},
{mfargs,{animal,start_link,[dodos]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
4>

We have 5 processes killed with reason killed. Then, when the 6th one dies, the supervisor gives up with reason reached_max_restart_intensity. But we have no report of how the dodos supervisor itself was terminated. Considering the docs shown above, it seems like the reason for the termination of the supervisor was normal, shutdown or {shutdown, Reason}: SASL shows no reports about it because, as far as it’s concerned, the supervisor terminated normally.

Let’s look at the logs for cockroaches to see if we can find more info…

4> world:add_cockroaches(6).
ok
5> world:kill_cockroaches().
[true,true,true,true,true,true]
=SUPERVISOR REPORT==== 29-Oct-2016::01:48:41 ===
Supervisor: {local,cockroaches}
Context: child_terminated
Reason: killed
...[6 child_terminated messages]...
=SUPERVISOR REPORT==== 29-Oct-2016::01:48:41 ===
Supervisor: {local,cockroaches}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.149.0>},
{id,animal},
{mfargs,{animal,start_link,[cockroaches]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,worker}]
=SUPERVISOR REPORT==== 29-Oct-2016::01:48:41 ===
Supervisor: {local,world_sup}
Context: child_terminated
Reason: shutdown
Offender: [{pid,<0.125.0>},
{id,cockroaches_sup},
{mfargs,{animals_sup,start_link,[cockroaches]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,supervisor}]
...
6>

In this case, since the cockroaches supervisor is permanent, which means it has to be restarted regardless of the reason it was terminated for, we do have a report from sasl. That report lets us see that the reason for the cockroaches supervisor’s termination is, in fact, shutdown.

That may seem unexpected at first glance. Considering that one of the goals of a supervision tree is to escalate issues, so that higher levels in the tree can deal with problems the lower levels couldn’t handle, I would’ve expected world_sup to restart the dodos supervisor after dodos determined it was unable to deal with that many child restarts and killed itself.

The OTP team probably had a good reason for this implementation. I just don’t know it. So, if you know the reason, let me know in the comments below, and remember:

Don’t use transient supervisors if you want your parent supervisor to restart them when they reach their maximum restart intensity.
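
In the world_sup sketch above, that just means declaring the dodos supervisor as permanent, exactly like the cockroaches one:

%% A permanent child spec gets restarted no matter why it terminated,
%% including reached_max_restart_intensity.
#{id => dodos_sup,
  start => {animals_sup, start_link, [dodos]},
  restart => permanent,   %% instead of transient
  shutdown => 5000,
  type => supervisor}

With that change, killing every single dodo leaves you with a fresh, empty dodos supervisor, just like it happens with the cockroaches.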
