The Unstoppable Exception

This will be the first of hopefully many blog posts, all of them devoted to presenting particular curiosities and strange things I’ve found in my life as an Erlang developer. I would like to start with one that I’ve faced at least 3 different times.


Debugging >> Fixing

How many times have you said or heard the following sentence?

It took me one whole day to debug the issue, but then I fixed it in literally 10 seconds with just one line of code.

In my ~15 years of development work, I’ve said that countless times. And I found out that it’s something that almost all developers can relate to. The harder it is to debug an issue, the simpler the solution it requires.

Let me tell you now about one particular Erlang issue which led me to that sentence multiple times…


Chris Pine / Denzel Washington in Unstoppable (2010)

The Issue Itself

In most Erlang/OTP applications, if you’re going to have gen_servers running, you usually start them as supervisor children. But sometimes you actually need to start those gen_servers by hand and deal with the result of gen_server:start_link(…).

The Docs

That’s when you go to the Erlang docs and read the following about the relationship between start_link/3,4 and init/1:

If Module:init/1 fails with Reason, the function returns {error,Reason}. If Module:init/1 returns {stop,Reason} or ignore, the process is terminated and the function returns {error,Reason} or ignore, respectively.

The Dumbest Server

To showcase my problem, let me show you the dumbest possible way to duplicate a number in Erlang while also showing you the simplest possible version of a gen_server implementation:

So, when we want to duplicate a number, we start a new gen_server, we keep the result in its state and then we use sys:get_state/1 to retrieve it.
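The listing isn’t reproduced here, so here’s a minimal sketch of what such a module could look like. Treat it as a reconstruction, not the exact file: the only things taken from the post are the module name dumb_math, the dup/1 and init/1 functions, and the fact that the pattern match sits on line 5 and init/1 on line 8. That is why the comments are at the end of the lines.

```erlang
-module(dumb_math).
-export([dup/1, init/1]).

dup(X) ->
    {ok, Pid} = gen_server:start_link(?MODULE, X, []), % start a server whose state is X * 2
    sys:get_state(Pid).                                % read that state back

init(X) -> {ok, X * 2}.                                % badarith if X is not a number
```

Note that the server is never stopped, so every call leaves a linked process behind. That’s fine for a toy example, but not something you’d want in real code.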

Let’s try it on a console:

1> c(dumb_math).
{ok,dumb_math}
2> dumb_math:dup(1).
2

The Unexpected Exception

So far, so good. Now, according to the docs we read before, if the init/1 function fails, gen_server:start_link/3 should return {error, Reason} and therefore we can expect line #5 in our code to produce a badmatch exception. But instead, look what happens…

3> dumb_math:dup(bad).
** exception exit: badarith
in function dumb_math:init/1 (dumb_math.erl, line 8)
in call from gen_server:init_it/6 (gen_server.erl, line 328)
in call from proc_lib:init_p_do_apply/3 (proc_lib.erl, line 240)

Ok, that’s odd, but in any case we can actually use the second, maybe cleaner option that the docs offer. That is to return {stop, Reason} instead of just failing. And we can also do something with gen_server:start_link’s return value instead of just letting the pattern-matching crash. Something like this:

This time we added a function clause to our init/1 function that returns {stop, notnum} if the input is not a number and therefore can’t be multiplied by 2. We also dealt with that situation in dup/1 by handling the {error, Reason} result described by the docs above.
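A sketch of what that version could look like, assuming the module is still called dumb_math and still exports only dup/1 and init/1. The exact layout is a guess; the behaviour (return the doubled number, or the error reason) follows the description above.

```erlang
-module(dumb_math).
-export([dup/1, init/1]).

dup(X) ->
    case gen_server:start_link(?MODULE, X, []) of
        {ok, Pid} -> sys:get_state(Pid);   % happy path: the state is X * 2
        {error, Reason} -> Reason          % what the docs say we get on {stop, Reason}
    end.

%% Only numbers can be doubled; anything else stops the server right away.
init(X) when is_number(X) -> {ok, X * 2};
init(_) -> {stop, notnum}.
```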

Let’s see how that goes…

1> c(dumb_math).
{ok,dumb_math}
2> dumb_math:dup(1).
2
3> dumb_math:dup(bad).
** exception exit: notnum
4>

Ok, notnum is there, but it’s still an exception, you see? I was expecting notnum to be the result of that function.

Catch me if you can!!

In any case, we know how to deal with exceptions, don’t we? Let’s see…

3> dumb_math:dup(bad).
** exception exit: notnum
4> catch dumb_math:dup(bad).
** exception exit: notnum
5> try dumb_math:dup(bad) catch Type:Error -> {Type, Error} end.
** exception exit: notnum

As you can see, there is simply no way to catch that exception.


What’s going on here?

I’ve got to admit that, for experienced Erlang devs, this example in particular makes things much easier to understand than the ones I stumbled upon in The Real World™. In those cases, most of the time the call to gen_server:start_link/3,4 was hidden inside a function that was called by another which in turn was called by yet another one and so on… with the outermost one including a try…catch.

In any case, for those of you that haven’t yet figured out what’s going on, I recommend that you not read the following paragraph until you’ve tried really hard to understand the issue yourselves. Figuring out how we can have an exception that’s impossible to catch will bring you clarity in many other just-erlang-things that are really useful to know and that I will not describe below. For instance, considering that we couldn’t catch that exception, how are all of our supervisors not one step away from crashing fatally whenever wrong input arrives at their children’s init/1 functions?


Anyway, for the lazy ones out there, here’s a summary of what’s going on here: The function gen_server:start_link/3,4 actually starts and links a gen_server process to the process in which it was evaluated (in our case, the shell). When init/1 crashes or returns {stop, Reason} (as the documentation clearly states) the process is terminated. What the docs do not say is that the process is not terminated normally. Instead it’s terminated with Reason as the termination reason.

You can see in the Erlang docs that:

When a process terminates, it terminates with an exit reason as explained in Process Termination. This exit reason is emitted in an exit signal to all linked processes.

…and…

The default behaviour when a process receives an exit signal with an exit reason other than normal, is to terminate and in turn emit exit signals with the same exit reason to its linked processes.

Now, since the gen_server is linked to our shell process, and it terminates with an exit reason other than normal, the shell process receives an exit signal and, in turn, terminates with the same reason before it gets the chance to do anything else (like pattern-matching on the value of gen_server:start_link(…) or catching exceptions). An exit signal is not an exception raised inside our own code, so neither catch nor try…catch can intercept it. And that’s the exception exit we see there.
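To see that this has nothing to do with gen_server in particular, here is a small sketch that uses only spawn_link. The module name link_demo and the functions run/0 and caller/0 are made up for this illustration. It runs the doomed code in a separate, monitored process, so the shell survives and can report what happened.

```erlang
-module(link_demo).
-export([run/0]).

%% Runs caller/0 in its own process and reports why that process died.
run() ->
    {Caller, Ref} = spawn_monitor(fun caller/0),
    receive
        {'DOWN', Ref, process, Caller, Reason} -> {caller_died, Reason}
    after 1000 ->
        timeout
    end.

caller() ->
    %% The linked child exits immediately with a non-normal reason.
    spawn_link(fun() -> exit(notnum) end),
    %% The exit reaches this process as a signal, not as an exception
    %% raised here, so the catch clause below is never evaluated.
    try
        receive after infinity -> ok end
    catch
        _:_ -> caught
    end.
```

Calling link_demo:run() should give you {caller_died, notnum}: the caller is taken down by the signal while it sits inside its try…catch.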


Ok, but then… how do supervisors work? Why are they not terminating all the time in these scenarios? And, if the calling process is about to terminate anyway, what’s the point in returning a value from gen_server:start_link/3,4 at all?

Well, there is actually one way to check the output of that function without dying. And, again, it would be much more fruitful if you tried to find it yourself before reading the following paragraph ;)


In any case, it’s all in the same docs I linked above. The key is the trap_exit flag, as you can see here:

6> process_flag(trap_exit, true).
false
7> dumb_math:dup(bad).
notnum
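Keep in mind that a process that traps exits gets those signals as regular {'EXIT', Pid, Reason} messages in its mailbox. If you don’t want to flip the flag for the whole process, one option is to trap exits only around the call. What follows is just a sketch under my own assumptions: the module name trap_demo, the {error, Reason} return shape and the 100 ms wait are all made up for the example.

```erlang
-module(trap_demo).
-export([dup/1, init/1]).

%% Hypothetical variant of dup/1 that traps exits only while starting the server.
dup(X) ->
    Old = process_flag(trap_exit, true),
    Result =
        case gen_server:start_link(?MODULE, X, []) of
            {ok, Pid} ->
                sys:get_state(Pid);
            {error, Reason} ->
                %% The trapped exit signal shows up as a message; drop it
                %% so it doesn't linger in the caller's mailbox.
                receive {'EXIT', _From, Reason} -> ok after 100 -> ok end,
                {error, Reason}
        end,
    %% Put the flag back the way we found it.
    process_flag(trap_exit, Old),
    Result.

init(X) when is_number(X) -> {ok, X * 2};
init(_) -> {stop, notnum}.
```

With this, trap_demo:dup(1) returns 2 and trap_demo:dup(bad) returns {error, notnum}, without the caller having to trap exits anywhere else.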

And that’s all. I hope you have learned something today and I actually hope that writing this blog post helps me remember these things the next time I face the same issue :)