Why should every process be supervised?

If you’ve been around in the Elixir / Erlang community long enough, you’ll begin to hear the same wisdom being passed around time after time. Let it crash; don’t use GenServers to separate business logic; and the topic of this blog post:

Every process should be supervised.

But for a long time, I wondered why we should supervise every process. Should we even supervise processes if we don’t care whether they crash, and never want to restart them?

Supervisor’s child spec has three possible values for :restart:permanent:transient, and :temporary. Permanent is the default, and it tells the supervisor that the process should always be restarted, without concern for the manner in which the process exited. Transient processes are only restarted when they exit uncleanly, so they are useful when you want something to run just once to completion, like a Task. Temporary processes, on the other hand, are never restarted.

You might ask yourself, as I did, what is the point of a temporary process? If it’s never going to be restarted, then why bother supervising it at all?

Supervisor has, as it turns out, more responsibility than simply restarting processes that have failed. It is also responsible for shutting down processes when the supervisor is asked to stop, which can happen in the event of a failure, or when the VM is shutting down. So on the very surface, it seems wise to supervise temporary processes, just so they can be terminated at the correct time.

Now let’s talk about failure handling. When a process fails, it is restarted. This “turning it off and on again” is the way OTP tries to recover from a failure. The goal is to get the system into a known-good state.

This same principle applies to supervisors: when a supervisor is unable to start a child process cleanly (when the restart intensity is exceeded), the supervisor crashes and is restarted by its supervisor. This can continue up the supervisor tree until the application itself crashes. So in the worst case scenario, our entire application might be restarted in an attempt to recover from an unknown failure. But what happens if we have been starting unsupervised processes?

Starting unsupervised processes thwarts OTP’s ability to respond to failure.

Since these processes are unsupervised, they are out of reach of the supervision tree. In the event of a failure, OTP cannot shut them down, which means that OTP cannot restore the system to a “known good” state.

The first part of “have you tried turning it off and on again” is turning it off, and if we can’t do that then we can’t recover from failures!

This is why we should supervise every process.

If you are into supervisors and OTP, you should check out Horde, a clustering supervisor I’ve been working on for the past year or so.