Reliable software doesn’t crash… right?

  • Hardware has errors
  • The network can be unreliable
  • Programmers make mistakes!
  • The software has bugs (in very rare cases, software has bugs as well)

Perceptions vs Reality

Instead of denying reality, we acknowledge that software has bugs and the system may fail.

Instead of having uncontrolled failures, Erlang turns failures, exceptions, and crashes into tools that we can use and control.

Control

  • It's all about how to control failures
  • If failures are controllable, they stop being scary events!

We will see how Erlang controls transient failures.

Processes in Erlang:

  • The basic building block in Erlang is the process. A process is fully isolated and shares nothing, i.e. you can’t access its memory directly (the Actor Model).
  • If a process dies, it won’t propagate the failure to other processes.
  • An Erlang process is very lightweight, and we can easily spawn thousands of them.
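As a quick sketch of how cheap processes are, the snippet below (hypothetical module name `spawn_demo`, not from the original text) spawns ten thousand processes, each holding its own isolated state:

```erlang
-module(spawn_demo).
-export([start/0]).

start() ->
    %% Spawn 10 000 processes; each one is fully isolated.
    Pids = [spawn(fun() -> loop(N) end) || N <- lists:seq(1, 10000)],
    length(Pids).

%% Each process just sits in its own loop; nobody can touch its memory.
loop(State) ->
    receive
        stop -> ok;
        _    -> loop(State)
    end.
```

On a typical VM this runs in well under a second; the default process limit is far higher than 10 000.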

Messages:

  • Message passing is the only way processes can communicate with each other.
  • As I already mentioned, message passing in Erlang is a non-blocking operation.

This means communication is always asynchronous: you send a message and go back to your normal activity. When needed, you can check if you got a reply back in your mailbox.

If you send a message and then die, the receiver will still get your message.
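A minimal sketch of this (hypothetical module name `msg_demo`): the child process sends a message and then terminates immediately, yet the message still sits safely in the parent’s mailbox until the parent decides to read it.

```erlang
-module(msg_demo).
-export([run/0]).

run() ->
    Parent = self(),
    %% The child sends a message and then dies right away.
    spawn(fun() -> Parent ! {hello, from_child} end),
    %% The send was non-blocking; we read the mailbox whenever we want.
    receive
        {hello, From} -> From
    end.
```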

Process dependency

  • In reality, processes are not independent of each other.
  • As soon as we start a communication, we have an implicit dependency.

Process A sends a message to process B.

Process B dies without responding.

Process A can wait forever and block… or give up after some time.

But what does “some time” mean? What if process B is just taking too long to reply?
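The “give up after some time” option maps directly onto Erlang’s `receive … after` clause. A sketch (hypothetical module name `timeout_demo`; the `{request, …}` / `{reply, …}` message shapes are assumptions):

```erlang
-module(timeout_demo).
-export([ask/2]).

%% Send a request to Pid and wait at most Timeout milliseconds for a reply.
ask(Pid, Timeout) ->
    Pid ! {request, self()},
    receive
        {reply, Answer} -> {ok, Answer}
    after Timeout ->
        %% Process B never answered (it died, or is just slow): give up.
        {error, timeout}
    end.
```

Choosing the timeout value is still a design decision; the language only gives us a clean way to express it.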

Monitor

  • A monitor is an observer.
  • You keep an eye on a process, and whenever it dies you receive a message in your mailbox
  • Besides this, you can react and make a decision about what to do
  • The observed process will not know it is being observed
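A sketch of monitoring in practice (hypothetical module name `monitor_demo`): when the watched process dies, a regular `'DOWN'` message lands in our mailbox and we can react to it.

```erlang
-module(monitor_demo).
-export([watch/0]).

watch() ->
    %% Spawn a process that crashes only when told to.
    Pid = spawn(fun() -> receive crash -> exit(boom) end end),
    %% Start observing; Pid has no idea it is being watched.
    Ref = erlang:monitor(process, Pid),
    Pid ! crash,
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        no_down_message
    end.
```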

There is one more mechanism, called a link.

Link

  • Links between processes are bidirectional.
  • When a process dies, the linked process will receive an exit signal and, in turn, will be killed as well (in simple words: if one process dies, it will bring down the other).
  • Process links provide us with an architectural construct that ties multiple processes together so that they will fail as a unit.
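The “fail as a unit” behavior can be sketched like this (hypothetical module name `link_demo`): process A links to process B via `spawn_link`; when B crashes, the exit signal kills A too, which we observe from outside with a monitor.

```erlang
-module(link_demo).
-export([run/0]).

run() ->
    Parent = self(),
    A = spawn(fun() ->
                  %% A links to B; they now fail as a unit.
                  B = spawn_link(fun() -> receive crash -> exit(boom) end end),
                  Parent ! {b_is, B},
                  receive never -> ok end  %% A itself never crashes on its own
              end),
    Ref = erlang:monitor(process, A),
    receive {b_is, B} -> B ! crash end,
    receive
        %% A died with B's exit reason, even though only B crashed.
        {'DOWN', Ref, process, A, Reason} -> Reason
    after 1000 ->
        a_survived
    end.
```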

Trapping Exits

But sometimes we don’t want a bidirectional relationship

  • Not every process is equal; sometimes we want one (master) process to bring down all the others (children) when it dies, but not the other way around.

In Erlang, it is possible to mark some processes as special.

These special processes can trap the exit signal and convert it into a message. This way, they are informed of the failure and can eventually recover from it.

This feature turns out to be essential to building supervisors.
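A sketch of trapping exits (hypothetical module name `trap_demo`): with `process_flag(trap_exit, true)`, the exit signal from a crashed linked child arrives as an ordinary `{'EXIT', …}` message instead of killing us.

```erlang
-module(trap_demo).
-export([run/0]).

run() ->
    %% Mark this process as special: exit signals become messages.
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> exit(boom) end),
    receive
        %% Instead of dying, we are informed and can decide what to do,
        %% e.g. restart the child -- the essence of a supervisor.
        {'EXIT', Child, Reason} -> {child_died, Reason}
    after 1000 ->
        no_exit_signal
    end.
```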

Supervisors

  • What’s a supervisor? A supervisor is just a special process with one job: it starts other processes, keeps an eye on them (by monitoring and linking to them), and restarts them in case of failure.
  • Supervisors can also start/monitor/restart other supervisors
  • Normally a supervisor starts a bunch of processes, and some of those processes can themselves be supervisors.

That’s what we call a supervision tree in Erlang: a master supervisor starts other supervisors, which start supervisors all the way down, ending in the generic processes that do the actual job; we call these workers.

Supervisors in Erlang have different restart strategies:

One for One: If a child process terminates, only that process is restarted. E.g. when process 2 dies, only process 2 is restarted.

One for All: If a child process terminates, all other child processes are terminated, and then all child processes, including the terminated one, are restarted.

Rest for One: If a child process terminates, the rest of the child processes (that is, the child processes after the terminated process in start order) are terminated. Then the terminated child process and the rest of the child processes are restarted.

There is also one more characteristic of a supervisor’s configuration, its tolerance, which we call the crash intensity.

Supervisor tolerance

  • Supervisors have a configurable tolerance, or crash intensity: whenever their workers crash more often than this tolerance allows, the supervisor itself exits.
  • We can configure a supervisor to allow N crashes within a time frame T.

If the number of crashes is higher than the tolerated intensity, the supervisor will quit: it kills all its workers and then itself.
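The strategy and the tolerance both live in the supervisor’s flags. A minimal sketch of an OTP supervisor module (the module name `demo_sup` and the worker `demo_worker` are hypothetical placeholders):

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy  => one_for_one, %% or one_for_all / rest_for_one
                 intensity => 5,           %% tolerate at most 5 crashes...
                 period    => 10},         %% ...within any 10-second window
    Children = [#{id    => demo_worker,
                  start => {demo_worker, start_link, []}}],
    {ok, {SupFlags, Children}}.
```

With `intensity => 5` and `period => 10`, a sixth child crash inside ten seconds makes the supervisor itself give up and exit.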

So basically, in Erlang we let processes crash, and we have system building blocks that do the restarting. But does it work?

It depends on the type of failure you have to handle:

  • Not all failures are equal.
  • For some failures, like crashing due to data corruption, restarting will not help at all: the process starts over and crashes again, because restarting does not solve the underlying problem.
  • Other failures are transient, like a network connection reset; for those, restarting may help.
  • Failures can be repeatable or transient. Transient failures (network failures, the cleaning lady pulling out the plug) are the ones we can’t predict: we don’t know why they occur.

Repeatable failures are usually easy to find and fix.

With proper QA, they normally won’t reach production.

Even if they do reach production, it is possible to detect and patch them.

Transient failures are much harder to find and fix (these are the nasty ones).

Because they occur rarely, they will normally only show up in production.

A failure with a 0.01% occurrence rate and 10M daily active users will hit at least 1,000 users each day!

Restarting may not work for repeatable failures:

  • On core features, restarting is useless.
  • On secondary features, it depends: if the feature is not important and not used very often, it can be acceptable to restart and move on.

Restarting does work for transient failures:

Restarting is at the heart of every Erlang system, and it is extremely effective.

  • Hard to reproduce usually means it will only occur under some circumstances
  • Restarting or repeating the procedure may make the failure disappear
  • Logging, tracing or using other introspection tools whenever the rare failure occurs can help developers to fix them later on

In Erlang, as you know by now, we have a very effective strategy for dealing with errors, and it moves us towards happy path programming.

Happy Path Programming:

  • Traditionally, crashing is a bad thing.
  • With Erlang, a crash is a tool that we can use for system design.
  • This practice normally leads to “happy path programming”.
  • Happy path programming means not doing defensive programming: basically, we don’t write any “error handling” code, we only handle “good values”.
  • Whenever there’s an error (or unexpected data), we crash the process, which stops the processing immediately.
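A sketch of the style (hypothetical module name `happy_demo`): we pattern-match only the good value, so any error result crashes the process on the spot, and a supervisor can restart it.

```erlang
-module(happy_demo).
-export([read_config/1]).

%% Happy path: we only match {ok, Bin}. If file:read_file/1 returns
%% {error, Reason} instead, the match fails, the process crashes
%% immediately -- no defensive error-handling code anywhere.
read_config(Path) ->
    {ok, Bin} = file:read_file(Path),
    Bin.
```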
