Reliable software doesn’t crash… right?

  • Hardware has errors
  • The network can be unreliable
  • Programmers make mistakes!
  • The software has bugs (in very rare cases, software has bugs as well)

Perceptions vs Reality

Instead of denying reality, we acknowledge that software has bugs and the system may fail.

Instead of having uncontrolled failures, Erlang turns failures, exceptions, and crashes into tools that we can use and control.

Control

  • It's all about how to control failures
  • If failures are controllable, they stop being scary events!

We will see how Erlang controls transient failures.

Processes in Erlang:

  • The basic building block in Erlang is the process. A process is fully isolated and shares nothing, i.e. you can’t access its memory directly (the Actor Model).
  • If a process dies, it won’t propagate the failure to other processes.
  • An Erlang process is very lightweight, and we can easily spawn thousands of them.
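As a quick sketch of how cheap processes are, the snippet below (hypothetical module name `spawn_demo`, not from the original text) spawns ten thousand processes, each holding its own isolated state:

```erlang
-module(spawn_demo).
-export([start/0]).

start() ->
    %% Spawn 10 000 processes; each one is fully isolated.
    Pids = [spawn(fun() -> loop(N) end) || N <- lists:seq(1, 10000)],
    length(Pids).

%% Each process just sits in its own loop; nobody can touch its memory.
loop(State) ->
    receive
        stop -> ok;
        _    -> loop(State)
    end.
```

On a typical VM this runs in well under a second; the default process limit is far higher than 10 000.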

Messages:

  • Message passing is the only way processes can communicate with each other.
  • As I already mentioned, message passing in Erlang is a non-blocking operation.

This means communication is always asynchronous: you send a message and go back to your normal activity. When needed, you can check if you got a reply back in your mailbox.

If you send a message and then die, the receiver will still get your message.
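A minimal sketch of this (hypothetical module name `msg_demo`): the child process sends a message and then terminates immediately, yet the message still sits safely in the parent’s mailbox until the parent decides to read it.

```erlang
-module(msg_demo).
-export([run/0]).

run() ->
    Parent = self(),
    %% The child sends a message and then dies right away.
    spawn(fun() -> Parent ! {hello, from_child} end),
    %% The send was non-blocking; we read the mailbox whenever we want.
    receive
        {hello, From} -> From
    end.
```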

Process dependency

  • In reality, processes are not independent of each other.
  • As soon as we start a communication, we have an implicit dependency.

Process A sends a message to process B.

Process B dies without responding.

Process A can wait forever and block… or give up after some time.

But what does “some time” mean? What if process B is just taking too long to reply?
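The “give up after some time” option maps directly onto Erlang’s `receive … after` clause. A sketch (hypothetical module name `timeout_demo`; the `{request, …}` / `{reply, …}` message shapes are assumptions):

```erlang
-module(timeout_demo).
-export([ask/2]).

%% Send a request to Pid and wait at most Timeout milliseconds for a reply.
ask(Pid, Timeout) ->
    Pid ! {request, self()},
    receive
        {reply, Answer} -> {ok, Answer}
    after Timeout ->
        %% Process B never answered (it died, or is just slow): give up.
        {error, timeout}
    end.
```

Choosing the timeout value is still a design decision; the language only gives us a clean way to express it.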

Monitor

  • A monitor is an observer.
  • You keep an eye on a process, and whenever it dies you receive a message in your mailbox
  • Besides this, you can react and make a decision about what to do
  • The observed process will not know it is being observed
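A sketch of monitoring in practice (hypothetical module name `monitor_demo`): when the watched process dies, a regular `'DOWN'` message lands in our mailbox and we can react to it.

```erlang
-module(monitor_demo).
-export([watch/0]).

watch() ->
    %% Spawn a process that crashes only when told to.
    Pid = spawn(fun() -> receive crash -> exit(boom) end end),
    %% Start observing; Pid has no idea it is being watched.
    Ref = erlang:monitor(process, Pid),
    Pid ! crash,
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        no_down_message
    end.
```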

There is one more mechanism, called a link.

Link

  • Links between processes are bidirectional.
  • When a process dies, the linked process will receive an exit signal and, in turn, will be killed as well (in simple words: if one process dies, it will bring down the other).
  • Process links provide us with an architectural construct that ties multiple processes together so that they will fail as a unit.
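The “fail as a unit” behavior can be sketched like this (hypothetical module name `link_demo`): process A links to process B via `spawn_link`; when B crashes, the exit signal kills A too, which we observe from outside with a monitor.

```erlang
-module(link_demo).
-export([run/0]).

run() ->
    Parent = self(),
    A = spawn(fun() ->
                  %% A links to B; they now fail as a unit.
                  B = spawn_link(fun() -> receive crash -> exit(boom) end end),
                  Parent ! {b_is, B},
                  receive never -> ok end  %% A itself never crashes on its own
              end),
    Ref = erlang:monitor(process, A),
    receive {b_is, B} -> B ! crash end,
    receive
        %% A died with B's exit reason, even though only B crashed.
        {'DOWN', Ref, process, A, Reason} -> Reason
    after 1000 ->
        a_survived
    end.
```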

Trapping Exits

But sometimes we don’t want a bidirectional relationship

  • Not every process is equal; sometimes we want one (master) process to bring down all the others (children) when it dies, but not the other way around.

In Erlang, it is possible to mark some processes as special.

These special processes can trap the exit signal and convert it into a message. This way, they are informed of the failure and can eventually recover from it.

This feature turns out to be essential to building supervisors.
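A sketch of trapping exits (hypothetical module name `trap_demo`): with `process_flag(trap_exit, true)`, the exit signal from a crashed linked child arrives as an ordinary `{'EXIT', …}` message instead of killing us.

```erlang
-module(trap_demo).
-export([run/0]).

run() ->
    %% Mark this process as special: exit signals become messages.
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> exit(boom) end),
    receive
        %% Instead of dying, we are informed and can decide what to do,
        %% e.g. restart the child -- the essence of a supervisor.
        {'EXIT', Child, Reason} -> {child_died, Reason}
    after 1000 ->
        no_exit_signal
    end.
```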

Supervisors

  • What’s a supervisor? A supervisor is just a special process with one job: it starts other processes, keeps an eye on them (by monitoring and linking to them), and restarts them in case of failure.
  • Supervisors can also start/monitor/restart other supervisors
  • Normally a supervisor starts a bunch of processes, and some of those processes can themselves be supervisors.

That’s what we call a supervision tree in Erlang: a master supervisor starts other supervisors, which start supervisors all the way down, ending in the generic processes that do the actual job; we call these workers.

Supervisors in Erlang have different restart strategies:

One for One: If a child process terminates, only that process is restarted. E.g. when process 2 dies, only process 2 is restarted.

One for All: If a child process terminates, all other child processes are terminated, and then all child processes, including the terminated one, are restarted.

Rest for One: If a child process terminates, the rest of the child processes (that is, the child processes after the terminated process in start order) are terminated. Then the terminated child process and the rest of the child processes are restarted.

There is also one more characteristic of a supervisor’s configuration, its tolerance, which we call the crash intensity.

Supervisor tolerance

  • Supervisors have a configurable tolerance, or crash intensity: whenever their workers crash more often than this tolerance allows, the supervisor itself exits.
  • We can configure a supervisor to allow N crashes within a time frame T.

If the number of crashes is higher than the tolerated intensity, the supervisor will quit: it kills all its workers and then itself.
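The strategy and the tolerance both live in the supervisor’s flags. A minimal sketch of an OTP supervisor module (the module name `demo_sup` and the worker `demo_worker` are hypothetical placeholders):

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy  => one_for_one, %% or one_for_all / rest_for_one
                 intensity => 5,           %% tolerate at most 5 crashes...
                 period    => 10},         %% ...within any 10-second window
    Children = [#{id    => demo_worker,
                  start => {demo_worker, start_link, []}}],
    {ok, {SupFlags, Children}}.
```

With `intensity => 5` and `period => 10`, a sixth child crash inside ten seconds makes the supervisor itself give up and exit.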

So basically, in Erlang we let processes crash, and we have system building blocks that do the restarting. But does it work?

It depends on the type of failure you have to handle:

  • Not all failures are equal.
  • For some failures, like crashing due to data corruption, restarting will not help at all: the process starts over and crashes again, because restarting does not solve the underlying problem.
  • Other failures are transient, like a network connection reset; for those, restarting may help.
  • Failures can be repeatable or transient. Transient failures (network failures, the cleaning lady pulling out the plug) are the ones we can’t predict: we don’t know why they occur.

Repeatable failures are usually easy to find and fix.

With proper QA, they normally won’t reach production.

Even if they do reach production, it is possible to detect and patch them.

Transient failures are much harder to find and fix (these are the nasty ones).

Because they occur rarely, they will normally only show up in production.

A failure with a 0.01% occurrence rate and 10M daily active users will hit at least 1,000 users each day!

Restarting may not work for repeatable failures:

  • On core features, restarting is useless.
  • On secondary features, it depends: if the feature is not important and not used very often, it can be acceptable to restart and move on.

Restarting does work for transient failures:

Restarting is at the heart of every Erlang system, and it is extremely effective.

  • Hard to reproduce usually means it will only occur under some circumstances
  • Restarting or repeating the procedure may make the failure disappear
  • Logging, tracing or using other introspection tools whenever the rare failure occurs can help developers to fix them later on

In Erlang, as you know by now, we have a very effective strategy for dealing with errors, and it moves us towards happy path programming.

Happy Path Programming:

  • Traditionally, crashing is a bad thing.
  • With Erlang, a crash is a tool that we can use for system design.
  • This practice normally leads to “happy path programming”.
  • Happy path programming means not doing defensive programming: basically, we don’t write any “error handling” code, we only handle “good values”.
  • Whenever there’s an error (or unexpected data), we crash the process, which stops the processing immediately.
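A sketch of the style (hypothetical module name `happy_demo`): we pattern-match only the good value, so any error result crashes the process on the spot, and a supervisor can restart it.

```erlang
-module(happy_demo).
-export([read_config/1]).

%% Happy path: we only match {ok, Bin}. If file:read_file/1 returns
%% {error, Reason} instead, the match fails, the process crashes
%% immediately -- no defensive error-handling code anywhere.
read_config(Path) ->
    {ok, Bin} = file:read_file(Path),
    Bin.
```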
