How to start and stop an Erlang daemon

May 15, 2018

Erlang (the platform created so that people can sleep at night and rest on vacation instead of fixing bugs) shines when it runs and manages software written in it. Under the hood it looks like a separate OS, and it is even possible to launch Erlang without an OS at all: http://erlangonxen.org/ In this article I’d like to discuss the problem of booting the Erlang VM, because it is not trivial and is very counterintuitive compared to plain old Unix daemons like nginx or postfix.

What do we need to start a Unix daemon?

1. Validate the config. There is usually no point in starting the software and then complaining about an invalid config in the logs, because nobody reads them.
2. Change user/group. It is common practice to boot a daemon as root and then switch to another user, possibly an artificial isolated account such as nobody/nogroup, to limit the damage from possible attacks. Starting as root may be needed to listen on ports below 1024. Linux offers setcap 'cap_net_bind_service=+ep' /path/to/program to let a non-root process listen on port 80. In the case of Erlang you have to set this capability on beam.smp, and remember to do it again after each upgrade of the runtime.
3. Check whether another instance of the daemon is already running. This problem is as old as multitasking itself. It is very hard to distinguish between copies of the same program and different instances of the same code on disk that serve different purposes. For example, you may want to launch one nginx for your web server and another nginx for some management API. These are two different instances launched from the same file on disk.
4. Prepare all required directories, such as the log directory, or complain that they do not exist. Both approaches are valid.
5. Launch the software so that it does not hold the terminal (daemonize it).
6. Ensure that it is actually working, or find out about runtime issues that it can no longer print to the terminal. Ensure that the pidfile is written and that the daemon responds to commands.

This list is what we do to start Flussonic, so let’s describe it in more depth. To follow along you can install Flussonic from our website: https://flussonic.com/doc/installation and ask for a trial key to test it: https://flussonic.com/trial Or do not request a key and just look at the scripts.

Changing user

We do not need this often, but we give the admin a hint on how to change the user the daemon runs as. It is done with a rather primitive call:

exec su "$user" "$0" "$@"
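A call like this has to be guarded, otherwise the re-executed script would try to switch users again. The following is a minimal sketch of such a guard, not the code from the Flussonic init script; the function name needs_user_switch and the variable RUN_AS are hypothetical.

```shell
# Sketch only: decide whether the init script still has to switch users.
# needs_user_switch and RUN_AS are hypothetical names used for illustration.

# Succeeds when a target user is configured and differs from the current one.
needs_user_switch() {
  target="$1"
  [ -n "$target" ] && [ "$(id -un)" != "$target" ]
}

# Intended use near the top of the init script (left as a comment so that
# the sketch has no side effects when it is sourced):
#
#   if needs_user_switch "$RUN_AS"; then
#     exec su "$RUN_AS" "$0" "$@"
#   fi
```

With the guard in place the script re-executes itself once under the target user, and the second run falls through to the normal start logic.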

Validating config

This is done in /etc/init.d/flussonic. The first step is to run:

./contrib/validate_config.erl || exit $?

We could do a lot here, but frankly we do not do that much: find the config file, read it and run some validations. This is a good place to check that an SSL certificate exists when the config asks for an https port, so that is what we do.

It is very good practice to run such code before starting the daemon.
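The real check lives in validate_config.erl, which is not shown here. To illustrate the idea of failing early with a distinct exit code, here is a rough shell sketch. The config syntax (one directive per line, such as "https 443;"), the function names and the exit codes are assumptions made for this example.

```shell
# Sketch only: fail before the daemon starts if https is requested
# but no certificate is available.
# Assumed (hypothetical) config syntax: one directive per line, e.g. "https 443;".

# Succeeds if the config file contains an https directive.
config_wants_https() {
  grep -Eq '^[[:space:]]*https[[:space:]]' "$1"
}

# validate_config CONFIG CERT
# Returns 0 if the config looks usable, 2 if it cannot be read,
# 3 if https is enabled but the certificate file is missing.
validate_config() {
  conf="$1"
  cert="$2"
  if [ ! -r "$conf" ]; then
    echo "config $conf is missing or unreadable" >&2
    return 2
  fi
  if config_wants_https "$conf" && [ ! -r "$cert" ]; then
    echo "https is enabled in $conf but certificate $cert is missing" >&2
    return 3
  fi
  return 0
}
```

An init script would call it as validate_config "$CONFIG" "$CERT" || exit $? so that the admin sees the message on the console instead of in a log.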

Checking whether the daemon is alive

There are several ways to check whether an Erlang daemon is alive:

1. Ask epmd: epmd -names or net_adm:names({127,0,0,1}). The second form is more convenient from an Erlang escript. Be aware that epmd can sometimes show a node that does not exist or forget about an existing one.
2. Check the pid file that you wrote earlier. If it is left over from a previous boot of Linux, it may point to a different daemon that is alive. This is tricky, so implement it carefully. We just check whether the pid exists and kill the process running under it. Not the best way, sorry:

if [ "$PID" != "" ] ; then
  if kill "$PID" >/dev/null 2>&1 ; then
    echo "Killing old flussonic instance with pid $PID"
    echo "Killing old flussonic instance with pid $PID" >> /var/log/flussonic/flussonic.log
  fi
fi

3. Write something to /var/run/flussonic/erlang.pipe.1.w. We launch Flussonic with run_erl, so it is possible to talk to the Erlang shell: write something to the pipe and possibly read a reply. We abandoned this method because the shell script may simply block if no process is reading the pipe.

4. Make an HTTP API call to a known port: read the config, find the port and make a call to it. This does not suit us, because sysadmins like to “configure the firewall” and then ask why nothing works.
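The stale pid file problem from the second option can be reduced by checking what actually runs under the recorded pid before touching it. The sketch below is Linux-only because it reads /proc, and it is not what the Flussonic scripts do. The function name pid_matches and the idea of matching on a substring such as beam.smp are assumptions.

```shell
# Sketch only (Linux, relies on /proc): verify that a pid from a pid file
# belongs to the expected program before killing it.

# pid_matches PID PATTERN
# Succeeds only if PID is numeric, the process exists and its command line
# contains PATTERN as a substring.
pid_matches() {
  pid="$1"
  pattern="$2"
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;          # empty or garbage in the pid file
  esac
  [ -r "/proc/$pid/cmdline" ] || return 1
  # Arguments in /proc/PID/cmdline are separated by NUL bytes.
  args="$(tr '\0' ' ' < "/proc/$pid/cmdline")"
  case "$args" in
    *"$pattern"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible use is pid_matches "$PID" beam.smp && kill "$PID". A stricter pattern, for example the node name passed to the VM, would also tell two Erlang daemons on the same host apart.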

Launching the Erlang daemon

We use run_erl for this. It works, and it lets us reach the Erlang shell when Erlang distribution is not working and remsh is unavailable, but there are some caveats. The launch is done in /opt/flussonic/bin/flussonic.

run_erl can write log files. Use this, because it records the output of the Erlang node, including stray io:format calls and the output of child processes, which would otherwise bypass the logs.
run_erl writes its log synchronously! If you use a tool like lager and leave the console verbosity at info or debug, you can ruin your server: a sync after each log line is too slow even for an SSD. Reduce the console verbosity to warning or error.
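For readers who have not used run_erl, an invocation looks roughly like the sketch below. This is not the content of /opt/flussonic/bin/flussonic: the node name, the directories and the log size values are assumptions chosen for illustration, and the function is only defined here, not called.

```shell
# Sketch only: start an Erlang node under run_erl.
# Paths, node name and log limits are illustrative assumptions.
PIPE_DIR=/var/run/flussonic/   # trailing slash: pipes are created as erlang.pipe.N inside
LOG_DIR=/var/log/flussonic

start_node() {
  # run_erl rotates its own log files; these environment variables bound
  # the number of files and the size of each one in bytes.
  export RUN_ERL_LOG_GENERATIONS=5
  export RUN_ERL_LOG_MAXSIZE=10000000
  # -daemon detaches run_erl from the terminal; the quoted command is run
  # by a shell, so "exec" keeps the VM as the direct child of run_erl.
  run_erl -daemon "$PIPE_DIR" "$LOG_DIR" "exec erl -name flussonic@127.0.0.1"
}
```

Once the node is up, to_erl /var/run/flussonic/ attaches to its shell through the pipes without relying on Erlang distribution.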

Checking that it has launched

It is impossible to call the daemon(3) function from Erlang, so we have to take some artificial steps to check that all is OK.

After calling run_erl we wait for the pidfile to appear:

i=0
while ((i < 100)) && [ ! -f "$PIDFILE" -a ! -f "$ERRORFILE" ] ; do
  echo -n "."
  ((i = i+1))
  sleep 1
done

The pidfile is created from Erlang code only after everything has started (license checked, listeners started).

There is a small trick I want to share. It is impossible to write from an Erlang daemon running under run_erl to the admin’s console, so we write the message to a boot error log, which the init script then prints:

if [ -f "$PIDFILE" ] ; then
  echo "done"
else
  if [ -f "$ERRORFILE" ] ; then
    echo "failed: $(cat "$ERRORFILE")"
    rm -f "$ERRORFILE"
    exit 1
  fi
fi

Here we can write something like “your license is not good” or “your config became invalid right after validation”.

Voila, the Erlang daemon has launched. Do not forget to write the pid:
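The wait loop and the report can be folded into one function that also covers the case where neither file appears before the timeout. This is a sketch under assumptions rather than the original init script: the function name wait_for_boot, its argument order and its return codes are made up for the example.

```shell
# Sketch only: wait for the daemon to report success or failure.

# wait_for_boot PIDFILE ERRORFILE TIMEOUT_SECONDS
# Returns 0 when the pid file appeared, 1 when the daemon left an error
# message, 2 when neither file showed up within the timeout.
wait_for_boot() {
  pidfile="$1"
  errorfile="$2"
  timeout="$3"
  i=0
  while [ "$i" -lt "$timeout" ] && [ ! -f "$pidfile" ] && [ ! -f "$errorfile" ]; do
    i=$((i + 1))
    sleep 1
  done
  if [ -f "$pidfile" ]; then
    echo "done"
    return 0
  elif [ -f "$errorfile" ]; then
    # Show the daemon's own explanation on the admin's console.
    echo "failed: $(cat "$errorfile")"
    rm -f "$errorfile"
    return 1
  else
    echo "failed: no pid file after ${timeout}s"
    return 2
  fi
}
```

The init script could then end with wait_for_boot "$PIDFILE" "$ERRORFILE" 100 || exit 1, so a silent hang is reported as a failure too.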

file:write_file(PidPath, os:getpid())

Now we need to stop all this

Stopping an Erlang process is not a trivial task. When things go well, everything stops seamlessly, but we encounter very wild configurations invented by our customers. Some consider it absolutely OK to configure a hostname that points to an IP address which does not even reply with “host unreachable”, so any call to the resolver may take several minutes. We literally deploy to a minefield.

1. We connect to Flussonic via Erlang distribution. There are tricks here, because we need to be sure that our control tool has an Erlang node name that is not already in use. One cannot just generate a random name each time, because node names are atoms and an endless stream of new names may overflow the atom table, so select an unused name and connect to Flussonic; net_adm:names({127,0,0,1}) helps to find the existing nodes.
2. We call rpc:call(Flussonic, flussonic, halt, []). This is not a bare halt but a clean stop: we log that Flussonic is being stopped by an external call, write something sensible to the boot error log (see above), then sync our logger to disk and halt.
3. Do not assume that erlang:halt(0) will immediately stop your Erlang daemon =) Ha-ha! Use erlang:halt(0,[{flush,false}]) if you do not want to hear “why is your software not stopping?”. Sometimes it is hard to explain that it is a bad idea to mount the log directory on sync NFS and then unplug the cable, so use the {flush,false} option.
4. If it is still alive, read the pid file and kill the process. Wait a couple of seconds and kill -9. If it is still alive after that, write an email to the sysadmin asking to switch NFS to a soft mount and to check the hardware.
5. It is also possible to write erlang:halt(0,[{flush,false}]) to erlang.pipe.1.w, but as I have already said, this may be a blocking operation.
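The last-resort part of this sequence (kill, wait a couple of seconds, kill -9) can be sketched as a small shell function. The name stop_pid and the default grace period are assumptions, and the real control tool may behave differently.

```shell
# Sketch only: escalate from TERM to KILL for a pid read from the pid file.

# stop_pid PID [GRACE_SECONDS]
# Returns 0 if the process is gone at the end, non-zero if it survived
# even SIGKILL (for example when stuck in uninterruptible I/O on NFS).
stop_pid() {
  pid="$1"
  grace="${2:-2}"
  kill -0 "$pid" 2>/dev/null || return 0    # already gone
  kill "$pid" 2>/dev/null                   # polite request: SIGTERM
  i=0
  while [ "$i" -lt "$grace" ] && kill -0 "$pid" 2>/dev/null; do
    i=$((i + 1))
    sleep 1
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid" 2>/dev/null              # no more politeness
    sleep 1
  fi
  ! kill -0 "$pid" 2>/dev/null
}
```

A non-zero result from stop_pid "$PID" 2 is the point where the email to the sysadmin is due.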

So this seems to be enough to stop an Erlang daemon in wild production, where hostnames, DNS, NFS and the rest are configured by guerrilla sysadmins.