Troubleshooting Elixir crashes

Bruce Pomeroy
4 min read · May 23, 2017


Elixir and Erlang have a well-earned reputation for stability and fault-tolerance. So I was surprised the first time I received a notification that my app was “down”. It turns out Elixir can’t protect you from yourself, and I’ve found a couple of ways to bring down an Elixir app. The good news is, if you can diagnose the problem you can fix it pretty quickly.

If your app just went down, here’s how you can troubleshoot.

ssh into your web server, cd into the directory you deployed your app to, then run ls -l:

... Feb 28 04:44 bin
... Apr 7 16:04 erl_crash.dump
... Feb 28 02:59 erts-8.2
... May 23 19:28 log
... Feb 28 04:44 releases

You’ll see the directory contents along with their modification dates. Pay attention to erl_crash.dump and log. If the modification date on erl_crash.dump corresponds to the start of the outage (remember that the server’s clock may not be in your timezone) then erl_crash.dump will likely contain some useful info. There’s a guide to interpreting the contents of this file here: http://erlang.org/doc/apps/erts/crash_dump.html.
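
If reading the raw dump feels heavy going, one option (assuming you have a local Erlang/OTP install with the observer and wx applications available) is to copy the file down and open it in OTP’s Crashdump Viewer:

# From IEx on a machine with Erlang/OTP installed (the viewer ships with
# the observer application and needs wx for its GUI):
:crashdump_viewer.start()
# A window prompts for the path to erl_crash.dump and lets you browse the
# slogan (crash reason), processes and memory at the time of the crash.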

If the modification date of erl_crash.dump doesn’t correspond to the crash, the next stop is the logs: cd into log and use ls -l to find the log file with the most recent modification date, then tail it with tail -f -n100 <logfile name>.

Hopefully you found some useful info in erl_crash.dump or log/*.

One thing I’ve run into is that supervisors restart processes after they crash, but by default they’ll only restart a process three times in a five-second window. You may have tested in a controlled scenario that your supervisor will restart the process, but in production you may get scenarios where it crashes too frequently and the supervisor gives up. If this is the case you’ll see a message like “Maximum restart intensity reached”. You can read more about “restart intensity” and how to change it in the Supervisor documentation; search for :max_restarts and :max_seconds.
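
For reference, here’s a minimal sketch of how those options are passed in Elixir; MyApp.Worker is a placeholder for your own child spec, and the values are only examples:

children = [
  MyApp.Worker
]

# The defaults are max_restarts: 3 and max_seconds: 5; this widens the
# window to 10 restarts in any 10-second period before the supervisor
# itself gives up.
Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 10,
  max_seconds: 10
)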

In general, though, if you’re having to increase the restart intensity of a supervisor it’s a bad sign. It can be OK to “let it crash” and rely on process supervision, but it needs some careful thought. It only makes sense if restarting the process will resolve the problem, and there are many cases where it won’t. For example, if your process crashes when a database is unreachable, you have a problem: restarting your process isn’t going to fix the database connection. On restart the database will probably still be down and the process will crash again until max restart intensity is reached. In some cases increasing restart intensity could be a stop-gap measure to keep your system running until you can resolve the issue, but it’s not a good road to go down.
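
If the failure really is an external condition like a down database, one alternative (just a sketch, not a prescription; MyApp.check_db/0 is a hypothetical health check) is to treat it as an expected error and retry on a timer instead of crashing repeatedly:

defmodule MyApp.DBWatcher do
  # Sketch: treat an unreachable database as an expected condition and
  # retry on a timer, rather than crashing and exhausting the
  # supervisor's restart intensity. MyApp.check_db/0 is hypothetical.
  use GenServer
  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    send(self(), :check)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:check, state) do
    case MyApp.check_db() do
      :ok -> :ok
      {:error, reason} -> Logger.warning("database unreachable: #{inspect(reason)}")
    end

    # Check again later rather than crashing and being restarted.
    Process.send_after(self(), :check, 5_000)
    {:noreply, state}
  end
end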

The erl_crash.dump or the log may yield some useful info, but if not, the BEAM may have been killed externally.

Ubuntu, and I expect other operating systems, will kill processes when the OS begins running dangerously low on memory. On Linux a kernel mechanism called the OOM Killer (Out of Memory Killer) kills processes when memory becomes critically low. We can view the activity of the OOM Killer with `tail -f -n100 /var/log/kern.log`.

Here we can see an example of OOM Killer having killed the BEAM process. Note the last few lines:

Jan 01 00:00:00 ip-0-0-0-0 kernel: [7250834.152637] Out of memory: Kill process 22397 (beam) score 81 or sacrifice child
Jan 01 00:00:00 ip-0-0-0-0 kernel: [7250834.157643] Killed process 22397 (beam) total-vm:2301796kB, anon-rss:166036kB, file-rss:0kB

This doesn’t necessarily mean that BEAM is “to blame” for memory consumption. Every process consumes memory and it doesn’t really make sense to blame a specific process. There’s some method to how the OOM Killer chooses its victim, but that’s not really important. What is important is to keep memory usage in a reasonable range so the OOM Killer never needs to step in.

In the log the OOM Killer lists the OS processes along with their memory consumption at the moment it killed BEAM. In that process table there is a BEAM process that’s using substantial memory, but that’s expected. What’s more suspect in this case is the numerous convert (an ImageMagick tool) processes that are all using substantial amounts of memory. This bears further investigation: is it expected or necessary for that many ImageMagick processes to be running concurrently? This is just one example; other processes can use a lot of memory: PhantomJS, Redis, RethinkDB, anything else you might be running on that server. It may make sense to run your app on a server with more memory or to move a memory-hungry process onto a separate machine; that’s up to you and what makes sense for your app.
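
To see how much of that memory the BEAM itself is holding, you can ask the VM from a console attached to the running node (how you attach depends on your release tooling):

# Returns a keyword list of byte counts (:total, :processes, :binary,
# :ets, :atom and so on), showing where the VM's memory is going.
:erlang.memory()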

The OOM Killer uses SIGKILL to terminate the process, not SIGTERM. Unlike SIGTERM, SIGKILL kills the process without giving it the opportunity to respond. If you’re using PID files to monitor your process, you may find your system isn’t notified of an OOM Killer termination. You can simulate the OOM Killer with pkill -9 beam.smp. If you’re using PID files, you’ll find they are still there even after you kill the process in this way.
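
To confirm which OS process is your BEAM (for matching against the PID in kern.log, or for targeting that kill test more precisely), you can ask the VM for its own OS-level PID from IEx:

# Returns the OS PID of the running BEAM as a charlist, e.g. '22397'
# (the value is illustrative, matching the kern.log excerpt above).
:os.getpid()
# Recent Elixir versions also expose this as a string via System.pid().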


Bruce Pomeroy

Full-stack developer, specializing in Elixir and React