Concepts of Erlang

Lately I’ve been learning Erlang (and Elixir) after years of reading surface-level material about it. There’s a lot out there that teaches you how to write Erlang or Elixir programs, but here I want to write the conceptual post I wish I’d read before I got started. Erlang combines a set of old ideas in a way that’s neat and has aged incredibly well.

Erlang’s concurrency model in particular uses three concepts that I was already aware of: tail call optimization, lightweight processes, and asynchronous messages.

Tail Call Optimization

A tail call is a call made as the very last step of a function, so the current stack frame doesn’t need to be kept around and a smart runtime can reuse it instead of pushing a new one. Here’s an example of an infinite loop expressed via a tail call in Elixir (because it is easier to read for most folks).

def loop() do
  # do stuff
  loop()
end

When this program runs on the BEAM VM, the tail call is optimized into something a C programmer would recognize as an imperative loop, like “while(true)” or “for(;;)”. This matters in Erlang/Elixir because these recursive functions run for a very long time and form the main loop of most long-running Erlang processes.
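
A quick way to convince yourself the optimization is real is to recurse deeply. This is a minimal sketch (TcoDemo is a made-up module name, not anything from the standard library):

```elixir
defmodule TcoDemo do
  # The recursive call is the last thing the function does, so the BEAM
  # reuses the current stack frame instead of pushing a new one.
  def count_down(0), do: :done
  def count_down(n) when n > 0, do: count_down(n - 1)
end

# Ten million recursive calls run in constant stack space thanks to
# tail call optimization; without it this depth would blow the stack.
TcoDemo.count_down(10_000_000)
```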

Asynchronous messaging

Erlang processes can message each other, and each process has a mailbox where incoming messages queue up. Let’s go back to that loop again:

def loop() do
  result = receive do
    {:hello, name} -> name
    _ -> "?"
  end
  IO.puts(result)
  loop()
end

When this function reaches the receive expression, the process goes to sleep until there’s a message in its mailbox. So now this little thing is starting to look a bit like a server.
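
A process doesn’t have to sleep forever, either: receive also accepts an after clause that fires if no matching message arrives within a timeout. A small sketch (the 5-second timeout is an arbitrary choice for illustration):

```elixir
result =
  receive do
    {:hello, name} -> name
  after
    # Runs if no matching message arrives within 5000 milliseconds.
    5_000 -> "timed out"
  end
```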

Putting them together

These ideas weren’t new to me, but Erlang gets a ton of mileage out of them and the ways they combine. Going back to our loop, let’s add a parameter that can store some state we want to keep in this process:

defmodule Counter do
  def loop(value) do
    next_value = receive do
      {:add, number} -> value + number
      {:subtract, number} -> value - number
    end
    IO.puts(next_value)
    loop(next_value)
  end
end

To create one of these counter processes and send messages to it you’d do:

pid = spawn fn -> Counter.loop(0) end
send pid, {:add, 12}
# prints 12
send pid, {:subtract, 11}
# prints 1

Now we have a little counter loop, and clients can add to or subtract from it as things go. There’s no way to get at its contents except by sending messages, and the Erlang VM can run many thousands of these at once because they’re not the OS threads you might be used to from C++ or Java.
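
Because messages are the only way in, even reading the counter means asking it. A hypothetical sketch of the usual request/reply pattern, assuming we added a {:get, caller} clause to Counter.loop/1 (this clause is my invention, not part of the snippet above):

```elixir
# Hypothetical extra clause inside Counter.loop/1, alongside :add and :subtract:
#
#   {:get, caller} ->
#     send(caller, {:value, value})
#     value

# The client includes its own pid so the counter knows where to reply.
send pid, {:get, self()}

current =
  receive do
    {:value, v} -> v
  end
```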


The Erlang scheduler isn’t quite like anything else I’ve come across. Erlang processes are preempted very often: the VM gives each process a small budget of work (counted in “reductions”, roughly one per function call) and swaps it out when that budget runs out. I imagine this traces back to Erlang’s genesis in telephony, where predictable latency matters. Processes are also super lightweight, and if one crashes it doesn’t cause problems for the rest of the system.

In practice this means you spin up a new process in Erlang for work you’d have a hard time justifying a new OS thread for. Large Erlang programs can have millions of these processes running on a single machine; they share no state with each other and communicate only via asynchronous messages. An individual process is easy to reason about because its loop depends only on its arguments, like any functional program. You also don’t have to worry about shared memory or locks, and data races are much harder to create.
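
To get a feel for how cheap processes are, here is a small sketch that spawns a hundred thousand of them and waits for each to report back, something you would never attempt with OS threads:

```elixir
parent = self()

# Spawn 100_000 processes; each sends one message back and exits.
pids = for i <- 1..100_000, do: spawn(fn -> send(parent, {:done, i}) end)

# Wait until every spawned process has reported in.
for _ <- pids do
  receive do
    {:done, _} -> :ok
  end
end
```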

Why is this interesting?

One small example of the mileage Erlang gets out of this is that the only thing we need to restart this loop is the current value of the counter. So if we want to deploy a new version of this loop without taking the system down, the system can start running the new version of the function with the old state and know that it’s safe. As I understand it, this is more or less how Erlang’s famous “hot code reloading” feature works.
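
A sketch of the mechanism as I understand it: on the BEAM, a module-qualified (“remote”) call always targets the most recently loaded version of a module, so a loop that recurses through its own module name picks up new code on the next iteration:

```elixir
defmodule Counter do
  def loop(value) do
    next_value = receive do
      {:add, number} -> value + number
      {:subtract, number} -> value - number
    end
    # A module-qualified call jumps to the newest loaded version of
    # Counter, so reloading the module swaps in new code right here,
    # carrying the old state (next_value) into the new implementation.
    __MODULE__.loop(next_value)
  end
end
```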

Another thing Erlang uses processes for that I hadn’t seen before is supervisors: processes that watch other processes and restart them if they crash or run into other problems. Supervisors rely on something I haven’t covered here called process linking. In practice, a large Erlang system’s workers run happily doing their work and crash as soon as they get into a bad state, trusting a supervisor to restart them. This is a very different, and I think much better, way to think about errors.
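
Process linking is a bigger topic, but the core move a supervisor makes can be sketched in a few lines: link to a worker, trap exits, and receive the crash as an ordinary message instead of dying with it. The exit(:boom) here is a stand-in for a real failure:

```elixir
# Trap exits so a linked process's crash arrives as a message
# in our mailbox instead of killing this process too.
Process.flag(:trap_exit, true)

# spawn_link starts the worker and links it to us in one step.
pid = spawn_link(fn -> exit(:boom) end)

receive do
  # The ^pid pin matches the exit notice from that specific worker;
  # a real supervisor would restart it here.
  {:EXIT, ^pid, reason} -> IO.puts("worker died: #{inspect(reason)}")
end
```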