Message order and delivery guarantees in Elixir/Erlang

TL;DR: Some messages may not be delivered, but the order of messages between any two processes is always preserved. Seriously, read the full article; this cannot be condensed into one sentence ;)

Gaspar Chilingarov
Learn Elixir
4 min read · Sep 29, 2017


I cannot stress enough how important it is to understand how concurrent processes interact, so that your Elixir or Erlang program doesn’t hide race conditions or rest on wrong assumptions. This e-mail from the Erlang mailing list gets to the core of it.

The question was:

As I understand it, message delivery is not guaranteed, but message order IS. So how, exactly does that work? What’s the underlying mechanism that imposes sequencing, but allows messages to get lost? (Particularly across a network.) What are the various scenarios at play?

And the answer below (highlights are mine):

This is sort of backwards.

Message delivery is guaranteed, assuming the process you are sending a message to exists and is available, BUT from the perspective of the sender there is no way to tell whether the receiver actually got it, has crashed, disappeared, fell into a network blackhole, or whatever. Monitoring can tell you whether the process you are trying to reach is available right at that moment, but that’s it.

The point is, though, that whether the receiver is unreachable, has crashed, got the message and did its work but was unable to report back about it, or whatever, it’s all the same reality from the perspective of the sender.

That also means that sending a confirmation is not enough, because the original sender may miss it. And if you need a confirmation of the confirmation, you are on a slippery slope toward “turtles all the way down” :)

“Unavailable” means “unavailable”, no matter what the cause, because the cause cannot be determined from the perspective of the sender. You can only know this with an out of context check of some sort, and that is basically the role the runtime plays for you with regard to monitors and links.
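To make the point about monitors concrete, here is a minimal sketch. The module name `DeliveryDemo` and its function are hypothetical, invented only for illustration. It sends a message to a local process that has already exited: `send/2` returns normally, and only a monitor reveals that nobody is there.

```elixir
defmodule DeliveryDemo do
  # Hypothetical demo module, not part of any library.
  def send_to_dead_process do
    # Spawn a process that exits immediately, and wait until it is gone.
    pid = spawn(fn -> :ok end)
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    end

    # send/2 simply returns the message; the sender gets no error
    # even though no process will ever receive it.
    result = send(pid, :hello)

    # Monitoring a process that no longer exists produces an immediate
    # :DOWN message with the reason :noproc.
    ref2 = Process.monitor(pid)

    down_reason =
      receive do
        {:DOWN, ^ref2, :process, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    {result, Process.alive?(pid), down_reason}
  end
end
```

Calling `DeliveryDemo.send_to_dead_process()` shows that the send itself tells the sender nothing; the information comes from the runtime through the monitor.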

The OTP synchronous “call” mechanism is actually a complex procedure built from asynchronous messages, unique reference tags, and monitors.
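As a rough illustration of that idea, the sketch below builds a synchronous call from a monitor, a reference tag, and plain asynchronous messages. It is a simplified approximation, not the actual OTP implementation; the module `CallSketch`, the `{:call, {from, ref}, request}` message shape, and the echo server are all assumptions made for this example.

```elixir
defmodule CallSketch do
  # Hypothetical client side: monitor the server, tag the request with the
  # monitor reference, then wait for a tagged reply, a :DOWN, or a timeout.
  def call(pid, request, timeout \\ 1_000) do
    ref = Process.monitor(pid)
    send(pid, {:call, {self(), ref}, request})

    receive do
      {^ref, reply} ->
        # Reply arrived; drop the monitor and any pending :DOWN message.
        Process.demonitor(ref, [:flush])
        {:ok, reply}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        # The server died (or never existed) before replying.
        {:error, {:down, reason}}
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        {:error, :timeout}
    end
  end

  # Hypothetical server loop that echoes every request back to the caller.
  def echo_loop do
    receive do
      {:call, {from, ref}, request} ->
        send(from, {ref, {:echo, request}})
        echo_loop()
    end
  end
end
```

With `server = spawn(&CallSketch.echo_loop/0)`, a call such as `CallSketch.call(server, :ping)` returns a tagged reply, while a call to a dead process returns an error instead of blocking forever. Note that `{:error, :timeout}` still does not tell the caller whether the server processed the request.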

What IS guaranteed is the ordering of messages *relative to two processes*.

If A sends B the messages 1, 2 and 3 in that order, they will certainly arrive in that order (assuming they arrive at all — meaning that B is available from the perspective of A). If C sends B the messages 4, 5, 6 in that order those will also certainly arrive in that order for B. If A sends B and C the messages 1, 2 and 3, and as a reaction C starts sending B the messages 4, 5, 6 — we can never know what order of interleaving these will have.

Unfortunately, during unit testing the Erlang scheduler usually runs processes in more or less the same order, so you rarely see messages arrive in an unexpected interleaving. That’s why it is useful to run your ExUnit tests with async: true and random seeds: this puts the scheduler under varying load and may expose a race condition. If your tests are flaky, that is a sign of a possible message ordering problem.

It could be [1,2,3,4,5,6], or [1,2,4,5,3,6] or [1,4,5,6,2,3] or whatever, but only the relative ordering between a pair of processes can be known.
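The ordering rule can be sketched in a few lines. This is a simplified variant of the scenario above, in which senders A and C each send their messages to B independently; the module `OrderingDemo` and its function names are hypothetical. The only property the code relies on is the per-sender order, never the global interleaving.

```elixir
defmodule OrderingDemo do
  # Hypothetical demo: the calling process plays the role of B.
  def run do
    b = self()
    # A unique tag so we only pick up messages that belong to this run.
    tag = make_ref()

    # Sender A and sender C run concurrently and each send three messages.
    spawn(fn -> Enum.each([1, 2, 3], fn n -> send(b, {tag, :a, n}) end) end)
    spawn(fn -> Enum.each([4, 5, 6], fn n -> send(b, {tag, :c, n}) end) end)

    # Collect all six messages in the order B actually receives them.
    for _ <- 1..6 do
      receive do
        {^tag, sender, n} -> {sender, n}
      after
        1_000 -> raise "timed out waiting for messages"
      end
    end
  end

  # Extract the numbers that came from one sender, in arrival order.
  def from_sender(received, sender) do
    for {^sender, n} <- received do
      n
    end
  end
end
```

Whatever interleaving `OrderingDemo.run()` produces, filtering it with `from_sender/2` gives the messages of A and of C in the order each of them sent. A test that asserts on the full interleaved list would be encoding a wrong assumption.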

A digression about design implications…

One magical side effect of these strict guarantees AND strict ambiguities is that right from the start of a project in Erlang, even one running on a local system, you wind up staring the CAP theorem straight in the face.

This tends to result in a better understanding of the constraints introduced by concurrency and distribution because they are present in the mind of every developer right from the start. The general outcome I’ve noticed (but don’t know how to quantify with a metric of any sort) is that consideration of design tradeoffs rules architecture, even on a subconscious level, and this really bears itself out as a project matures.

-Craig

The same applies to inter-node communication. The order of messages sent between two processes is preserved, but the message flow may be interrupted at any moment, so any number of messages in between may be lost.

Follow the full discussion on the mailing list here.

About me

I’m Gaspar Chilingarov. I facilitate DevOps transitions, help move legacy applications to the cloud, and write high-performance Elixir apps.

Need help with your Elixir app, or want to prototype your next microservice in Elixir? DM me on Twitter or GitHub.

You can connect with me on Twitter, Facebook, LinkedIn and GitHub.

