The Poison Pill Process

A Harrowing Tale from the Kingdom of Elixir and OTP

I’m not going to tell you the moral of this story right now.

Instead I hope this story speaks for itself.

The moral of the story is this: there is wisdom in studying OTP design patterns early on in an Elixir career, and risk in ignoring them.

The Kingdom

The kingdom of Elixir and OTP is a bright and happy place, and above all a stable one. Colloquially, this peace is referred to as the time of Fault Tolerance. That stability was one of the founding goals of the kingdom’s predecessor and neighbor, Erlang. It was forged directly into the kingdom’s first principles, and into many of its subsequent founding decisions.

The kingdom is made of many isolated parts, called processes. These processes are isolated at their very core: each has its own heap and stack. If one succumbs to disease or failure, it does not spread to another.

While they are isolated, these processes communicate frequently with each other. Each has its own mailbox, and they request and send goods in the form of data through a beautiful system of message passing.

For any system to be more than the sum of its pieces, certainly to be a kingdom, requires a special gift — that of effective coordination. To this end the kingdom has what is lovingly referred to as the Scheduler. The scheduler ensures that all of these processes work in the most effective manner possible, and utilize their precious natural resources, The Cores, as efficiently as possible.

This kingdom is not a fairytale, however. Over time there have been many ways for an individual Process to suffer greatly and fail, entering a time known as the Crash.

To support stability in the realm, the kingdom created a vast library of systems interaction and knowledge, known as the OTP. From the OTP came many gifts of greatness. It taught processes how to structure themselves maturely, and provided many standards for behaving well. OTP taught these processes that there were times when they should behave as a client to others, and times when they must become a server. These processes became known as GenServers, able to serve up the handling of calls, casts, and info messages.

But there was something possibly even greater than the refinement of these processes, given to the kingdom from OTP. It has come to symbolize the Fault Tolerance as a physical manifestation.

It is the Supervision Tree.

The Supervision Tree is a great being. It extends its vast network of roots, known as Supervisors, to monitor many of the most crucial processes in the land. And should one of these processes succumb to the Crash, the Supervisor can give the gift of rebirth to the process, so that it may rise up from the ashes like a great Phoenix. In this way, all of the other processes in the kingdom that interact with and depend on that process can progress forward peacefully.

Still, despite these principles and gifts, the kingdom is at the whim of its designers, its programmers of OTP, its architects, the people that make the kingdom work.

The Fall

It was here, in this kingdom, a long, long time ago, that there was invented a certain kind of Process. This Process was innocent in its guise but monstrous in its potential: it had within it the power to bring the entire kingdom to its knees.

Unfortunately for the realm, this process appeared in many blog posts. Its popularity led it to grow within several Production Systems.

It was common practice in those dark days to take advantage of message passing for a Process to send itself a reminder. This reminder was usually a task it must fulfill in the future. To learn about this pattern we must concoct a new Elixir project with a supervision tree:

$ mix new poison_pill --sup

Like most of Elixir’s popular frameworks, our project is under the watchful eye of a great Supervision Tree. Behold, within the poison_pill/lib/poison_pill/application.ex module, we can glimpse the application supervisor:

The Great Supervision Tree
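The embedded file does not appear here, so what follows is only a rough sketch of what mix new poison_pill --sup tends to generate. The exact comments and layout vary by Elixir version, so treat the details as assumptions rather than the author's exact file:

```elixir
defmodule PoisonPill.Application do
  @moduledoc false

  use Application

  def start(_type, _args) do
    # List all child processes to be supervised.
    children = [
      # Starts a worker by calling: PoisonPill.Worker.start_link(arg)
      # {PoisonPill.Worker, arg}
    ]

    # :one_for_one restarts only the child that crashed.
    opts = [strategy: :one_for_one, name: PoisonPill.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```

With the worker line still commented out, the supervisor starts with no children at all.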

This process could do many things. In this particular tale, it was a simple worker that kept a local cache of data gleaned through an API. This process was no lightweight. Indeed, it was an OTP GenServer. When it was born, this GenServer would send itself a message to query the API, and schedule another query for the future. The beginnings of this GenServer were humble:

$ cd poison_pill
$ touch lib/poison_pill/worker.ex
Baby Steps to create worker.ex
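The original embed is missing here as well. A minimal sketch of such a worker, assuming it does nothing more than announce itself when it starts, might look like this:

```elixir
defmodule PoisonPill.Worker do
  use GenServer

  # Client API: called by the supervisor to start the process.
  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: __MODULE__)
  end

  # Server callback: runs inside the new process as it boots.
  @impl true
  def init(state) do
    IO.puts("HELLO WORLD")
    {:ok, state}
  end
end
```

In this sketch the child entry in application.ex would be {PoisonPill.Worker, []}; the empty list is an assumed argument that simply becomes the initial state.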

For this GenServer to come alive, the worker in the application.ex file must be uncommented, saved, and fired off with a mix run:

$ mix run
Compiling 1 file (.ex)
HELLO WORLD

This tale is just starting to get interesting: we must send ourselves a message to sync with the API. We’ll allow ourselves to send the message right away, and then we’ll update our cache every 24 hours:

GenServer Skeleton in worker.ex
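The skeleton itself is not shown, so here is a hedged reconstruction of the self-reminder pattern described above. The schedule_work name comes from later in this tale; the :work message and the @one_day attribute are assumptions made for illustration:

```elixir
defmodule PoisonPill.Worker do
  use GenServer

  # 24 hours in milliseconds (assumed name).
  @one_day :timer.hours(24)

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: __MODULE__)
  end

  @impl true
  def init(state) do
    # Queue the first reminder immediately, before init even returns.
    schedule_work(:now)
    {:ok, state}
  end

  @impl true
  def handle_info(:work, state) do
    IO.puts("HANDLED ALL THAT INFO")
    # Queue the next reminder for tomorrow.
    schedule_work(:later)
    {:noreply, state}
  end

  defp schedule_work(:now), do: send(self(), :work)
  defp schedule_work(:later), do: Process.send_after(self(), :work, @one_day)
end
```

The process only ever talks to its own mailbox here: send/2 delivers the reminder right away, and Process.send_after/3 delivers it after the delay.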

Building a new process is ambitious and perilous, so we must do one more check to make sure the Compiler is pleased and our Process behaves as expected:

$ mix run
Compiling 1 file (.ex)
HANDLED ALL THAT INFO

For the good of the kingdom, and the glory of our process, it is time to retrieve all the datas. We’ll add an HTTP client in our mix.exs file (side note: the poison in httpoison is unrelated to this blog’s title, and the library is a super great high-level HTTP client):

The Mighty HTTP Client
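The dependency entry is not shown, so this fragment is a guess at its shape. The version requirement in particular is an assumption, not the one the original project used:

```elixir
# mix.exs (fragment): only the deps function is shown.
defp deps do
  [
    # Assumed version requirement; pin whichever release you have vetted.
    {:httpoison, "~> 1.0"}
  ]
end
```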

And download / install the package with:

$ mix deps.get
Resolving Hex dependencies...
Dependency resolution completed:
...blah
...blah

Now let’s power up our server by completing the sync:

API Sync in worker.ex Server API
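The sync code is missing from this spot. The sketch below is a reconstruction based on what the tale says later: the worker calls HTTPoison.get! as soon as it boots. The URL is a placeholder assumption, as is storing the response body under a :data key:

```elixir
defmodule PoisonPill.Worker do
  use GenServer

  # Placeholder endpoint (an assumption): nothing is expected to answer here.
  @url "http://localhost:4001/api/dataa"
  @one_day :timer.hours(24)

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: __MODULE__)
  end

  @impl true
  def init(_args) do
    schedule_work(:now)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:work, state) do
    # The bang version raises on any failure, which crashes this process.
    response = HTTPoison.get!(@url)
    schedule_work(:later)
    {:noreply, Map.put(state, :data, response.body)}
  end

  defp schedule_work(:now), do: send(self(), :work)
  defp schedule_work(:later), do: Process.send_after(self(), :work, @one_day)
end
```

Nothing guards the request, so any failure inside handle_info/2 takes the whole process down with it.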

We can verify our unbreakable, fault tolerant, indestructible OTP application will work by running:

$ mix run --no-halt

WAIT A SECOND.

Why did our application exit? So what if we put a typo in our URL? It’s not that bad of a runtime error. Isn’t this system supposed to “let it crash”?

How did our tiny, innocent little process bring down the kingdom’s great Supervision Tree? Shouldn’t the process simply be restarted to try again?

Well, we can employ the kingdom’s magical detective (the Programmer) to investigate.

The supervisor in our application.ex file uses the :one_for_one supervision strategy. This means that each time the worker goes down, our supervisor will restart just that single process, without touching any other processes the kingdom may rely on. This seems about right; we don’t even have any other workers in our application.

How is this evil poison pill process casting its black magic?

Let’s dig more into the massive library of knowledge we have accumulated known as the OTP.

It turns out that even the most basic implementation relies on quite a few defaults. For example, if we check out the Elixir docs on Supervisor, a child specification takes a :restart option. We didn’t pass one in, but it looks like the default is :permanent. This also seems like what we want: it tells the great Application Supervisor to always try to resurrect our great Phoenix of a Process.
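To make that default visible, here is a small sketch that spells the :restart value out with Supervisor.child_spec/2. The stripped-down worker is a stand-in written only for this example:

```elixir
defmodule PoisonPill.Worker do
  use GenServer

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: __MODULE__)
  end

  @impl true
  def init(args), do: {:ok, args}
end

# Same child as {PoisonPill.Worker, []}, with the restart value made explicit.
spec = Supervisor.child_spec({PoisonPill.Worker, []}, restart: :permanent)
IO.inspect(spec.restart)
```

Passing this spec in the children list should behave the same as the bare {PoisonPill.Worker, []} tuple, since :permanent is what the supervisor assumes anyway.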

What dark secret continues to hide in these depths?

Aha! It appears that our application supervisor applies even more defaults on startup. According to these docs, it takes :max_restarts and :max_seconds options. By default, our supervisor will tolerate a maximum of 3 restarts within any 5-second window; one more crash inside that window and the supervisor itself gives up and shuts down.

Indeed, if we trace our logs, we see that the :econnrefused runtime error occurred a 4th time right before the entire application crashed.
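The same collapse can be reproduced without any HTTP at all. In this sketch, CrashyWorker and RestartDemo are hypothetical names, and a plain raise stands in for the failing request. The restart limits are written out explicitly, although they match the defaults:

```elixir
defmodule CrashyWorker do
  use GenServer

  def start_link(_args), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # Same trick as the poison pill: do the risky work right after boot.
    send(self(), :work)
    {:ok, :ok}
  end

  @impl true
  def handle_info(:work, _state) do
    # Stand-in for the failing HTTP call.
    raise "sync failed"
  end
end

defmodule RestartDemo do
  def run do
    # Trap exits so the supervisor's death arrives as a message, not a crash.
    Process.flag(:trap_exit, true)

    {:ok, sup} =
      Supervisor.start_link([CrashyWorker],
        strategy: :one_for_one,
        max_restarts: 3,
        max_seconds: 5
      )

    receive do
      {:EXIT, ^sup, reason} -> reason
    after
      5_000 -> :still_running
    end
  end
end

IO.inspect(RestartDemo.run(), label: "supervisor exit reason")
```

The worker crashes, is restarted three times, and crashes a fourth time, at which point the supervisor stops trying and exits with reason :shutdown. In a real application, that exit is what takes the application down.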

The culprit, it seems, is our innocent schedule_work(:now) function. It attempts to sync with an API immediately on boot, crashes right away, and rinses and repeats until it drags our entire application down with it.

This is even more insidious when we consider that no matter how far down the supervision tree this process sits, layered beneath depths of supervisors, each supervisor that exhausts its restart limit crashes in turn, so the failure can cascade upward as the tree tries to reboot the Poison Pill.

This leads us to the moral of this dark fairytale.

“Let it crash” may be fun to say, or to lob at the languages that require excruciating amounts of explicit error handling. But moving fast and NOT breaking things is what Elixir ultimately offers.

In respect of that, we should be very very careful in the design of our beautiful OTP processes to not let them crash rapidly, especially when a process is initializing or early on in its lifecycle.

In particular, if we’re calling functions with a bang (!) symbol, as the caller we must be very careful to understand that they can raise an exception, and wrap them in a try / rescue block.

If we’re using pattern matching control flow on the return values, such as {:ok, value} / {:error, value} return tuples, we must be very careful as the function callers to pattern match on every possible return type to avoid a match exception.
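Both habits can be shown with the standard library alone. In this sketch, SafeRead is a hypothetical module, and File.read!/1 and File.read/1 stand in for any bang and non-bang pair:

```elixir
defmodule SafeRead do
  # Bang version: File.read!/1 raises File.Error, so the caller rescues it.
  def with_rescue(path) do
    {:ok, File.read!(path)}
  rescue
    error in File.Error -> {:error, error.reason}
  end

  # Non-bang version: every possible return shape gets its own clause.
  def with_case(path) do
    case File.read(path) do
      {:ok, contents} -> {:ok, byte_size(contents)}
      {:error, reason} -> {:error, reason}
    end
  end
end

IO.inspect(SafeRead.with_rescue("/no/such/file"))
IO.inspect(SafeRead.with_case("/no/such/file"))
```

Neither call crashes the caller when the file is missing; both hand back an {:error, reason} tuple that the caller can act on.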

If we cannot escape doing work rapidly in a process lifecycle, we must also think about the behavior of our process, and consider if we need to more carefully choose the supervisor’s behavior supporting it.

The designers of this poison pill can render it ineffectual by being thoughtful about the HTTPoison.get! call. Indeed, we want to take different actions based on the return value of the call. This means we are introducing control flow.

Try / rescue blocks aren’t generally intended to be used as control flow in Elixir; rather, they were created to protect the Process against truly exceptional catastrophes. We can use the non-bang version, HTTPoison.get, instead, and wrap it in a case statement:

Safely wrapped handle_info/2
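The repaired callback is not shown here, so this is one possible shape for it, not necessarily the author's. The URL, the :data key, and the choice to print and keep the old cache on failure are all assumptions. The clauses match on plain map shapes rather than naming the HTTPoison structs, which keeps the sketch loose:

```elixir
defmodule PoisonPill.Worker do
  use GenServer

  # Placeholder endpoint (an assumption).
  @url "http://localhost:4001/api/data"
  @one_day :timer.hours(24)

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: __MODULE__)
  end

  @impl true
  def init(_args) do
    schedule_work(:now)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:work, state) do
    new_state =
      case HTTPoison.get(@url) do
        # Success: refresh the cache.
        {:ok, %{status_code: 200, body: body}} ->
          Map.put(state, :data, body)

        # The server answered, but not with what we wanted: keep the old cache.
        {:ok, %{status_code: status}} ->
          IO.puts("sync skipped, unexpected status: #{status}")
          state

        # The request itself failed (for example :econnrefused): keep the old cache.
        {:error, error} ->
          IO.puts("sync failed: #{inspect(error)}")
          state
      end

    # Whatever happened, the process survives to try again later.
    schedule_work(:later)
    {:noreply, new_state}
  end

  defp schedule_work(:now), do: send(self(), :work)
  defp schedule_work(:later), do: Process.send_after(self(), :work, @one_day)
end
```

A failed sync now costs one log line instead of a restart, so the supervisor's restart budget is never touched. Retrying only after 24 hours is the simplest choice for a sketch; a shorter retry delay on failure may suit a real cache better.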

With our kingdom’s design restored we shall label the older poison pill process an anti-pattern, so that others would know not to repeat it.

But the world is dark and scary. And no matter how careful we are, a kingdom may come violently crashing down.

It was with this in mind, in preparation against the darkness, that the kingdom embarked on the great journey towards Distribution.

To be continued…