Graceful shutdown on Kubernetes with signals & Erlang OTP 20

One of the general tenets of Kubernetes is that the pods of a service are disposable, and can be started and terminated at any time.

For instance, the Horizontal Pod Autoscaler scales the number of instances of a service on your cluster according to metrics such as a pod’s CPU utilisation. When used with AWS auto-scaling groups, this can save you (or your company) a shed load of money, by scaling pods (and ultimately nodes) out when utilisation is high, and scaling them back in when utilisation is lower again.

Scale-out and scale-in also happen when you make a new deployment, allowing old versions of your processes to be progressively replaced with new ones.

Ideally this all happens without interrupting your service; but in practice, achieving that needs a little help from your application.

Life before termination

In order to ensure that pod termination doesn’t interrupt a service, your application must ensure that traffic is routed away from it before it exits. Kubernetes supports this via its readinessProbe and termination procedure.

The readinessProbe

The readinessProbe is a check that Kubernetes runs regularly to determine whether your pod can currently accept traffic: if the probe fails, traffic will start being routed to other replicas. Note that the pod won’t be killed if it fails the probe; it will simply be ignored for the purposes of traffic routing until the probe starts succeeding again. (Its companion, the livenessProbe, is a lower-level check which determines if the pod is alive at all, and in this case failure means that Kubernetes will kill and restart the container.)

For example, you can use the httpGet implementation to call a URL on your pod to determine readiness: if the request fails to return a success status (≥ 200 and < 400) then the pod is not available for traffic.

It’s worth noting that the re-routing due to the readinessProbe isn’t instant: if your probe is configured to run every 2 seconds, then it’s going to take at least that long before you can guarantee that you won’t get new requests, and probably longer, because the probe may have to fail several times in a row (its failureThreshold) and the routing parts of Kubernetes (e.g. Istio) take time to register the change.

The termination procedure

The termination procedure uses a combination of UNIX signals and an optional preStop hook to communicate to the pod that it’s time to shut down.

Kubernetes first runs the optional preStop hook, if you have defined one, and then sends a SIGTERM signal to your main process; the pod has a grace period (30 seconds by default) in which to exit, and if it’s still running after that, it is sent a SIGKILL.

The SIGTERM, the delay, the preStop hook and the readinessProbe, taken together, allow you to stop traffic from being routed to your pod, and let currently running requests complete before your pod terminates.

Building graceful termination

Until Erlang OTP 20, Erlang had no built-in UNIX signal handling, and while a NIF or two were written, it was easiest to just use the preStop hook to call a special URL in the application, which told it to start serving an error status on the readinessProbe endpoint. The preStop.exec implementation actually runs inside your pod, so it can call endpoints not exposed to the outside world.

This is how we did it at the FT under OTP 19:

  • We exposed a /__traffic endpoint on our application which was the target of the readinessProbe. The Plug that handles this endpoint normally serves 200s, but when a particular ETS table is written to, it switches to serving 500s.
  • The preStop hook runs inside the pod to POST to a private endpoint (not exposed externally, because we didn’t want the world shutting down our pods!), causing a write to the ETS table, and subsequently making the /__traffic endpoint start returning 500s, routing traffic away.
  • After the 30s delay, the pod is terminated by Kubernetes.
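
To make the ETS trick concrete, here is a minimal sketch in Erlang of the kind of flag the /__traffic handler could consult. It isn’t our actual code: the module name drain_flag and all of its function names are made up for illustration, and it assumes the readiness endpoint simply asks status_code/0 what to serve.

```erlang
-module(drain_flag).
-export([init/0, start_draining/0, draining/0, status_code/0]).

%% Hypothetical table name, chosen for this sketch only.
-define(TABLE, drain_flag).

%% Create a public, named table so that any process can read or set the flag.
%% The table is owned by the calling process and disappears when it exits.
init() ->
    ?TABLE = ets:new(?TABLE, [named_table, public, set, {read_concurrency, true}]),
    ok.

%% Called from the private "please drain" endpoint.
start_draining() ->
    true = ets:insert(?TABLE, {draining, true}),
    ok.

%% True once start_draining/0 has been called.
draining() ->
    case ets:lookup(?TABLE, draining) of
        [{draining, true}] -> true;
        _ -> false
    end.

%% Status the readiness endpoint should serve.
status_code() ->
    case draining() of
        true -> 500;
        false -> 200
    end.
```

The private endpoint hit by the preStop hook would call drain_flag:start_draining(), after which the probe starts seeing 500s. Because an ETS table dies with its owner, init/0 should be called from something long-lived.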

We still do the same in our Java apps, since the JVM offers only shutdown hooks rather than a supported signal-handling API (or ordered shut-down, for that matter). It was ‘good enough’, but felt kind of clunky.

Now that Erlang OTP 20 supports OS signals, we can do something much nicer.

Erl Signal Server

The Erlang documentation for the new OTP 20 signal feature is a bit poor, split over os:set_signal/2, the Kernel app and gen_event. For those more familiar with (the much better documented) Elixir, it’s pretty foxing. So here’s a quick guide.

As well as the more familiar gen_server (Elixir: GenServer) behaviour, Erlang has a gen_event behaviour, which is a bit like a mash-up between a gen_server and a supervisor, and consists of an event manager, which receives events, and event handlers to which events are then propagated. Event handlers can be added to and removed from the event manager at any time.

The new OS signal feature is just another gen_event manager, started by default, named erl_signal_server. The manager receives messages, in the form of atoms such as sigterm and sigusr1, when signals are sent to the BEAM OS process, and distributes them to its handlers.

The erl_signal_server has a default event handler automatically added to it, which is defined in erl_signal_handler.erl; on a SIGTERM it immediately calls init:stop/0, gracefully stopping the BEAM. That’s great, but it’s not good enough for us, since we want to let any current requests finish before shut-down.
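
If gen_event is new to you, a tiny handler may help. The sketch below is not part of our library: signal_forwarder is a hypothetical module that simply forwards every signal event it sees to a pid supplied when the handler is installed.

```erlang
-module(signal_forwarder).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2]).

%% The handler's state is just the pid that wants to hear about signals.
init(Pid) when is_pid(Pid) ->
    {ok, Pid}.

%% Signal events arrive as atoms such as sigterm or sigusr1.
handle_event(Signal, Pid) when is_atom(Signal) ->
    Pid ! {signal, Signal},
    {ok, Pid};
handle_event(_Other, Pid) ->
    {ok, Pid}.

%% gen_event expects handle_call/2 to exist, even if nobody calls the handler.
handle_call(_Request, Pid) ->
    {ok, ok, Pid}.

%% Plain messages sent to the event manager are ignored by this handler.
handle_info(_Msg, Pid) ->
    {ok, Pid}.
```

Installed with gen_event:add_handler(erl_signal_server, signal_forwarder, self()), it would sit alongside the default handler rather than replace it. Signals other than the few the VM handles out of the box may first need enabling with os:set_signal/2, e.g. os:set_signal(sigusr2, handle).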

Customising the handler for graceful shut-down

Because it’s just an event handler, we can remove the default handler from the erl_signal_server, and replace it with our own.

Our new handler will receive the sigterm event, write to the ETS table as our previous version did, and then use erlang:send_after/3 to send itself a message after a delay; when that message arrives, the handler calls init:stop/0.

Our code for this is in k8s_signal_handler.erl; it’s written in Erlang rather than Elixir because I got fed up of typing colons for calling all the underlying Erlang functions!

The start_link/1 function creates the ETS table and swaps our handler in place of the default one:

ok = gen_event:swap_sup_handler(
  erl_signal_server,
  {erl_signal_handler, []},
  {k8s_signal_handler, [Table, Delay, Test]}),

Table is the name of our ETS table, also created in init/1. erl_signal_server is the name of the OTP signal event manager. Delay is the amount of time to wait before calling init:stop/0, and Test is a parameter that supports our tests.

Actually, our start_link function doesn’t start a process at all: a gen_event handler runs inside the event manager’s process, and gen_event:swap_sup_handler/3 simply installs it there and ties it to the calling process, so that the handler is removed if that process exits. The function therefore returns the atom ignore, which a calling supervisor, erm, ignores; see Supervisor.start_child/2.

The event handler for sigterm is essentially (ignoring test support code):

handle_event(sigterm, {Table, Delay, _} = State) ->
    ets:insert(Table, {draining, true}),
    erlang:send_after(Delay, self(), stop),
    {ok, State}.

It updates the ETS table, and sends itself a stop message, after the delay.

Like GenServer, gen_event supports receiving plain ‘info’ messages as well as ones particular to the behaviour, and we use this to shut down when we eventually receive the stop message:

handle_info(stop, {_, _, _} = State) ->
    ok = init:stop(),
    {ok, State}.
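
Put together, a stripped-down handler might look like the sketch below. It is not the real k8s_signal_handler.erl: the module name is invented, and in place of our Test parameter it takes a StopFun, an assumption made here so that the example can be exercised without actually stopping the VM (in production it would be fun init:stop/0).

```erlang
-module(drain_handler_sketch).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2]).

%% When installed via gen_event:swap_handler/3 or swap_sup_handler/3, init/1
%% receives {Args, Term}, where Term is what the old handler's terminate/2
%% returned. With add_handler/3 it receives Args directly.
init({Args, _OldHandlerResult}) ->
    init(Args);
init([Table, Delay, StopFun]) ->
    {ok, {Table, Delay, StopFun}}.

%% On SIGTERM: flag the drain in ETS, then schedule a stop message.
%% self() here is the event manager process, which runs the handler code.
handle_event(sigterm, {Table, Delay, _StopFun} = State) ->
    true = ets:insert(Table, {draining, true}),
    _TimerRef = erlang:send_after(Delay, self(), stop),
    {ok, State};
handle_event(_OtherSignal, State) ->
    {ok, State}.

handle_call(_Request, State) ->
    {ok, ok, State}.

%% The delayed message arrives as a plain info message.
handle_info(stop, {_Table, _Delay, StopFun} = State) ->
    StopFun(),
    {ok, State};
handle_info(_Msg, State) ->
    {ok, State}.
```

Note that the ETS table has to be public (or owned by the event manager), because the insert happens in the manager’s process rather than in whichever process created the table.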

That’s all we really need on the signal handling side, but there is also an Elixir module, FT.K8S.TrafficDrainHandler, which supports starting the event handler in a supervision tree via child_spec/1 and start_link/1, and querying the state of the ETS table via draining?/0. This too could have been written in Erlang, but it’s also a place to attach some nice documentation! ;)

The Plug

As for the Plug side, all we need is a simple plug to serve /__traffic, which checks the draining?/0 function to see if it should be serving 200 or 500; this is our FT.K8S.TrafficDrainPlug module. We can just add this to our router:

plug FT.K8S.TrafficDrainPlug

Kubernetes configuration

We set up the readinessProbe to call our /__traffic endpoint, as we described previously:

- name: ...
  image: ...
  ports:
    - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /__traffic
      port: 8080
    initialDelaySeconds: 20
    periodSeconds: 2

Starting the handler

We also need to remember to start the TrafficDrainHandler at some point, e.g. in our Application.start/2 function:

def start(_type, _args) do
  # Define workers and child supervisors to be supervised
  children = [
    # Start the endpoint when the application starts
    {MyApp.Web.Endpoint, []}, # Phoenix
    # Swap in the SIGTERM handler that drains traffic before stopping
    {FT.K8S.TrafficDrainHandler, k8s_drainer_opts()}
  ]

  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end

# Options for the drainer, read from the application config
defp k8s_drainer_opts do
  Application.get_env(:myapp, :connection_draining, [])
end

The k8s_drainer_opts/0 function just allows us to set some config options for the drainer, such as shutdown_delay_ms — we set this in our dev config to avoid having to wait 20s for the VM to terminate after a Control-C:

# config/dev.exs
config :myapp, :connection_draining,
  shutdown_delay_ms: 1

It works

If you needed proof that all this is required, look no further than these couple of actual (but redacted!) log entries from one of our apps during a deployment:

11:47:09.538 [info] method=POST path="/xxx" type=Sent status=403 duration=253
***K8STrafficDrain: SIGTERM received. Draining and then stopping in 20000 ms
11:47:11.444 [info] method=GET path="/xxx/xxx-svc"
11:47:11.538 [info] method=GET path="/xxx/xxx-svc" type=Sent status=200 duration=94345
***K8STrafficDrain: Stopping due to earlier SIGTERM

Kubernetes sent us a SIGTERM and then sent us another HTTP request: if we had started shutting down immediately, this request would have failed.

End notes

If you are using Distillery to create releases, ensure you use a more recent version that supports OTP 20: previously Distillery’s scripts trapped and handled signals, but if you are running on OTP 20 they now propagate signals to the Erlang BEAM process, allowing our replacement signal handler to do its work.

The whole repo is on GitHub. Enjoy!

2017–11–23 — corrected typos, simplified handler code, added ‘It works’.