Swoosh crashes

Post-incident bug investigation

Rdaghmoura
oVice
8 min read · Sep 9, 2022

At oVice, we are migrating our main API service to Elixir, using Phoenix as the web framework. We use Kubernetes to replicate it across pods, much like many of the other services we run. For a few months now, we have noticed sporadic restarts of this recently migrated service. These restarts were the result of unexpected application crashes that occurred without any anomalous behavior preceding them.

At first, it was fairly rare: about once a week, a pod would crash, Kubernetes would immediately create a new healthy one, and the problem would resolve itself. However, as the load increased over the past two months and more endpoints were migrated to the new service, we started to notice more and more occurrences of this issue.

It started happening more frequently, up to a few times each day, and even recently created pods were occasionally crashing and getting stuck in a CrashLoopBackOff state. It also started causing disruptions and exacerbated every incident we experienced in August.

These are logs of our pods’ state at their worst during our last major incident:

NAME                            READY   STATUS             RESTARTS   AGE
ex-api-7c4485df8-2klbf          1/2     CrashLoopBackOff   7          18h
ex-api-7c4485df8-826r9          1/2     CrashLoopBackOff   8          18h
ex-api-7c4485df8-9cnbx          2/2     Running            7          18h
ex-api-7c4485df8-cghgc          0/2     Init:0/1           0          2s
ex-api-7c4485df8-f7dnq          1/2     OOMKilled          7          18h
ex-api-7c4485df8-nl9gn          1/2     CrashLoopBackOff   6          18h
ex-api-7c4485df8-nnzjs          0/2     Init:0/1           0          2s
ex-api-7c4485df8-ph6qj          1/2     CrashLoopBackOff   7          18h
ex-api-7c4485df8-t95xp          1/2     CrashLoopBackOff   4          18h
ex-api-worker-9d97875cf-d7vc7   1/1     Running            0          18h

Due to the constant failure of pods, the healthy ones became overloaded with requests, causing them to generate a large number of errors before getting killed by Kubernetes for exceeding their memory limit. This was one of several cascading failures caused by the sudden crashing issue.

This is a log sample of Kubernetes events while it was happening:

ex-api   34m    Warning   Unhealthy    pod/ex-api-74d4ddf78-nbtjf   Readiness probe failed: Get "http://100.0.33.211:4000/ready": dial tcp 100.0.33.211:4000: connect: connection refused
ex-api   34m    Warning   Unhealthy    pod/ex-api-74d4ddf78-lfpbm   Liveness probe failed: Get "http://100.0.34.76:4000/healthz": dial tcp 100.0.34.76:4000: connect: connection refused
ex-api   34m    Warning   Unhealthy    pod/ex-api-74d4ddf78-xcfp2   Liveness probe failed: Get "http://100.0.36.5:4000/healthz": dial tcp 100.0.36.5:4000: connect: connection refused
ex-api   34m    Warning   Unhealthy    pod/ex-api-74d4ddf78-lfpbm   Readiness probe failed: Get "http://100.0.34.76:4000/ready": dial tcp 100.0.34.76:4000: connect: connection refused
ex-api   34m    Warning   BackOff      pod/ex-api-74d4ddf78-k9z8b   Back-off restarting failed container
ex-api   34m    Warning   BackOff      pod/ex-api-74d4ddf78-lfpbm   Back-off restarting failed container
ex-api   34m    Warning   BackOff      pod/ex-api-74d4ddf78-p2g5j   Back-off restarting failed container

Failing Kubernetes health probes tell us that the service is unresponsive, but they don’t answer why. Our health check endpoints are as simple as they could be, so it definitely wasn’t the fault of a dependent system:

def healthz(conn, _params) do
  render(conn, "health.json")
end
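
For completeness, the probes simply hit two plain routes. A hypothetical wiring in the Phoenix router could look like this (the scope and controller names are illustrative, not our actual code):

# Hypothetical router entries for the probe endpoints (names are illustrative)
scope "/", OviceWeb do
  get "/healthz", HealthController, :healthz
  get "/ready", HealthController, :ready
end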

Another unfortunate detail that made debugging this issue more difficult was that we were redirecting the erl_crash.dump file to stderr, which caused the kubectl logs command to truncate the output and show only the tail of the file. That tail wasn’t very useful in determining the source of the crash.

While the SWE team was dealing with the unknown crashing issue, our SRE team was investigating better ways to keep these crash dumps.

Mistake and remediation

One interesting aspect of this issue was that it also occurred while deploying newer versions of the API: while new pods were taking over and old pods were exiting normally (as the old ReplicaSet was being scaled down), one or more random old pods would crash:

Pod crash while deploying

Using that fact to our advantage, we created a new deployment during off-peak hours and streamed multiple pod logs locally to finally capture a crash dump:

[notice] global: Name conflict terminating {Swoosh.Adapters.Local.Storage.Memory, #PID<57113.4096.0>}
[notice] Application swoosh exited: shutdown
Server: api.ovice.in:80 (http)
Request: POST /api/v1/workspaceUser/leave/workspace
** (exit) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
{"Kernel pid terminated",application_controller,"{application_terminated,swoosh,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,swoosh,shutdown})
Crash dump is being written to: erl_crash.dump...=erl_crash_dump:0.5
Fri Aug 19 10:37:36 2022
Slogan: Kernel pid terminated (application_controller) ({application_terminated,swoosh,shutdown})
System version: Erlang/OTP 24 [erts-12.3.2.2] [source] [64-bit] [smp:4:1] [ds:4:1:10] [async-threads:1] [jit:no-native-stack]

All the logs generated by multiple crashing pods mentioned that the swoosh OTP application was the reason why the Erlang Run-Time System (ERTS) was exiting.

So what is Swoosh? Swoosh is the mail delivery library that we currently use. It’s a fairly common one in the Elixir ecosystem and is popular enough to ship as the default email client in all new Phoenix projects (generated by the mix phx.new command).
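
For context, typical Swoosh usage looks roughly like this (the mailer module, otp_app, and addresses are illustrative, not our production code):

# Define a mailer backed by Swoosh (module and otp_app are made up)
defmodule MyApp.Mailer do
  use Swoosh.Mailer, otp_app: :my_app
end

import Swoosh.Email

new()
|> to({"Jane", "jane@example.com"})
|> from({"oVice", "no-reply@example.com"})
|> subject("Welcome!")
|> text_body("Hello!")
|> MyApp.Mailer.deliver()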

Simply searching for ({application_terminated,swoosh,shutdown}) led us to this GitHub issue, where a pull request was made to address this particular problem. That PR added an optional config to disable a GenServer that is only intended to be useful in development. This functionality is mentioned in Swoosh’s documentation.

This was an oversight on our part: we hadn’t consulted this documentation much because the Swoosh dependency was already present when we first generated the project with mix phx.new. A review of these initial configurations was overdue as we continued to scale.

We immediately deployed the new configuration to our production environment, and we are pleased to report that our API has not crashed since. The option we changed was:

# config/runtime.exs
config :swoosh, local: false

If unspecified, the default value for local is true.
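
Conceptually, the flag controls whether the in-memory storage process is put into Swoosh’s supervision tree at all. A simplified sketch of the idea (not Swoosh’s exact source) might look like:

# Simplified sketch: only start the local storage GenServer when :local is enabled
children =
  if Application.get_env(:swoosh, :local, true) do
    [Swoosh.Adapters.Local.Storage.Memory]
  else
    []
  end

Supervisor.start_link(children, strategy: :one_for_one, name: Swoosh.Supervisor)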

Root cause and replicating the bug

At oVice, we strive to better understand our systems, so once the problem was resolved, we began investigating the root cause of why it was occurring.

Erlang applications

The first thing to note about OTP applications is that there are three application start types:

  • permanent: If a permanent application terminates, all other applications and the runtime system are also terminated.
  • transient: If a transient application terminates with reason normal, this is reported but no other applications are terminated. If a transient application terminates abnormally, that is with any other reason than normal, all other applications and the runtime system are also terminated.
  • temporary: If a temporary application terminates, this is reported but no other applications are terminated.

https://www.erlang.org/doc/design_principles/applications.html#application-start-types
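
As an illustration (not something we actually changed), the start type can be chosen explicitly when starting an application manually:

# Illustration only: a :temporary application's termination is merely reported,
# whereas a :permanent one (the default in releases) takes the whole VM down.
Application.stop(:swoosh)
Application.start(:swoosh, :temporary)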

Swoosh, like many other OTP applications, starts as a permanent application, which means that if its top-level supervisor exits for whatever reason, the whole runtime system will exit too.

It’s actually really easy to test this out. Assuming you have an Elixir application with Swoosh as a dependency, and both are started as permanent (start_permanent: true in both applications’ mix.exs files), all you need to run is:

iex(node0@127.0.0.1)2> Swoosh.Supervisor |> Process.whereis() |> Process.exit(:kill)
true
iex(node0@127.0.0.1)3> [notice] Application swoosh exited: killed
{"Kernel pid terminated",application_controller,"{application_terminated,swoosh,killed}"}
Kernel pid terminated (application_controller) ({application_terminated,swoosh,killed})
Crash dump is being written to: erl_crash.dump...done

Development GenServer

Swoosh’s supervision tree is one of the most straightforward ones. Other than the top-level supervisor, it has a single GenServer called Swoosh.Adapters.Local.Storage.Memory:

Swoosh’s supervision tree

This GenServer is responsible for storing emails in memory so that they can be previewed in a simple UI during development. Additionally, it is registered globally to support clustered setups and shares its state between all nodes. It is also enabled by default in all environments unless the local: false configuration mentioned earlier is given.
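
Global registration means that every node in the cluster resolves the very same process. For example (illustrative iex session; the pid value will differ):

iex(node1@127.0.0.1)1> :global.whereis_name(Swoosh.Adapters.Local.Storage.Memory)
#PID<57113.4096.0>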

Looking into how this GenServer is created, we find:

https://github.com/swoosh/swoosh/blob/937ec20823c1ca62dbf89a00cb45355392f6e4f3/lib/swoosh/adapters/local/storage/memory.ex#L15-L24

def start_link(args \\ []) do
  case GenServer.start_link(__MODULE__, args, name: {:global, __MODULE__}) do
    {:ok, pid} ->
      {:ok, pid}

    {:error, {:already_started, pid}} ->
      Process.link(pid)
      {:ok, pid}
  end
end

The notable part of this code snippet is that if the global process is already running on another node, the caller of start_link (which in this case is Swoosh’s top-level supervisor) links to that global GenServer.

Despite links being bidirectional, a crash in the global GenServer will not terminate the corresponding supervisor. This is because supervisors trap exits in order to detect when a child process goes down and decide whether to restart it:

https://github.com/erlang/otp/blob/12664b919f5570b114377a21575bbd60b37780fe/lib/stdlib/src/supervisor.erl#L608-L614

handle_info({'EXIT', Pid, Reason}, State) ->
    case restart_child(Pid, Reason, State) of
        {ok, State1} ->
            {noreply, State1};
        {shutdown, State1} ->
            {stop, shutdown, State1}
    end;
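
To make exit trapping concrete, here is a minimal Elixir sketch with plain processes (not a real supervisor): the trapping process receives the exit as a message instead of being killed.

# Minimal exit-trapping sketch (plain processes, not an actual supervisor)
Process.flag(:trap_exit, true)

pid = spawn_link(fn -> exit(:boom) end)

receive do
  {:EXIT, ^pid, reason} ->
    # The exit signal arrives as a message and the trapping process survives
    IO.inspect(reason, label: "trapped exit")
end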

Ultimately, the link is set up to notify all of Swoosh’s top-level supervisors on the other nodes when the GenServer is no longer alive, forcing them to either start a new one or link to the newly created one.

Another interesting aspect of Erlang supervisors is that they won’t restart a child process forever. By default, supervisors will only restart a process three times within a five-second interval. If this frequency is exceeded, the supervisor terminates all of its children and then exits itself.

We found that with this supervision setup, it is rather simple to hit this restart limit. Doing so crashes the top-level supervisor, the application, and the whole node’s Erlang VM.
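
The following self-contained sketch (hypothetical module, not our code) shows the mechanism: a child that keeps crashing shortly after starting exceeds the default limit of three restarts in five seconds and takes its supervisor down with it.

defmodule Crashy do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # Crash shortly after starting so the supervisor keeps restarting us
    Process.send_after(self(), :boom, 100)
    {:ok, nil}
  end

  @impl true
  def handle_info(:boom, state), do: {:stop, :boom, state}
end

# Trap exits so the calling shell survives the supervisor's shutdown
Process.flag(:trap_exit, true)
{:ok, sup} = Supervisor.start_link([Crashy], strategy: :one_for_one)
Process.sleep(1_000)
Process.alive?(sup) # => false once the restart intensity is exceeded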

To replicate the issue we need to run two nodes of our application:

PORT=4000 iex --name node0@127.0.0.1 --cookie test -S mix phx.server
PORT=4001 iex --name node1@127.0.0.1 --cookie test -S mix phx.server

Note that we need to pass the same cookie argument so that these nodes can form a cluster together.

And then, from one of these nodes, let’s try to connect/disconnect the other node more than three times in less than five seconds, e.g.:

def trigger do
  Node.connect(:"node1@127.0.0.1")
  Process.sleep(400)
  Node.disconnect(:"node1@127.0.0.1")
  Process.sleep(400)
  Node.connect(:"node1@127.0.0.1")
  Process.sleep(400)
  Node.disconnect(:"node1@127.0.0.1")
end
iex(node1@127.0.0.1)2> Ovice.trigger
Restarting GenServer
Restarting GenServer
Restarting GenServer

true
iex(node1@127.0.0.1)3> [notice] Application swoosh exited: shutdown
{"Kernel pid terminated",application_controller,"{application_terminated,swoosh,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,swoosh,shutdown})
Crash dump is being written to: erl_crash.dump...done

Libcluster

The final piece of the puzzle is that we use the libcluster library to connect our API instances into one Erlang cluster, more precisely with the Cluster.Strategy.Kubernetes strategy. This strategy leverages Kubernetes endpoints to retrieve the list of live pods and connect to them.
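
For reference, a libcluster topology using this strategy is configured roughly like this (the selector, basename, and namespace values below are illustrative, not our production config):

# config/runtime.exs — illustrative values only
config :libcluster,
  topologies: [
    k8s_api: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        kubernetes_node_basename: "ex-api",
        kubernetes_selector: "app=ex-api",
        kubernetes_namespace: "default",
        polling_interval: 10_000
      ]
    ]
  ]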

Our theory is that the returned endpoints were slightly inconsistent at times, causing nodes to connect to and disconnect from other nodes many times within a short time frame, triggering the Swoosh bug.

This theory was later reinforced by the fact that we experienced crashes while deploying entirely unrelated services to our Kubernetes cluster.

Follow-up steps

Even though we neutralized the problematic GenServer by changing our configuration, we think it’s preferable to eliminate any possibility of triggering this bug by making one of the following changes in Swoosh:

  • Creating a manager GenServer locally that monitors the global GenServer, thereby taking control of the whole restarting/monitoring process (already suggested and made it into v1.8.0); a sketch of this idea follows the list.
  • Having the process off by default and documenting more clearly how to enable it for the development environment. While this solution is ideal, it breaks existing configurations, so it’s a harder change to make.
  • Dropping cluster support for the development environment.
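
A rough sketch of the monitor-based manager idea (hypothetical module names, not the actual Swoosh v1.8.0 implementation): a locally supervised process monitors the global storage process instead of linking to it, so a remote crash only produces a message rather than an exit signal propagating up the supervision tree.

defmodule Local.Storage.Manager do
  use GenServer

  # The globally registered storage process this manager looks after
  @storage Swoosh.Adapters.Local.Storage.Memory

  def start_link(args \\ []), do: GenServer.start_link(__MODULE__, args)

  @impl true
  def init(_args), do: {:ok, monitor_or_start()}

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref}) do
    # The global process died somewhere in the cluster: start or re-resolve it
    # and monitor again. No link means no cascading supervisor restarts.
    {:noreply, monitor_or_start()}
  end

  defp monitor_or_start do
    pid =
      case GenServer.start(@storage, [], name: {:global, @storage}) do
        {:ok, pid} -> pid
        {:error, {:already_started, pid}} -> pid
      end

    %{ref: Process.monitor(pid), pid: pid}
  end
end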
