Nightmare on Actor Subtree Shutdown
Back when we were young, pretty, and very new to both Scala and Akka, we somehow managed to cause a NullPointerException (NPE) in what should have been a simple Actor shutdown use case. Not one of our finest moments. While investigating the issue we got to know a bit more about the Akka infrastructure. This is the story of how we got there, and how we got out of there.
The setup is simple. We have a trait called SuicideActor and, as its name implies, it’s an actor that runs a block of code and then kills itself. One of our SuicideActors creates another SuicideActor as its child.
The NPE was thrown as the result of a bad combination of Actors and Futures: the SuicideActor’s block was wrapped in a Future, inside of which we called context stop self.
That led to a situation where the Future created by the child Actor was still running after the child was already dead. It died of not-so-natural causes: the similar Future created by its parent had already completed and called context stop self, and child Actors are stopped when their parent is stopped. This meant that the child’s context was null by the time the child’s Future tried to use it. The first rule of Actors and Futures: never touch an Actor’s internal state from inside a Future.
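One common way to respect that rule is to let the Future only report back with a message, and to do the stopping inside receive, on the Actor’s own thread. The following is a minimal sketch of that idea, assuming classic Akka actors; SafeSuicide and BlockDone are hypothetical names used for illustration, not the code from our system.

```scala
import akka.actor.{Actor, Status}
import akka.pattern.pipe
import scala.concurrent.Future

// Hypothetical message the Future sends back once the block is done.
case object BlockDone

// Hypothetical actor; `block` stands in for the SuicideActor's work.
class SafeSuicide(block: () => Unit) extends Actor {
  import context.dispatcher // ExecutionContext for the Future and pipeTo

  // `self` is read here, on the actor's thread, not inside the Future.
  override def preStart(): Unit =
    Future { block(); BlockDone }.pipeTo(self)

  def receive: Receive = {
    // Handled as a normal message, so `context` is safe to use here.
    // A failed Future arrives as Status.Failure and also ends the actor.
    case BlockDone | _: Status.Failure => context.stop(self)
  }
}
```

If the Actor has already been stopped by its parent when the Future completes, the BlockDone message simply ends up in dead letters instead of touching a dead context.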
First Attempt: PoisonPill
Our first attempt to solve this issue was to use an alternative to context stop self: sending a PoisonPill. The PoisonPill is considered more graceful than context stop self, because the PoisonPill is just another message added to the Actor’s queue, so all preceding messages are guaranteed to be processed before the Actor is shut down.
This solution did solve the NPE (yay!). Alas, sometimes the child Actor did not run its block (d’oh!). Not the side-effect we were expecting…
After digging a bit deeper into the Akka code, we learned that when an Actor receives a PoisonPill message, it actually calls self.stop(). So while the parent gets to process all of its pending messages, its children might not: they are brutally murdered with context stop.
We couldn’t override the behavior of the PoisonPill in the receive method because PoisonPill is an AutoReceivedMessage, and an AutoReceivedMessage doesn’t get to the Actor’s receive method, but is rather processed by an autoReceiveMessage(msg: Envelope) method implemented in Akka’s internal ActorCell.
Second Attempt: Custom Messages to Children
After almost taking a poison pill ourselves, we realized this “graceful cascading shutdown” requires some custom implementation. The new mechanism we came up with included 2 new messages:
- case class PleaseKillYourself(): sent from parent to children, instructing them to kill themselves
- case class IKilledMyself(): sent from a child to its parent, letting it know it killed itself
The flow here would be:
- Parent sends PleaseKillYourself to children
- Children kill themselves gracefully (with this same flow) and reply with an IKilledMyself message
- Parent kills itself only once all children replied with IKilledMyself
(You can only imagine how lovely the office conversations about this sounded).
The challenge here is to implement this once in a trait that can be easily extended by any Actor: we didn’t want every Actor to implement the handling of these 2 messages in its receive method. So we needed to find the right hook to process those messages outside of the receive method implemented by each Actor. We decided to override the unhandled method of the Akka Actor. By default, unhandled will publish an UnhandledMessage event to the ActorSystem’s event stream for every message it captures (except for the Terminated message). The event eventually triggers a push to the dead letter queue.
In our implementation, the unhandled method handles the PleaseKillYourself message before falling back to the default implementation. Once an Actor receives a PleaseKillYourself message, it sends a PleaseKillYourself message to all of its children, and then waits for the IKilledMyself messages using the become pattern. The Actor thus ignores any other message from that point on (as it should!).
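A rough sketch of such a trait follows, assuming classic Akka actors. The trait name GracefulSuicide and the helpers waitingFor and die are hypothetical; only the two message names come from the description above. It also assumes that every child mixes in the same trait and that no concrete receive swallows PleaseKillYourself with a catch-all case.

```scala
import akka.actor.{Actor, ActorRef}

case class PleaseKillYourself()
case class IKilledMyself()

// Sketch of the mixin: actors extend it and keep their own `receive`.
trait GracefulSuicide extends Actor {

  // Only sees messages that the concrete `receive` did not match.
  override def unhandled(message: Any): Unit = message match {
    case PleaseKillYourself() =>
      val children = context.children.toSet
      if (children.isEmpty) die()
      else {
        children.foreach(_ ! PleaseKillYourself())
        context.become(waitingFor(children))
      }
    case other => super.unhandled(other) // default Akka behavior
  }

  // Replaces the actor's behavior: only confirmations matter now.
  private def waitingFor(pending: Set[ActorRef]): Receive = {
    case IKilledMyself() =>
      val left = pending - sender()
      if (left.isEmpty) die() else context.become(waitingFor(left))
    case _ => // shutting down: drop everything else
  }

  // Tell the parent first, then stop.
  private def die(): Unit = {
    context.parent ! IKilledMyself()
    context.stop(self)
  }
}
```

In this sketch a child that never replies would leave its parent waiting forever, which hints at why the approach felt fragile.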
This solution worked, but it was messy: overriding base methods in your infrastructure might expose you to bugs and failures that the authors of the library didn’t expect.
Third Attempt: The Reaper Pattern
After a lot of research, we finally discovered the Reaper pattern. The Reaper pattern solves the problem of shutting down the ActorSystem only once all Actors have finished handling their messages (AKA graceful shutdown for the ActorSystem). In this pattern we create a Reaper directly under the user guardian and, as its name implies, it “reaps” the souls of other Actors by watching them. Once the reaping is finished, it signals the user guardian that it’s safe to shut down.
In our solution, we decided to treat every subtree of Actors (with a SuicideActor at its root) as a mini “ActorSystem”. Once the SuicideActor ends its block, it creates a dedicated Reaper under the ActorSystem. This Reaper watches over all the SuicideActor’s children, and once all of them are dead, it sends the SuicideActor a PoisonPill and kills itself.
The new SuicideActor code is now pretty straightforward:
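As an illustration only (not our production code), a SuicideActor along these lines could look like the sketch below. It assumes classic Akka actors and that the block still runs in a Future. BlockFinished and the reaperProps hook are hypothetical names introduced for this example.

```scala
import akka.actor.{Actor, ActorRef, Props, Status}
import akka.pattern.pipe
import scala.concurrent.Future

// Hypothetical message: the block has finished running.
case object BlockFinished

// Sketch: runs `block`, then hands its children over to a reaper
// instead of stopping itself directly.
trait SuicideActor extends Actor {
  import context.dispatcher // ExecutionContext for the Future and pipeTo

  // The work to run; assumed not to touch actor state.
  def block(): Unit

  // Hypothetical hook: Props of the reaper for this actor and its children.
  def reaperProps(root: ActorRef, souls: Iterable[ActorRef]): Props

  override def preStart(): Unit =
    Future { block(); BlockFinished }.pipeTo(self)

  def receive: Receive = {
    case BlockFinished | _: Status.Failure =>
      // Created under the ActorSystem, not as a child, so the reaper
      // outlives this actor's subtree.
      context.system.actorOf(reaperProps(self, context.children))
  }
}
```

The actor never stops itself here. It waits for the PoisonPill that the reaper sends once all children are gone.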
The Reaper implementation is slightly more involved:
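A minimal sketch of a Reaper of this kind, assuming classic Akka actors, might look as follows. It is an illustration under those assumptions rather than our exact implementation; the constructor parameters root and souls and the finish helper are hypothetical names.

```scala
import akka.actor.{Actor, ActorRef, PoisonPill, Props, Terminated}

// Sketch: watches `souls` and, once all are dead, poisons `root`.
class Reaper(root: ActorRef, souls: Set[ActorRef]) extends Actor {
  private var remaining = souls

  override def preStart(): Unit =
    if (remaining.isEmpty) finish()
    // Watching an already-dead actor still yields a Terminated message.
    else remaining.foreach(ref => context.watch(ref))

  def receive: Receive = {
    case Terminated(ref) =>
      remaining -= ref
      if (remaining.isEmpty) finish()
  }

  // All souls reaped: release the root, then go away as well.
  private def finish(): Unit = {
    root ! PoisonPill
    context.stop(self)
  }
}

object Reaper {
  def props(root: ActorRef, souls: Iterable[ActorRef]): Props =
    Props(new Reaper(root, souls.toSet))
}
```

This sketch only watches the children. It relies on each child stopping on its own, for example by being a SuicideActor itself.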
A nice side-effect of this design is that the SuicideActor scope is now split in half: the SuicideActor is now responsible only for running the block, while the Reaper does the heavy lifting of handling the subtree of children.
A Happy Ending
With the Reaper pattern, we finally achieved graceful shutdown without abusing the Akka infrastructure. It’s also easy to test: both Reaper and SuicideActor can be unit-tested like any other Actor, while the entire flow can be validated with an integration test.
To this day, Reapers live (and kill) happily in our production systems.