Actor Supervision
Dealing with failures
In the previous post we looked at routers, their types, and how they can be added to your application via code. In this post we’ll look at supervision, i.e. how a supervisor can handle the failures of its subordinates.
Supervision
Akka works by creating actor hierarchies. The supervisor actor hands tasks to its child actors. Apart from handing out tasks, the supervisor is also responsible for dealing with the failures that may arise among its child actors.
When a failure occurs among the child actors, the supervisor is informed and it has 4 options to handle this:
- Restart — Restart the child actor i.e. kill the current child actor that failed and create a new one in its place.
- Resume — Let the child actor keep its current state and continue processing new messages like nothing happened.
- Stop — Shut down the child actor permanently.
- Escalate — Let the supervisor’s supervisor handle this error.
Remember that if the error is escalated and the super-supervisor decides to handle it by, say, a restart, then the supervisor, the failed actor, and all its sibling actors will be restarted.
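To make the four options concrete, here is a minimal sketch of an Akka classic decider, the partial function a strategy uses to pick a directive for each failure. The mapping from exception types to directives is an assumption chosen purely for illustration, and the `Directives` object is a hypothetical name.

```scala
import akka.actor.SupervisorStrategy.{Decider, Escalate, Restart, Resume, Stop}

object Directives {
  // Hypothetical mapping: which exception leads to which directive
  // is an illustrative choice, not something the series prescribes.
  val decider: Decider = {
    case _: NullPointerException     => Restart  // replace the failed instance, state is lost
    case _: ArithmeticException      => Resume   // keep the state and carry on
    case _: IllegalArgumentException => Stop     // shut the child down permanently
    case _: Exception                => Escalate // hand the failure to the supervisor's supervisor
  }
}
```

Cases are tried from top to bottom, so the catch-all `Exception` case has to come last.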
Supervision strategies
Akka gives you two supervision strategies:
- One-for-one strategy — where you restart / stop / resume only the failed actor. This is what you’d mostly use.
- All-for-one strategy — where you restart / stop / resume all the child actors of the supervisor because their sibling has failed. You’d use this strategy when each child actor performs one step in a chain of computations. For example, actor 1 performs some computation and hands the result to actor 2 and so on. In such a scenario, if one of the actors fails, the other actors will be in an inconsistent state and thus all the children will need to be handled appropriately.
Creating a strategy
Begin by adding methods to the GreetingsActor object that create these two strategies.
maxNrOfRetries determines the number of times a child actor may be restarted. If the limit is exceeded, the child actor is stopped. A negative number means that the child actor may be restarted any number of times.
withinTimeRange determines the window of time within which maxNrOfRetries must not be exceeded. This time window acts as a safeguard against logical problems that may cause the actor to crash as soon as it is restarted, for example when a resource such as a database or a file is unavailable.
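A minimal sketch of what the two factory methods might look like with the Akka classic API. The method names, the retry limit of 10, the one-minute window, and the decider are all assumptions made for illustration; only the companion object is shown, and the actor class itself is omitted.

```scala
import akka.actor.{AllForOneStrategy, OneForOneStrategy, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Escalate, Restart, Resume}
import scala.concurrent.duration._

object GreetingsActor {
  // Hypothetical decider shared by both strategies.
  private val decider: SupervisorStrategy.Decider = {
    case _: NullPointerException => Restart
    case _: ArithmeticException  => Resume
    case _: Exception            => Escalate
  }

  // Applies the directive to the failed child only.
  def oneForOneStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute)(decider)

  // Applies the directive to the failed child and all of its siblings.
  def allForOneStrategy: SupervisorStrategy =
    AllForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute)(decider)
}
```

With these values, a child that needs more than 10 restarts within one minute is stopped instead of being restarted again.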
Next, update the router methods to use the strategy.
Notice that we’ve passed the supervisor strategy to the router. This is because the router will become the supervisor when the routees are created. Similarly, you can update the other router methods.
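As a rough sketch, passing a strategy to a pool router could look like the following. `Worker`, `Routers`, and `roundRobinRouter` are hypothetical stand-ins rather than the names used in this series; `RoundRobinPool` and its `supervisorStrategy` parameter are part of Akka classic routing.

```scala
import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Restart
import akka.routing.RoundRobinPool

// Hypothetical routee that simply echoes whatever it receives.
class Worker extends Actor {
  def receive: Receive = { case msg => sender() ! msg }
}

object Routers {
  // Illustrative strategy: restart a routee that fails with an NPE.
  val strategy: SupervisorStrategy =
    OneForOneStrategy() { case _: NullPointerException => Restart }

  // Props for a round-robin pool of n routees supervised by the router,
  // which applies `strategy` to them.
  def roundRobinRouter(n: Int): Props =
    RoundRobinPool(n, supervisorStrategy = strategy).props(Props(new Worker))
}
```

If no strategy is supplied, a pool router defaults to always escalating failures to its own parent, which is why the strategy is set explicitly here.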
Next, add a case class.
Then, update the code to receive and handle this request.
Notice that we throw a different exception depending upon how the actor was asked to crash. We do this to simulate different exceptions that may arise in the actor.
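The steps above can be sketched roughly as follows. The shape of `CrashRequest` (a single `how` field), the other message types, and the choice of exceptions are assumptions; the actual definitions in this series may differ.

```scala
import akka.actor.Actor

// Hypothetical message types used only for this sketch.
final case class GreetingRequest(name: String)
case object CountRequest
final case class CrashRequest(how: String)

class GreetingsActor extends Actor {
  // State that is lost when the actor is restarted.
  private var count = 0

  def receive: Receive = {
    case GreetingRequest(name) =>
      count += 1
      sender() ! s"Hello, $name"
    case CountRequest =>
      sender() ! count
    // Simulate different failures depending on how we were asked to crash.
    case CrashRequest("npe") =>
      throw new NullPointerException("simulated NPE")
    case CrashRequest("arithmetic") =>
      throw new ArithmeticException("simulated arithmetic failure")
    case CrashRequest(other) =>
      throw new IllegalArgumentException(s"unknown crash type: $other")
  }
}
```

Because each crash type raises a different exception, the supervisor's decider can answer each one with a different directive.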
Next, update the SupervisorActor to pass the CrashRequest to the GreetingsActor.
Now let’s play around with the Main app and see what we get.
We begin by sending a request to each of the actors and printing their counts. Each of these actors has a count of one. Then we cause the first actor to crash with an NPE. Since we restart the actor for an NPE, the state of the actor is lost and it restarts with a count of zero which is then printed in the console.
Notice that since we used a one-for-one strategy to handle the error, only the 1st actor was restarted. Had it been all-for-one, all the actors would have been restarted.
Remember that the router’s supervisor is the SupervisorActor. If the router ever escalates a failure, it is sent to the SupervisorActor, which must be ready to deal with it. Update the SupervisorActor to deal with escalations from its children:
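One way this might look is to override `supervisorStrategy` in the actor. The sketch below is an assumption about the shape of the code, not the exact implementation from this series: the `escalationStrategy` name, the retry limits, and the decider are illustrative, and the message handling that forwards requests to the router is omitted.

```scala
import akka.actor.{Actor, OneForOneStrategy, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Restart, Stop}
import scala.concurrent.duration._

object SupervisorActor {
  // Hypothetical policy for failures escalated by the router.
  val escalationStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
      case _: IllegalArgumentException => Stop    // give up on the router
      case _: Exception                => Restart // restart the router
    }
}

class SupervisorActor extends Actor {
  // Akka consults this strategy whenever a direct child, here the router, fails or escalates.
  override val supervisorStrategy: SupervisorStrategy =
    SupervisorActor.escalationStrategy

  // Forwarding of requests to the router is left out of this sketch.
  def receive: Receive = Actor.emptyBehavior
}
```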
That’s it for a basic introduction to supervision. Play around with different routers and changing the strategies. :)