Self-Healing: How to Keep Your Systems Live

Nuno Diegues · Published in Feedzai Techblog · 5 min read · May 22, 2019

Failures are inevitable in any sufficiently large distributed system. As discussed in our post about high availability, keeping such systems live is extremely hard. Even if your monitoring is perfect and someone starts looking into an issue 5 minutes after it appears, chances are it will take another 20 minutes to come up with a solution, and by then you will have blown your monthly downtime budget for a Service Level Agreement (SLA) of 99.95%.
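
To make that budget concrete, here is a quick back-of-the-envelope calculation, assuming a 30-day month (the exact figure depends on how the billing period is defined):

```python
# Rough downtime budget for a 99.95% availability SLA over a 30-day month.
minutes_per_month = 30 * 24 * 60            # 43,200 minutes in a 30-day month
availability_target = 0.9995                # 99.95% SLA
allowed_downtime = minutes_per_month * (1 - availability_target)
print(f"Allowed downtime: {allowed_downtime:.1f} minutes/month")   # ~21.6 minutes
```

In other words, a single 5 + 20 minute incident already exceeds the monthly allowance.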

This raises the million-dollar question: how can we set up self-healing capabilities so that a faulty system recovers without human intervention (and hopefully faster)?

Doing that is key to coping with strict availability SLAs. In this post, we share our experience at Feedzai, which has enabled us to maintain availability in critical systems even in the face of failures.

Evolving stateful applications: tougher than you think

Building a self-healing application may be harder than you think. While for some applications all it takes is a restart, for others it may not be that simple.

Consider a stateful application whose configuration and state evolve over time. At Feedzai, our real-time fraud detection system is regularly updated with new machine learning models and rules. Imagine this simplified scenario:

  1. Machines A and B are serving real-time traffic
  2. A deployment adds a new machine learning model to the system
  3. The change is successfully rolled out to Machine A
  4. However, it crashes Machine B

How can the system self-heal? Should Machine B be restarted with the latest configuration (the one running on Machine A)? But what if that configuration is incompatible with Machine B and it never comes back up? Then it is perhaps better to roll back and discard the new model. However, rolling back automatically may not be possible, depending on how step 2 is performed.

Different ways to make changes to a system

The key to solving the problem above is to carefully design how you represent the configuration of your application and how changes are applied to it.

You can specify what you want to achieve (a declarative approach). Examples:

  • Put a new model live in production.
  • Rollback to the last healthy state.

Or you may specify what changes you want to make (an imperative approach). Examples:

  • Place the executable binary for the machine learning model in the file system of each application server. Add the model to the metadata of the application. Load it in memory in Machine A. Wait for some time to validate its correctness. Load it in memory in Machine B. Validate final state.
  • Something went wrong in Machine B. Restart it. Restore backup of previous model in the file system. Unload the new model and load the backup. Wait for some time to validate its correctness. Repeat in Machine A. Validate final state.
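
To make the contrast concrete, here is a rough sketch of how the same change might be expressed in each style. The names below (desired_state, imperative_plan, the step identifiers) are purely illustrative and not our actual API:

```python
# Illustrative sketch: the same change expressed declaratively (desired end
# state) vs. imperatively (explicit ordered steps).

# Declarative: describe the goal; the system figures out how to get there.
desired_state = {
    "models": ["fraud-model-v7"],            # the model that should be live
    "replicas": ["machine-a", "machine-b"],  # where it should be running
}

# Imperative: spell out every step, in order, and handle failures yourself.
imperative_plan = [
    ("copy_binary",       {"model": "fraud-model-v7", "targets": ["machine-a", "machine-b"]}),
    ("register_metadata", {"model": "fraud-model-v7"}),
    ("load_in_memory",    {"model": "fraud-model-v7", "target": "machine-a"}),
    ("validate",          {"target": "machine-a"}),
    ("load_in_memory",    {"model": "fraud-model-v7", "target": "machine-b"}),
    ("validate",          {"target": "machine-b"}),
]
```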

Which one do you prefer? Of course we are hiding a lot of details here, but overall declarative approaches are much more amenable to automation, because you trust the system to reach the end goal in the best way it sees fit, rather than controlling every detail of how to get there.

This becomes even more important for self-healing: if you let the system control the procedure, it can roll back and retry on its own in order to seek stability and liveness.

In contrast, imperative approaches make the system dependent on Support/Operations engineers to tell it how to return to a healthy state. With such approaches, it is hard to sustain high levels of availability, since humans inherently slow down recovery procedures.

Although it is a very old debate, the choice between declarative and imperative configuration changes is still at the core of modern systems such as Kubernetes. It is interesting to see these classical problems re-emerge periodically in new systems. In fact, as the technological landscape changes, we keep revisiting the past, and sometimes we end up adopting new trade-offs and decisions!

Feedzai Watchdog architecture

We have instantiated the declarative approach mentioned above in our own Feedzai Watchdog. This component monitors our real-time event processing and ensures:

  • Availability — all replicas are alive
  • Consistency — every live replica is using the same configuration

To enable this, we rely on a ZooKeeper cluster to provide two key building blocks of our architecture (see the sketch right after this list):

  1. Leadership election — among our replicated Watchdogs, only the leader will take effective actions
  2. Group membership — all application replicas belong to a ZooKeeper group, where they also record the configuration they currently have loaded in memory, which provides a strongly consistent view of the overall system
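
The sketch below illustrates, in heavily simplified form, how these two building blocks can be obtained from ZooKeeper. It uses the open-source kazoo Python client for brevity; the paths, identifiers, and payload format are assumptions for illustration rather than our actual setup:

```python
# Minimal sketch of the two ZooKeeper building blocks using the kazoo client.
import json
import socket

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

replica_id = socket.gethostname()

# (2) Group membership -- runs in every *application* replica: register an
# ephemeral znode that also records the configuration version loaded in memory.
zk.ensure_path("/app/members")
zk.create(
    f"/app/members/{replica_id}",
    json.dumps({"config_version": 42}).encode(),   # illustrative payload
    ephemeral=True,
)

# (1) Leadership election -- runs in every *Watchdog* replica: only the elected
# leader performs the periodic checks and takes effective actions.
election = zk.Election("/watchdog/election", identifier=replica_id)

def act_as_leader():
    # Holds leadership for as long as this function runs; run_periodic_checks
    # is sketched further below.
    run_periodic_checks(zk)

election.run(act_as_leader)   # blocks until leadership is acquired
```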

Whichever Watchdog is the leader (1), it performs periodic checks over the application's group membership (2). Based on that view, it determines whether the current state matches the last stable state stored in the metadata database. If enough consecutive checks find a wrong state (so that we tolerate flaky networks or transient hiccups), the Watchdog devises a healing plan to bring the cluster back to a healthy state.
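
A simplified version of that check loop could look like the following, reusing the membership layout from the previous sketch; load_last_stable_state and execute are hypothetical placeholders, and devise_healing_plan is sketched further below:

```python
# Sketch of the leader's periodic check loop.
import json
import time

CHECK_INTERVAL_SECONDS = 30
MAX_CONSECUTIVE_MISMATCHES = 3   # tolerate flaky networks and transient hiccups

def observe_membership(zk):
    """Return {replica_id: config_version} for every live replica in the group."""
    members = {}
    for child in zk.get_children("/app/members"):
        data, _stat = zk.get(f"/app/members/{child}")
        members[child] = json.loads(data)["config_version"]
    return members

def run_periodic_checks(zk):
    mismatches = 0
    while True:
        expected = load_last_stable_state()   # hypothetical: read from the metadata database
        observed = observe_membership(zk)

        if observed == expected:
            mismatches = 0
        else:
            mismatches += 1
            if mismatches >= MAX_CONSECUTIVE_MISMATCHES:
                execute(devise_healing_plan(expected, observed))   # hypothetical executor
                mismatches = 0

        time.sleep(CHECK_INTERVAL_SECONDS)
```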

This is where the declarative approach pays off: thanks to it, the Watchdog is capable of deciding on the individual steps that will fix the cluster, regardless of the state it is currently in (applications down, unresponsive, on wrong versions of the configuration, you name it!).

The power of the declarative approach to configuring our applications rests on the following key traits:

  • Metadata changes are persisted, with strong consistency, in a versioned append log
  • All those changes are individually reversible
  • Users define goals for changes, but it is up to the system to compose the individual metadata changes that result in that goal (e.g., deploying a new model)
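
Heavily simplified, such a versioned, reversible append log could look like this; the Change and MetadataLog structures are illustrative assumptions, not our actual metadata model:

```python
# Sketch of a versioned, append-only metadata log where every change is
# individually reversible.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Change:
    apply_op: str    # e.g. "add_model:fraud-model-v7"
    revert_op: str   # e.g. "remove_model:fraud-model-v7" -- every change is reversible

@dataclass
class MetadataLog:
    changes: list = field(default_factory=list)   # persisted with strong consistency in practice

    def append(self, change: Change) -> int:
        """Record a change and return the new version number."""
        self.changes.append(change)
        return len(self.changes)

    def plan_to_reach(self, current_version: int, target_version: int) -> list:
        """Compose the operations that move a replica between two versions."""
        if target_version >= current_version:
            # roll forward: apply the missing changes in order
            return [c.apply_op for c in self.changes[current_version:target_version]]
        # roll back: revert the extra changes in reverse order
        return [c.revert_op for c in reversed(self.changes[target_version:current_version])]
```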

Based on that, the Watchdog composes healing actions that ask the applications to move their state to the desired version of the metadata append log. Should any application replica be unavailable (from the perspective of the ZooKeeper membership group), the Watchdog starts a new replica, and kills any lingering zombie process of the replica being replaced (which may be unresponsive at that moment) to prevent future problems.
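
Continuing the sketch above, the healing plan could be composed along these lines; the action names, the start_replica/kill_zombie semantics, and the metadata_log instance (a MetadataLog like the one sketched earlier) are all illustrative assumptions:

```python
# Sketch of composing healing actions from the desired state and the observed membership.
def devise_healing_plan(expected, observed):
    """expected/observed both map replica ids to metadata versions."""
    actions = []
    for replica, desired_version in expected.items():
        if replica not in observed:
            # Replica missing from the ZooKeeper group: clean up any lingering,
            # unresponsive zombie process and start a fresh replica.
            actions.append(("kill_zombie", replica))
            actions.append(("start_replica", replica, desired_version))
        elif observed[replica] != desired_version:
            # Replica alive but on the wrong version: ask it to move to the
            # desired version of the metadata append log.
            steps = metadata_log.plan_to_reach(observed[replica], desired_version)
            actions.append(("apply_changes", replica, steps))
    return actions
```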

The fact that the Watchdog runs periodically provides stronger guarantees: as long as some Watchdog replica is still running, there will be a Watchdog leader, and we know the cluster will eventually be healed into a healthy, stable state. It also allows us to tolerate fluctuations in the ZooKeeper membership, assuming that ZooKeeper itself converges to a consistent view, which is exactly what ZooKeeper is designed to do by solving the Consensus problem.

If this sort of challenge is what drives your imagination, if keeping systems alive with demanding SLAs is what you love to do, or if you just want to learn more about all of this, then be sure to follow up with us, because we have a lot more to tell!
