Our path to true self-healing systems is not settling for fault-tolerance

Jul 1, 2022

What is a self-healing system?

A true self-healing system (both application and infrastructure) does more than maintain a desired state. It does two things:

1. Possible today: it observes its own weaknesses and compensates. The system knows its entire state of health (not just one part) and makes decisions to keep all parts working as well as possible. When the system compensates, it does so according to its vulnerable state until it is fully stable. The infrastructure acts with knowledge of the application. The application acts with knowledge of the infrastructure. Both act with knowledge of other connected systems.

2. Future: it learns from its environment's input and adjusts with intelligence, not necessarily muscle. Using muscle, the system just scales to compensate. Adjusting with intelligence still requires notifying humans of incidents and providing data dumps of incident activity for human analysis, but it also requires behavior pattern recognition for potential protection. This is like noticing a bad driver on the road and either slowing down or taking a different route to avoid them, or bracing for impact to lower the anticipated damage if they cannot be avoided.

The article "Self-Healing Systems" on technologyconversations.com is 7 years old, but the concepts it presents are still valid.

What is our path to true self-healing?

The future of self-healing lies in the evolution of Observability. We will move beyond Metrics, Logs, and Traces. Events won’t come merely from logs and state monitoring. We will combine a system’s reactions to multiple patterns of input to generate behavior-driven events. Events will include information from SIEM (Security Information and Event Management) systems as Security Events. Artificial Intelligence will analyze known threat actor behaviors, gathered live from a cooperative collective of this information, alongside known vulnerabilities from static information. This will give our systems knowledge of identified friendly users, bot users and semantic agents, and threat actors before, or at the same time as, the first packet reaches the outskirts of the holistic system of systems.

This is a complicated task in today’s system of systems, which merely passes traffic from one gate to another, from one vendor to another, to a business system, to the infrastructure. We scale our systems using third-party systems of systems. We rely on third-party systems of systems for a measure of protection. The intelligence we collect on who the source is and what type of traffic it is sending is often redundantly collected and redundantly identified (or worse, identified in conflicting ways). As a result, even identified traffic behavior patterns are not collectively shared, and are often kept within silos as protected intellectual property. While this is understandable given today’s business survival practices, it hampers the future of true self-healing systems.

In my ideal, our system of systems will leverage a ribbon of such information that travels along with the packets of incoming traffic. Each stream of packets or input from a single source will be isolated, and a ribbon of observed information will be available alongside it to each system component. This understanding takes the form of a simple, understandable marshaled format that allows each component to have a quick reflex and to add its own expert understanding to the ribbon.
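To make the ribbon idea a little more concrete, here is a minimal sketch of what such a marshaled format could look like. Everything in it is an assumption for illustration: the `Ribbon` and `Observation` names, the fields, the component names, and the choice of JSON as the wire format are hypothetical, not an existing standard or product.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Observation:
    """One component's contribution to the ribbon (hypothetical schema)."""
    component: str       # which component made the observation
    classification: str  # e.g. "friendly", "bot", "threat"
    confidence: float    # 0.0 to 1.0, how sure the component is


@dataclass
class Ribbon:
    """Observed information that travels with one source's traffic stream."""
    source_id: str
    observations: List[Observation] = field(default_factory=list)

    def annotate(self, component: str, classification: str, confidence: float) -> None:
        # Each component appends its own expert understanding.
        self.observations.append(Observation(component, classification, confidence))

    def marshal(self) -> str:
        # JSON is an assumed wire format; any simple, shared format would do.
        return json.dumps(asdict(self))

    @classmethod
    def unmarshal(cls, payload: str) -> "Ribbon":
        data = json.loads(payload)
        return cls(data["source_id"], [Observation(**o) for o in data["observations"]])

    def threat_score(self) -> float:
        # Highest confidence any component assigned to the "threat" class.
        return max(
            (o.confidence for o in self.observations if o.classification == "threat"),
            default=0.0,
        )


if __name__ == "__main__":
    # An edge component starts the ribbon and passes it downstream.
    ribbon = Ribbon("203.0.113.7")
    ribbon.annotate("edge-waf", "bot", 0.6)
    wire = ribbon.marshal()

    # A downstream component reads it, reacts, and adds what it knows.
    downstream = Ribbon.unmarshal(wire)
    downstream.annotate("siem", "threat", 0.9)
    print(downstream.marshal())
    print(downstream.threat_score())
```

The point of the sketch is that a component never has to rediscover what an upstream component already learned: it reads the ribbon, reacts, and appends.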

I believe we are on a journey toward this state of intelligence. Today we collect the information and observe, but how do we build our systems to react to such a collected state of information? Today’s systems monitor each other based on their domain of intelligence, collect what they find and, if rules are not already defined, report it to humans to react. If rules are defined, do they apply muscle (scale and protect) or intelligence (divert, isolate, and respond)? The next step is a platform that can orchestrate all information and share it within the system of systems, putting actionable workflows at our fingertips. The goal of this journey, as I envision it, is a system that can make its own rules: an artificial, proactive reflex to oncoming incidents. That will become possible when a ribbon of expert understanding can be built as it flows between decoupled but adjoined systems. This will enable the holistic system of systems to exhibit the properties necessary to act as a true self-healing system.
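The difference between a rule that applies muscle and one that applies intelligence can be sketched in a few lines. This is only an illustration under stated assumptions: the `choose_reflex` function, the `Action` names, and both thresholds are hypothetical, and a real system would derive its inputs from shared observations rather than from two bare numbers.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"      # nothing unusual, handle normally
    SCALE = "scale"      # muscle: add capacity to absorb the load
    DIVERT = "divert"    # intelligence: route suspicious traffic elsewhere
    ISOLATE = "isolate"  # intelligence: cut the source off from the system


# Hypothetical thresholds, chosen only for the example.
ISOLATE_AT = 0.8
DIVERT_AT = 0.5
SCALE_AT = 0.75


def choose_reflex(threat_confidence: float, load: float) -> Action:
    """Pick a reaction from what is known about the source and the system.

    threat_confidence: 0.0 to 1.0, how sure we are the source is hostile.
    load: 0.0 to 1.0, current utilization of the component.
    """
    # Intelligence first: knowledge about the source outranks raw capacity.
    if threat_confidence >= ISOLATE_AT:
        return Action.ISOLATE
    if threat_confidence >= DIVERT_AT:
        return Action.DIVERT
    # Muscle only when the traffic looks legitimate but heavy.
    if load > SCALE_AT:
        return Action.SCALE
    return Action.SERVE


if __name__ == "__main__":
    print(choose_reflex(threat_confidence=0.9, load=0.9))
    print(choose_reflex(threat_confidence=0.1, load=0.9))
```

A muscle-only system would scale in both cases shown in the usage lines. The ordering of the checks is the design choice: scaling up to serve a known threat actor is exactly the reaction the intelligent rule avoids.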

Written by Russell Glaue