Self-healing network — cutting-edge IoT

gal brandwine
Machines talk, we tech.
4 min read · May 24, 2023

At Augury, our customers are our heroes; it is known.
We do our best to provide value.

Our installation teams strive to efficiently mount our battery-powered endpoints (a.k.a. EPs) onto machines and install our gateways (a.k.a. Nodes) on walls, all while upholding exceptional standards and thoroughly testing everything before leaving the site.

Our customer success team works tirelessly to instill confidence in our product and ensure customers successfully start using our tools while trusting our insights.

Our support teams strive to close all open tickets efficiently and minimize the time issues stay open, making sure everything comes back online and ticks as it should.

It sounds like a well-oiled machine ready for the battle of scale, right?

Our value to customers is maximized when data from all our EPs reaches all the way up to our cloud; there, our AI kicks in.

Our customers' machines come in many shapes and are installed in various environments, exposing our EPs to some of the harshest conditions electrical parts can withstand.

The harshest environments. From direct sun, desert heat, and boiling oil drips to cement splashing, our EPs have seen it all.

Sometimes, issues arise.

  • Customers' maintenance teams might damage our Nodes, unmount them, or move them elsewhere (after all, they are just doing their jobs: maintaining, replacing, and fixing the very machines we flagged as needing a fix)
  • A Node might have unstable internet connectivity
  • An EP battery might’ve run out (Yeah, even our battery runs out after many years)
  • We can even experience bad reception between an EP and its Node

When scaling up, it becomes impossible to fix all these problems manually, regardless of our team's willingness.

Even with automatic alerts, emails, Slack bots, and monitoring systems, there are simply too many issues to address by hand.

A regular cloud-based system can easily fix this kind of problem by adding servers, pods, or instances and balancing the load so there is no single point of failure.

We are industry 4.0 pioneers. We have no such luxury.
“Mounting another Node on the wall” just to improve the EPs' coverage is a complex, expensive, and time-consuming procedure that eventually inflates time to value for our customers. No one in the industry has really solved it…

Here is our innovative self-healing network

We recently released the Self-healing network service. It looks for ailing machines with unstable connectivity to their Nodes and automatically remaps them to a better Node, bringing them back online.

We implemented it elegantly using a feature our Nodes have had for many years: a nearby-endpoints scan, used mainly for research and for manual pinpointing operations.

Each day, each Node reports the EPs around it:

Each Node reports the EPs surrounding it, regardless of which ones it communicates with.

We’ve added our Self-healing service as a subscriber:

Our service subscribes to those raw events and maintains a collection of <EP, Node> pairs, representing the communication between the two.
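
To make the idea of the pair collection concrete, here is a minimal sketch in Go. It is not our production code: the `ScanEvent` shape, the `Store` type, and the method names are all assumptions made up for illustration, and the real events carry more than a list of IDs.

```go
package main

import (
	"fmt"
	"sort"
)

// ScanEvent is a hypothetical shape for one Node's scan report.
type ScanEvent struct {
	NodeID string
	EPIDs  []string // every EP the Node heard during the scan
}

// Pair is one <EP, Node> observation.
type Pair struct {
	EP, Node string
}

// Store keeps the set of observed <EP, Node> pairs (illustrative only).
type Store struct {
	pairs map[Pair]struct{}
}

func NewStore() *Store {
	return &Store{pairs: make(map[Pair]struct{})}
}

// Ingest records one pair per EP in the event; repeated reports collapse
// into a single entry because the pairs live in a set.
func (s *Store) Ingest(ev ScanEvent) {
	for _, ep := range ev.EPIDs {
		s.pairs[Pair{EP: ep, Node: ev.NodeID}] = struct{}{}
	}
}

// NodesSeeing returns the sorted IDs of every Node that reported the EP.
func (s *Store) NodesSeeing(ep string) []string {
	var nodes []string
	for p := range s.pairs {
		if p.EP == ep {
			nodes = append(nodes, p.Node)
		}
	}
	sort.Strings(nodes)
	return nodes
}

func main() {
	s := NewStore()
	s.Ingest(ScanEvent{NodeID: "node-A", EPIDs: []string{"ep-1", "ep-2"}})
	s.Ingest(ScanEvent{NodeID: "node-B", EPIDs: []string{"ep-2", "ep-3"}})
	fmt.Println(s.NodesSeeing("ep-2")) // prints [node-A node-B]
}
```

Keeping the pairs in a set means a Node that reports the same EP day after day does not grow the collection; a real service would likely also track when each pair was last seen.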

Once the optimization algorithm is triggered, it finds a better Node for unstable machines.
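
The details of the optimization are beyond this post, but the core idea can be sketched as a simple coverage count: for each candidate Node, count how many of the machine's EPs it sees, and remap only if some Node sees strictly more than the current one. The function name `BestNode`, the `visible` map, and the tie-breaking rule below are assumptions for illustration, not the algorithm we actually run.

```go
package main

import "fmt"

// BestNode returns the Node that covers the most of a machine's EPs,
// together with the number of EPs it covers.
// visible maps a Node ID to the set of EP IDs that Node reported.
// The current Node is kept unless another Node covers strictly more EPs;
// ties between other candidates go to the smaller ID so the result is
// deterministic. (Hypothetical helper, not the production algorithm.)
func BestNode(machineEPs []string, current string, visible map[string]map[string]bool) (string, int) {
	coverage := func(node string) int {
		n := 0
		for _, ep := range machineEPs {
			if visible[node][ep] { // a missing Node yields a nil map, which reads as false
				n++
			}
		}
		return n
	}

	best, bestCov := current, coverage(current)
	for node := range visible {
		c := coverage(node)
		if c > bestCov || (c == bestCov && best != current && node < best) {
			best, bestCov = node, c
		}
	}
	return best, bestCov
}

func main() {
	eps := []string{"ep-1", "ep-2", "ep-3", "ep-4"}
	visible := map[string]map[string]bool{
		"node-A": {"ep-1": true, "ep-2": true, "ep-3": true},
		"node-B": {"ep-1": true, "ep-2": true, "ep-3": true, "ep-4": true},
	}
	node, cov := BestNode(eps, "node-A", visible)
	fmt.Printf("%s covers %d/%d EPs\n", node, cov, len(eps)) // prints node-B covers 4/4 EPs
}
```

Preferring the current Node on ties avoids remapping a machine back and forth between two equally good Nodes.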

In a nutshell

The Self-healing network simply, and automatically, changes from this:

A machine mapped, for some reason, to a Node that sees only 3/4 of its EPs

To this:

A machine mapped to a Node that fully covers all its EPs (sees 4/4 EPs)

Automatically optimizing our value to customers.

Swift impact on production

Using “Serviceability”, our way of measuring how fruitful the data coming from a customer's machines is, we can see the changes since enabling the service.

A customer site’s Serviceability over the past year (representing 41 machines). Higher is better.

Once enabled, the Self-healing network service quickly restores our customers' machines' “Serviceability”.

Sum up

I’ve learned so much while implementing the Self-healing network service, from the backend perspective to Golang best practices, while practicing TDD, playing with math, and doing what I love the most: writing algorithms and implementing innovative ideas.

Deploying the Self-healing network service improved Augury’s value to customers, reduced the load on our Support teams, and opened the door to future improvements. Fantastic work, done by only three people.

Cheers 🍻
Gal
