Automatic Validator Failover with One Microservice

for Moonbeam, Moonriver, and other Substrate-based Validators/Collators (Polkadot, Kusama, etc.)

A validator’s job is to get the node up and keep it running.

If not forever, then at least 100% of the time.

In this article, we will set up a microservice that runs every X minutes, checks our Validator nodes, and performs a failover if something is wrong.

Our program works for the Moonriver and Moonbeam networks. You can make it work for other Substrate-based networks (Polkadot, Kusama, etc.) with little effort.

All the code is available on GitHub.

Note that this solution will store your Proxy secret key online. If you do not want to store any keys online, then you will have to implement a more advanced solution.

Get the Block Status of our Nodes

Our failover system needs to watch the blocks of our active Validator/Collator and the backup nodes. There are a few ways to do this:

* Subscribe to chain block events and initiate failover if our node hasn’t signed a block in the last X minutes

* Connect to Prometheus and watch the last imported block

* Create Grafana alerts

* Connect to telemetry and watch the last imported block

We have chosen the Telemetry approach.
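To give a feel for what the service consumes, here is a minimal TypeScript sketch (using the ws package) of connecting to a telemetry feed and logging its messages. The feed URL and genesis hash are placeholders, and the subscribe command and message layout follow the Substrate telemetry repo at the time of writing, so treat this as illustrative rather than the service’s actual code:

import WebSocket, { RawData } from "ws";

// Placeholder URL — point this at your private telemetry’s /feed/ endpoint
const FEED_URL = "ws://X.X.X.X:8000/feed/";

const socket = new WebSocket(FEED_URL);

socket.on("open", () => {
  // The feed streams one chain at a time; subscribe with the chain’s genesis hash
  socket.send("subscribe:0x...your-chain-genesis-hash...");
});

socket.on("message", (raw: RawData) => {
  // Each frame is a JSON array of alternating action codes and payloads;
  // block-related actions carry each node’s last imported block height
  const feed = JSON.parse(raw.toString());
  console.log(feed);
});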

You should set up your own private telemetry on a private server. Our focus is on reliability rather than performance, so we recommend a cloud-based virtual server with highly redundant storage and 4 GB of RAM.

Follow the directions on the Substrate Telemetry GitHub repo to get your private telemetry running. If you want, you can limit the allowed IPs to those of your nodes and your home/office.

There is nothing to stop you from using public telemetry, but you are likely to run into problems. The public telemetry connects to thousands of servers (not just Moonriver) and it has been down on a number of occasions due to heavy load. Sometimes, nodes disappear from the telemetry due to max count limits (default is 500). Finally, the format of the telemetry WebSocket feed may change without notice, breaking your failover logic.

You can configure your nodes to connect to more than one telemetry server if the network requires you to connect to its public telemetry.
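For example, the Moonbeam/Moonriver client typically lets you repeat the --telemetry-url flag to report to both endpoints. The trailing 0 is the verbosity level, and the private submit URL below assumes the telemetry repo’s default shard port (8001); check your node’s --help output, as flags can change between releases.

moonbeam --telemetry-url 'wss://telemetry.polkadot.io/submit/ 0' --telemetry-url 'ws://X.X.X.X:8001/submit/ 0' ...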

Clone, Edit and Compile the Failover Service

We have written a microservice that is ready to be deployed as an AWS Lambda function. The function does not use any other AWS services, so you can easily adapt it to work with other cloud providers.

Clone the Moonriver failover micro repo. Make sure you have NodeJS 14 and npm installed. Go inside the cloned repo and run

npm install
npm run build

The commands will download all required libraries, compile and zip the program inside the dist folder. That’s it! You are ready to deploy the service!

Wait… what?

I wouldn’t trust my own repo for something as sensitive as my Proxy key! Before deploying the code, you need to go through it and make sure it does what it says it does.

You will need to inspect all files under src/. The business logic is inside src/index.ts. The src/telemetry directory has code that is (mostly) copied from the Polkadot telemetry GitHub repo. You should also go through package.json and make sure you trust all the dependencies. Finally, check webpack.config.js for any funky middleware.

Deploy and Configure

It’s time to deploy our program to the cloud. You will need an AWS account for that. Choose a region in the same continent as your nodes.

In your AWS account, go to Lambda, and create a new function with the default settings (NodeJS 14, x86_64, create new role).

Click on Upload from and select .zip file. Upload the zip file from the dist folder inside the cloned repo.

Click on Configuration -> Environment Variables -> Edit. Click on Add environment variable. You will need to add the following variables.

TELEMETRY_URL

This should look like ws://X.X.X.X:8000/feed/, where X.X.X.X is your private telemetry server IP. Note that the WebSocket is not secure (ws instead of wss). If you want to use the public telemetry, you can leave this empty, as it defaults to wss://telemetry.polkadot.io/feed/.

TESTING_MODE

While testing, set this to true to make sure that the service does not actually execute the failover extrinsic.
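As a rough illustration of how such a flag typically gates the extrinsic (the function names here are hypothetical, not necessarily those used in the repo):

// Hypothetical sketch of how TESTING_MODE can gate the extrinsic; names are illustrative
async function maybeFailover(submitUpdateAssociation: () => Promise<void>): Promise<void> {
  if (process.env.TESTING_MODE === "true") {
    console.log("TESTING_MODE on: skipping updateAssociation extrinsic");
    return;
  }
  await submitUpdateAssociation(); // signs and sends the real extrinsic
}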

BLOCK_LAG_THRESHOLD

Your service will trigger a failover if your active node is BLOCK_LAG_THRESHOLD blocks behind the chain’s current height. The default is 20.

PROXY_SECRET_KEY

This is your proxy’s secret key, starting with 0x. Consider storing this key in a different place (other than the Env variable). For example, you can store it in AWS Secrets Manager, or in DynamoDB.
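For example, here is a minimal sketch of reading the key from AWS Secrets Manager with the v3 SDK; the region and secret name are placeholders you would choose yourself:

import { SecretsManagerClient, GetSecretValueCommand } from "@aws-sdk/client-secrets-manager";

// Placeholder region and secret name
const client = new SecretsManagerClient({ region: "eu-central-1" });

async function getProxyKey(): Promise<string> {
  const res = await client.send(
    new GetSecretValueCommand({ SecretId: "moonriver/proxy-secret-key" })
  );
  if (!res.SecretString) throw new Error("proxy key secret is empty");
  return res.SecretString; // the 0x-prefixed secret key
}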

NODE_NETWORK_IDS

Network IDs are strings that appear on telemetry, identifying the node. If you don’t see them on your telemetry, click on the Settings button on the top right, and activate Network ID.

Each node has its own network ID. You need to make a comma-separated string of all network IDs, including the active and backup nodes. Enter the IDs in order of decreasing priority. If the service decides to perform a failover, it will choose the highest priority (first) healthy backup node.

If you don’t include a network ID of a node, the service won’t see it.

If you include the network ID of a backup node that is down, the service will run notify(), which is intended to alert you. Note that it is up to you to implement the notify method to suit your requirements.
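Since you are on AWS anyway, one easy option is publishing to an SNS topic. A hedged sketch follows; the topic ARN and region are placeholders, and the signature may differ from the repo’s stub:

import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({ region: "eu-central-1" }); // placeholder region

// Hypothetical implementation of the notify() stub
async function notify(message: string): Promise<void> {
  await sns.send(
    new PublishCommand({
      TopicArn: "arn:aws:sns:eu-central-1:123456789012:failover-alerts", // placeholder
      Subject: "Failover service alert",
      Message: message,
    })
  );
}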

SESSION_KEYS

Finally, you need to enter the session keys of the active and backup nodes. Each session key should start with 0x. Enter the session keys as a comma-separated list, in the same order as the node network IDs.
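Because the two lists must line up by position, the service can pair them by index. A sketch of the idea (variable names are illustrative, not necessarily the repo’s):

// Illustrative parsing — names may differ from the repo’s actual code
const networkIDs = (process.env.NODE_NETWORK_IDS ?? "").split(",").map((s) => s.trim());
const sessionKeys = (process.env.SESSION_KEYS ?? "").split(",").map((s) => s.trim());

if (networkIDs.length !== sessionKeys.length) {
  throw new Error("NODE_NETWORK_IDS and SESSION_KEYS must have the same length");
}

// nodes[0] is the highest-priority node; order encodes failover priority
const nodes = networkIDs.map((id, i) => ({ networkID: id, sessionKey: sessionKeys[i] }));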

Testing

The easiest way to test this is to mock a node failure by replacing

if (!node.block || (node.block < chainBlockHeight - blockLagThreshold))

with

if (!node.block || (node.block < chainBlockHeight - blockLagThreshold) || networkID == 'TEST-NETWORK-ID')

Make sure the TESTING_MODE env variable is set to true if you don’t want to execute the updateAssociation extrinsic. Go over to the CloudWatch logs and see if the function did what it should have done.

If everything looks OK, you can then let the mocked failure trigger a real failover by deleting the TESTING_MODE env variable.

Launch

Good job! Your setup is production-ready. The last step is to add a trigger that will run your function every X minutes.

Go back to your Lambda function and click on Add trigger. Select EventBridge (CloudWatch Events). Create a new rule. Enter the following in Schedule expression, replacing the number 10 with your desired frequency.

rate(10 minutes)

Your function will run for the first time as soon as you click the Add button, and it will keep running every X minutes.

Downtime Considerations

This failover setup is as basic as it gets, and it will put you ahead of most Validators. However, running a microservice every X minutes could result in your node being down for up to X minutes. If you cannot afford this downtime, then you need to adapt this program to run constantly as a service, as sketched below.
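A minimal sketch of that adaptation, assuming the Lambda entry point is exported as handler from the compiled bundle (the import path and interval are placeholders):

// Hypothetical wrapper to run the check continuously instead of on a schedule
import { handler } from "./index"; // placeholder path to the compiled handler

const INTERVAL_MS = 60_000; // check every minute instead of every X minutes

async function main(): Promise<void> {
  for (;;) {
    try {
      await handler();
    } catch (err) {
      console.error("failover check failed", err);
    }
    await new Promise((resolve) => setTimeout(resolve, INTERVAL_MS));
  }
}

main();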

Congrats on automating your failover. And good luck if you didn’t!
