Moonbeam Node Snapshots — The CertHum Approach — Part I

CertHum · Sep 2, 2022

CertHum’s Backup Design for Moonbeam Networks — High Level

Introduction

There are three public networks currently running on Moonbeam code and supported by the Moonbeam Foundation:

  • Moonbase Alpha — the test network that is used for development and testing purposes (considered production by collators and other providers that support Moonbase Alpha infrastructure).
  • Moonriver — the EVM-compatible canary network for Moonbeam, supporting dApps built with cross-chain connected contracts; it runs as a parachain on the Kusama network. You can find more about the Moonriver network here.
  • Moonbeam — the EVM-compatible Moonbeam network, supporting dApps built with cross-chain connected contracts; it runs as a parachain on the Polkadot network. You can find more about the Moonbeam network here.

There are many reasons to run a node on any of the above networks. Doing so means keeping a full, continuously updated copy of the parachain database (every block, including its transactions, since genesis) and at least a partial copy of the full blocks in the associated relay chain database. Whatever your reason, if you are running a node you should assess the impact of that node failing. For those running apps that query a node, the impact could be a loss of service to the app's users. For those running a node as a collator, the impact might be a halt in block production, slowing down the chain and stopping rewards to delegators. There are many other scenarios that may be relevant to you, and an assessment will help you figure out your exposure in the event of node failure.

CertHum is an active collator on all three Moonbeam networks (Moonbase Alpha, Moonriver, and Moonbeam), and we have also been active validators and collators on many other networks for years. Since we started collating on Moonbase Alpha over a year ago, we've picked up tips from other community collators and combined them with some of our own best-of-breed operational practices to develop backup and recovery plans for the three Moonbeam networks. We think these plans will help us recover from almost any disaster, and we are sharing them with the community to help support better practices across the Moonbeam network of chains. What follows is part one, the introduction, of a four-part deep dive into our practices, including how we arrived at certain decisions and the scripts we use to implement them.

In the following sections of this post, we will describe our current collator deployment design, which supports our backup and recovery plans. In future posts we will dive deeper into the providers we use to store backups and walk step by step through the scripts we use. In part four we will walk through, and provide, the scripts that recover the databases to alternative sites for node recovery, and we will look at how our design enables us to deploy a fully synced node anywhere in the world in less than one hour. After reading this series, any node operator should feel confident creating their own backup and recovery design to support the operations of the Moonbeam network they operate on.

As with everything we do, we are always open to suggestions for improvement, so please feel free to comment or contact us directly if you find a better way to do anything described here. We've also noted some planned improvements to our own practices throughout the series.

High-Level Design — CertHum’s Moonbeam Collator Operations

To understand how we've designed our backup architecture, it helps to first give a view into our collator node deployment. Because of the time it takes to deploy a new, fully synced collator from backup, we never want to be in a position where our collator services rely on recovering from a snapshot. Even if recovery takes less than an hour, that likely means missed blocks on the network and missed rewards for our delegators.

Therefore, we run multiple, fully synced nodes in multiple geographies that are production ready. Production ready means that at any time we can instantly change our active collating node on the respective network to any of the other production ready nodes on the network. The following diagram depicts our node setup on Moonbeam which supports this resiliency.

CertHum’s Backup and Monitoring Design for Moonbeam — High Level

Initially, CertHum maintained two production ready nodes in Europe on separate hosting providers in different countries. In fact, having two countries with different providers was one of the requirements for CertHum to participate in the Moonbeam launch phases at network launch. Since then, due to the war in Eastern Europe, we have added a production ready node in SE Asia because of the potential for a Europe-wide internet attack. We will always keep a minimum of two nodes at any given time, in different countries and on different providers.
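As a concrete illustration of what "production ready" means to us, a lightweight check such as the following (a minimal Python sketch, with hypothetical RPC endpoints and the default Substrate HTTP RPC port) is enough to confirm that each candidate node is peered, not syncing, and at a comparable block height before we would consider failing over to it. The system_health and chain_getHeader methods are standard Substrate JSON-RPC calls exposed by the Moonbeam client.

```python
import requests  # plain JSON-RPC over HTTP

# Hypothetical RPC endpoints for our production ready nodes; 9933 is the
# default Substrate HTTP RPC port, so adjust to your own deployment.
NODES = {
    "eu-node-1": "http://10.0.1.10:9933",
    "eu-node-2": "http://10.0.2.10:9933",
    "sea-node-1": "http://10.0.3.10:9933",
}

def rpc(url: str, method: str, params=None):
    """Make a single JSON-RPC call and return its result field."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(url, json=payload, timeout=5).json()["result"]

def check_nodes() -> None:
    for name, url in NODES.items():
        health = rpc(url, "system_health")    # peer count and sync status
        header = rpc(url, "chain_getHeader")  # best block header
        best = int(header["number"], 16)      # block numbers are hex encoded
        print(f"{name}: best block {best}, peers {health['peers']}, "
              f"syncing {health['isSyncing']}")

if __name__ == "__main__":
    check_nodes()
```

If a node reports few peers, is still syncing, or is lagging the others by more than a handful of blocks, we would not treat it as a viable failover target.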

With three production ready nodes, why keep database snapshots at all? Isn't the CertHum design robust enough to handle two production node failures, making database snapshots overkill?

We make snapshots because redundant nodes and snapshots protect against, and support, two different operational aspects of running a node. Multiple, redundant nodes cover the failure of a node or node component, of provider infrastructure, or of anything else that makes a production node unsuitable for production use (as in the example of a Europe-wide network issue). Snapshots protect against data corruption and facilitate the rapid recovery of a node.

A practical, real-world example helps explain the importance of having a private snapshot. Kusama client version 0.9.11 was released with a caveat that, due to a database migration, the client could not be rolled back to 0.9.10 on a node. If there were a network-impacting issue that required a client downgrade (this has happened once before), then unless you had a snapshot of a database from client 0.9.10 or lower, you would need to resync the chain, which would take multiple days. In the unlikely event something like that were to happen, would you want to be contending for bandwidth with all of the other node operators attempting to download a public snapshot, or would you rather download your own private copy without any contention? Of course, a staggered upgrade process that leaves multiple days between client software upgrades may help mitigate this scenario, but you will still be better off with your own private copy of the chain database snapshot if the time comes when it is needed.
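One small habit that makes a rollback scenario like this easier to handle is embedding the client version in the snapshot file name, so that a pre-migration database can be located at a glance. The snippet below is a minimal sketch of that idea; the directory path and naming scheme are our illustrative assumptions rather than any Moonbeam convention.

```python
from datetime import date
from pathlib import Path

SNAPSHOT_DIR = Path("/backups")  # hypothetical local snapshot directory

def snapshot_name(network: str, client_version: str) -> str:
    # Embed the client version so a pre-migration database is easy to find.
    return f"{network}-{client_version}-{date.today().isoformat()}.tar.gz"

def snapshots_for_version(network: str, client_version: str) -> list:
    # All retained snapshots taken while running the given client version,
    # oldest first.
    return sorted(SNAPSHOT_DIR.glob(f"{network}-{client_version}-*.tar.gz"))

# For example, snapshot_name("moonriver", "0.9.10") produces something like
# "moonriver-0.9.10-2022-09-02.tar.gz".
```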

In addition to the design described above, CertHum also runs a dedicated node for backups. The process of stopping the chain client software, compressing the chain database, and pushing it out to multiple destinations can take many hours. Using a production ready node for this would defeat its purpose, and we never want to be in a position where we need to fail over while a server is in the middle of a backup. The dedicated backup node doesn't need to be a high-specification server; it just needs to be powerful enough to do the work in a reasonable amount of time and have suitable bandwidth to move the large backup files (100GB+) over the internet. We use an inexpensive bare metal server for our backup node.
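At its core, the backup job on that dedicated node is just stop, archive, restart: stop the client service so the database is quiescent, compress the chain data directory, and bring the client back up before shipping the archive off the box. The sketch below shows that sequence in Python; the systemd unit name and database path are assumptions that will differ per deployment, and the full scripts follow in Part III.

```python
import datetime
import pathlib
import subprocess
import tarfile

SERVICE = "moonbeam.service"                      # hypothetical systemd unit name
DB_PATH = pathlib.Path("/var/lib/moonbeam-data")  # hypothetical chain data directory
OUT_DIR = pathlib.Path("/backups")

def take_snapshot() -> pathlib.Path:
    archive = OUT_DIR / f"moonbeam-db-{datetime.date.today().isoformat()}.tar.gz"
    # Stop the client so the database is not written to while it is archived.
    subprocess.run(["systemctl", "stop", SERVICE], check=True)
    try:
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(DB_PATH, arcname=DB_PATH.name)
    finally:
        # Always bring the node back up, even if compression fails.
        subprocess.run(["systemctl", "start", SERVICE], check=True)
    return archive
```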

As shown in the diagram, one of the destinations for the backup files is our public-facing web server, which hosts copies of the chain databases. This server is detached from any critical CertHum infrastructure, and although we follow server hardening processes and protect it with network ACLs, we treat it as untrusted and do not allow it to sit in any CertHum management network or other non-public network.

The other destination for the backup files is CertHum's secure public cloud object storage deployment, hosted on Microsoft Azure's Blob Storage service. Object storage is an inexpensive place to keep these large files, and using a hyperscale cloud provider (Azure, AWS, GCP) allows for rapid recovery anywhere in the world. Before the next part in the series, scope out some of the hyperscale cloud providers if you are not already using them, and look into their free credit offers, which you can use to start setting up your own backup destination.
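Getting the finished archive into Blob Storage can be as simple as the sketch below, which uses the azure-storage-blob Python SDK with a connection string read from the environment; the container name is an assumption, and the other hyperscale providers' SDKs offer equivalent uploads. Azure's storage tiers (hot, cool, archive) also let you trade retrieval speed for cost on older snapshots.

```python
import os

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

def upload_snapshot(archive_path: str, container: str = "snapshots") -> None:
    # Credentials come from the environment rather than being hard-coded.
    conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container=container,
                                   blob=os.path.basename(archive_path))
    with open(archive_path, "rb") as data:
        # Parallel, chunked upload helps with 100GB+ archives.
        blob.upload_blob(data, overwrite=True, max_concurrency=8)
```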

In this post we looked at CertHum's high-level design supporting the three major Moonbeam networks, and described some of the reasons why it's important to create your own risk assessment to determine your backup needs. We also looked at the resilient architecture CertHum uses to support collator operations on the Moonbeam network, and at why multiple nodes without snapshots are not sufficient to protect against the different scenarios that may impact production operations.

Coming next week in Part II of the series, we will detail the information you can use to lay the groundwork for a robust backup architecture for your Moonbeam nodes. We will explain why we use a hyperscale provider as the destination for our snapshots, and look at the dependencies needed to run the scripts we'll provide in Part III and Part IV of the series. Thanks for reading!
