Our Validation Architecture and Learnings at StakeWith.Us

Michael Ng
Published in stakewith.us
7 min read · Jul 26, 2019

Delegating your tokens is the best way to show your support for the network.

Why??

Because…

  • Your staked tokens contribute to the security of the network. The higher the staked ratio, the harder it is to attack the network;
  • Your staked tokens hold voting power that allows you to participate in governance and determine the direction of the protocol. You are allowed to vote differently from your validator (unlike in a parliament, where you can’t override the decision of your elected representative);
  • You truly own your assets even when you are delegating, and are allowed to effortlessly switch between validators. In a way, you help to keep validators in check and keep incentives aligned for the benefit of the network;
  • And in return for all of the above, you earn network participation yield! Phew, delegation is tough work..

While delegation is usually non-custodial in nature, that doesn’t mean that it is a risk-free activity. Some of the more common risks associated with delegation are:

  • Downtime slashing risk (validator risk);
  • Malicious action slashing risk (validator risk);
  • Liquidity risk as there is an unbonding period (market risk).

We won’t dive much into this, but you can check out our staking slides to learn more about staking, delegator roles and the risks associated with staking. In general, the safety of your funds depends on the setup of your chosen validator(s), and here’s ours!

Network Topology Diagram

Our eventual Validation Architecture! We have an active-passive setup with manual failovers.

Generally, we think that most validators have similar design goals for their validation architecture: a secure and redundant setup with high availability. Several leading validators have disclosed their setups and shared the thinking behind them, and much of their reasoning resonates with us. Do check out the validation infrastructure posts from Iqlusion, Chorus One and Figment Networks.

At StakeWith.Us, we prioritize security (and redundancy) over availability. This does not mean that we do not care about downtime, but that we will always choose the most secure option when operating our validation nodes, even if it means accepting a little downtime.

We have 2 Dell PowerEdge servers that host our validators, located in two Tier 3 Datacenters in Singapore. We use the YubiHSM 2 and have strict internal HSM policies to securely generate roles and set capabilities for validating. All secrets and backups are “multisig-ed” and stored in bank vaults.

Currently, we deploy our sentries across 3 availability zones (US-East, EU and SEA) via 2 service providers (AWS and Hetzner). We are in the process of migrating some of our nodes to Google Cloud, and we hope to eventually engage more service providers to reduce our dependency on the major cloud providers. We can spin up additional sentries at short notice when needed, and we also use private nodes in a separate VPC to connect with other validators.

Our Learnings Thus Far

Our setup and processes have evolved significantly from the time we started validating on the Loom Network Testnet in December 2018 to the time we went live at the genesis of the Cosmos Network in March 2019.

Our initial priorities revolved around understanding how HSMs work, sourcing reliable co-location for our servers and formalizing policies around our staking operations. We settled for a simple setup with sentries in 2 AZs (US-East and SEA) and 2 validation servers in a Tier 2 Datacenter. Metrics from our validator are stored in Prometheus and visualized via Grafana for monitoring.
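As a rough illustration of what such monitoring can check, the sketch below parses two consecutive scrapes in the Prometheus text exposition format and flags a node whose block height has stopped advancing or whose peer count is low. The metric names `tendermint_consensus_height` and `tendermint_p2p_peers` and the `min_peers` threshold are assumptions made for the example, not details taken from our actual dashboards.

```python
from typing import Dict, List


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse samples from Prometheus text exposition format.

    Labels are dropped and optional timestamps are not handled, which is
    enough for a sketch but not for a general-purpose parser.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split(" ", 1)[0]
        value = line.rsplit(" ", 1)[1]
        samples[name] = float(value)
    return samples


def find_problems(previous: Dict[str, float], current: Dict[str, float],
                  min_peers: int = 2) -> List[str]:
    """Compare two consecutive scrapes and list anything worth alerting on."""
    height = "tendermint_consensus_height"  # assumed metric name
    peers = "tendermint_p2p_peers"          # assumed metric name
    problems: List[str] = []
    if current.get(height, 0) <= previous.get(height, 0):
        problems.append("block height has not advanced since the last scrape")
    if current.get(peers, 0) < min_peers:
        problems.append("peer count is below the configured minimum")
    return problems
```

In practice, alerting rules in Prometheus itself would normally do this job; the Python version is only meant to make the logic explicit.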

Lesson 1: Physical infrastructure shifts/upgrades are tedious

We did 2 major physical shifts for our servers within the last 6 months:

  • Shifting our backup server (passive validator) into a separate Tier 3 Datacenter. Typically, Datacenters have a service level commitment to all tenants that allows them to go down for X hours a year for maintenance work. Having both servers in one Datacenter means that there will be times when neither server is operational due to a Datacenter outage. Datacenter redundancy means that we should almost always have 1 server online, since it is extremely unlikely that both Datacenters will be down at the exact same time;
  • Shifting our active validating server (we were live on Cosmos and Loom at that time) into another Tier 3 Datacenter. Yay, upgrade! As the server was in production, we manually switched over all active clients, one at a time, to the passive validator. We then allowed the clients to run for a period of time before physically shifting the (now) backup server into its new home.

It was a tedious process with a lot of contingency planning involved, as there was no backup validator running during the shift. To all aspiring validators, it pays to go for redundancy and quality right from the start!
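To put rough numbers on the Datacenter redundancy argument in the first bullet above, here is a small back-of-the-envelope sketch. It assumes an availability of 99.982% per Datacenter, a figure commonly quoted for Tier 3 facilities and used here purely for illustration (it is not the SLA of our providers), and it assumes that outages at the two sites are independent.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a single site may be down at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_overlap_hours(avail_a: float, avail_b: float) -> float:
    """Expected hours per year that both sites are down at the same time.

    Assumes the two sites fail independently, which is an idealization.
    """
    return (1.0 - avail_a) * (1.0 - avail_b) * HOURS_PER_YEAR


if __name__ == "__main__":
    tier3 = 0.99982  # illustrative assumption, not a quoted SLA
    print(f"single site: {allowed_downtime_hours(tier3):.2f} hours/year")
    overlap_seconds = expected_overlap_hours(tier3, tier3) * 3600
    print(f"both sites:  {overlap_seconds:.2f} seconds/year")
```

Under these assumptions a single site may be down for roughly 1.6 hours a year, while the expected overlap between two sites is on the order of one second a year. Real outages are not perfectly independent (shared upstream networks, regional events), so the true figure will be higher, but the gap is large enough to justify the extra site.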

Lesson 2: VPN really solves your connectivity woes

Prior to using WireGuard, we whitelisted the IP addresses of our sentries (via firewall settings) to allow communication with our validation servers over the public internet. High latency between our sentries and validation servers, combined with frequent “peer pong” errors, caused us to miss blocks occasionally. WireGuard improves our validation setup in 4 ways:

  • Reduces latency between sentry nodes and validation servers by a factor of 3 to 5;
  • Ensures that connections stay healthy by re-establishing handshakes at set intervals;
  • Provides cross-region, cross-provider connectivity;
  • Allows you to establish secure and private subnets with other validator(s) to help with connectivity.
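For readers who have not used WireGuard, the sketch below renders a minimal tunnel configuration in the format read by `wg-quick`, for a validation server that peers with one sentry. All keys, addresses and ports are placeholders, and the use of `PersistentKeepalive` (with the often-suggested value of 25 seconds) is an assumption about how one might keep a tunnel alive across NAT and firewalls, not a description of our exact settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    public_key: str
    allowed_ips: str             # tunnel addresses routed to this peer
    endpoint: str                # public host:port of the peer
    keepalive_seconds: int = 25  # assumed value, tune for your network


def render_wg_config(private_key: str, address: str, listen_port: int,
                     peers: List[Peer]) -> str:
    """Render a wg-quick style configuration file as text."""
    lines = [
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"Address = {address}",
        f"ListenPort = {listen_port}",
    ]
    for peer in peers:
        lines += [
            "",
            "[Peer]",
            f"PublicKey = {peer.public_key}",
            f"AllowedIPs = {peer.allowed_ips}",
            f"Endpoint = {peer.endpoint}",
            f"PersistentKeepalive = {peer.keepalive_seconds}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder keys and documentation-range IP addresses only.
    sentry = Peer("<sentry-public-key>", "10.10.0.2/32", "203.0.113.10:51820")
    print(render_wg_config("<validator-private-key>", "10.10.0.1/24", 51820, [sentry]))
```

Because each peer is just another `[Peer]` section, the same approach extends to sentries at different providers or to a private subnet shared with another validator.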

Lesson 3: Using Consul for service discovery and health checks

Previously, we monitored the status of sentry and validator nodes with bash scripts run as cron jobs. We realized that this introduced a single point of failure for service checking, and we needed a more robust way of monitoring the health of the nodes.

Moving to Consul provided us with the following features:

  • Cluster-wide service registration: any node can be queried for the catalog of sentries currently running and their respective service ports;
  • Service discovery: validator nodes can use human-friendly names for their peers, and services marked as unhealthy are taken out of discovery and cannot be reached by the validator;
  • Health checks: a list of script/HTTP/TCP checks can be registered to check uptime, sync status, etc. on multiple nodes, which avoids a single point of failure.
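To make the discovery step more concrete, the sketch below takes the JSON returned by Consul's health endpoint for a service (`/v1/health/service/<name>`) and keeps only the instances whose checks are all passing, producing `host:port` strings that a validator could use as peers. The service name and addresses in the sample are made up, and the function only shows the filtering logic; it is not the tooling we run in production.

```python
import json
from typing import List


def healthy_addresses(health_json: str) -> List[str]:
    """Return host:port for every instance whose checks are all passing."""
    result: List[str] = []
    for entry in json.loads(health_json):
        checks = entry.get("Checks", [])
        if any(check.get("Status") != "passing" for check in checks):
            continue  # skip instances with a warning or critical check
        service = entry["Service"]
        # Consul leaves Service.Address empty when it equals the node address.
        host = service.get("Address") or entry["Node"]["Address"]
        result.append(f"{host}:{service['Port']}")
    return result


if __name__ == "__main__":
    # Hypothetical response for a service named "cosmos-sentry".
    sample = json.dumps([
        {"Node": {"Address": "10.10.0.2"},
         "Service": {"Service": "cosmos-sentry", "Address": "", "Port": 26656},
         "Checks": [{"Status": "passing"}]},
        {"Node": {"Address": "10.10.0.3"},
         "Service": {"Service": "cosmos-sentry", "Address": "", "Port": 26656},
         "Checks": [{"Status": "critical"}]},
    ])
    print(healthy_addresses(sample))
```

Consul can also do this filtering on the server side when the query includes the `passing` parameter; the client-side filter is shown here so that the rule is visible.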

Lesson 4: Using Terraform to manage our multi-cloud infrastructure

We use Terraform for planning and provisioning cloud resources to host the sentry node applications for all our projects. Each Terraform file is checked into version control, peer reviewed and deployed in a similar process to how software code is managed. Terraform provides us with the following features:

  • Multi-Provider: cloud resources for multiple cloud providers can be specified in the same standardized grammar;
  • Infrastructure Planning: each planned change is diff-ed before application, so that staff can review it before it is applied.
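The review step can also be gated automatically. The sketch below wraps `terraform plan` with the `-detailed-exitcode` flag, which makes the command exit with 0 when there is nothing to change, 1 on error and 2 when changes are pending. The wrapper functions, the working directory and the plan file name `tfplan` are hypothetical; this is one way a pipeline could surface a pending diff for peer review, not a description of our actual process.

```python
import subprocess
from typing import List, Tuple

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "no-changes", 1: "error", 2: "changes-pending"}


def interpret_plan_exit_code(code: int) -> str:
    """Translate the plan exit code into a label a pipeline can act on."""
    return PLAN_OUTCOMES.get(code, "unknown")


def plan_command(out_file: str = "tfplan") -> List[str]:
    """Build the command line; the plan is saved so the reviewed diff is what gets applied."""
    return ["terraform", "plan", "-detailed-exitcode", "-input=false",
            f"-out={out_file}"]


def run_plan(workdir: str) -> Tuple[str, str]:
    """Run the plan in workdir and return (outcome, human-readable diff)."""
    proc = subprocess.run(plan_command(), cwd=workdir,
                          capture_output=True, text=True)
    return interpret_plan_exit_code(proc.returncode), proc.stdout
```

A reviewer would then read the returned diff, and only a saved plan that has been approved would be passed to `terraform apply`.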

Our Future Plans

We are constantly exploring how we can make use of existing tools (or build our own) to improve our infrastructure. Some of our work in progress (WIP):

WIP 1: Implementing Nomad to assist with container management

One of the challenges of running a validation service is that each blockchain project we validate for has frequent patches/upgrades, which have to be rolled out across all sentry and validator nodes during forks. As the number of nodes grows, this task becomes more tedious and prone to human error during the roll-out. Hence, we need to automate the upgrade process for our nodes in a scalable manner.

Nomad is a container orchestration tool that can handle long-running services and batch jobs in a cluster. Workloads are specified at the job-group-task level, with support for constraints and affinities. It provides us with the following useful features:

  • Ability to assign task groups to nodes by using constraints with meta tags;
  • Flexible choice of deployment strategies, such as Blue/Green or Canary, for client updates;
  • Ensures that all clients on nodes run the same image;
  • Batch job scheduling to automate changes that need to be deployed across the cluster.
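As a sketch of how the job-group-task structure, a meta-tag constraint and a canary update fit together, the snippet below builds a job definition as a Python dictionary and serializes it to JSON. The field names follow Nomad's JSON job format as we understand it and should be checked against the Nomad documentation before use; the `role` meta tag, the datacenter name `dc1` and the image name are hypothetical.

```python
import json
from typing import Any, Dict


def build_sentry_job(job_id: str, image: str, count: int) -> Dict[str, Any]:
    """Build a service job that only runs on nodes tagged meta.role = sentry."""
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "service",
            "Datacenters": ["dc1"],  # hypothetical datacenter name
            # Roll out one canary first, then one instance at a time.
            "Update": {"MaxParallel": 1, "Canary": 1, "AutoRevert": True},
            "TaskGroups": [{
                "Name": "sentry",
                "Count": count,
                # Constraint on a client meta tag (tag name is an assumption).
                "Constraints": [{
                    "LTarget": "${meta.role}",
                    "RTarget": "sentry",
                    "Operand": "=",
                }],
                "Tasks": [{
                    "Name": "node",
                    "Driver": "docker",
                    # Pinning one image tag keeps every node on the same build.
                    "Config": {"image": image},
                }],
            }],
        }
    }


if __name__ == "__main__":
    job = build_sentry_job("cosmos-sentry", "example/gaiad:v0.0.0", count=3)
    print(json.dumps(job, indent=2))
```

Upgrading a client then becomes a change to the image tag in one place, with the scheduler handling the node-by-node roll-out.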

WIP 2: Shifting part of our sentries to Google Cloud

We aim to diversify our cloud dependencies in a progressive, sustainable fashion — we hope to see a fully decentralized compute network which can be utilized for deploying sentries!

WIP 3: Building a custom JavaScript KMS connector

We are actively exploring new ways to secure our validator setup with more protection against double signing and, at the same time, to provide highly available signing systems that support multiple projects. Towards this end, we aim to spec and build a JavaScript library for interfacing with the YubiHSM, and eventually a lite Tendermint signer that can be deployed as a distributed system.
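One common form of double-sign protection is for the signer to remember the last height, round and step it signed, and to refuse anything that goes backwards or conflicts with it. The sketch below illustrates that rule in Python. It is not the design of the planned library: the class names are hypothetical, the state lives only in memory, and a real signer would have to persist it durably before releasing any signature and share it safely between instances.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(order=True, frozen=True)
class SignPosition:
    """Consensus position, compared as the tuple (height, round, step)."""
    height: int
    round: int
    step: int  # e.g. 1 = proposal, 2 = prevote, 3 = precommit (assumed encoding)


class DoubleSignError(Exception):
    """Raised when signing would regress or conflict with a past signature."""


class SignGuard:
    def __init__(self) -> None:
        self._last: Optional[SignPosition] = None
        self._last_digest: Optional[bytes] = None

    def authorize(self, position: SignPosition, digest: bytes) -> None:
        """Record the request if it is safe to sign, otherwise raise."""
        if self._last is not None:
            if position < self._last:
                raise DoubleSignError("request is older than the last signed position")
            if position == self._last and digest != self._last_digest:
                raise DoubleSignError("conflicting payload at an already signed position")
        # Re-signing the identical payload at the same position is harmless.
        self._last = position
        self._last_digest = digest
```

A signer would call `authorize` before asking the HSM for a signature, so that a misconfigured failover or a replayed request is rejected before any key material is used.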

Ending Note

At StakeWith.Us, we continue to stand by our view that best practices for security and uptime will eventually be commoditized. We will continue to play our part by sharing our research and learning experiences with the wider community. We hope that this post has been helpful to all delegators and validators alike!

Signing off here — till next time!

May all your validation servers be safe and sound. Credits: 9gag.com and memegenerator.net

And also, wishing all validators a fabulous yet stable year!

Special thanks to Oliver Wee for co-authoring this article.

StakeWith.Us is a secure Staking-as-a-Service provider for leading blockchain projects. Put Your Crypto to Work — Hassle Free.

Here are our staking guides for Loom Network, Cosmos Network and Terra Network.

To get more of our validation updates, please follow StakeWith.Us on Twitter, Telegram and Medium. If you are interested in joining our WeChat group for Loom, Cosmos or Kava, kindly approach gobigordietrying or mcry89 on WeChat. Also, subscribe to our monthly Staker Digest if you want to learn more about our staking projects!

Alternatively, reach out to Earn@StakeWith.Us if:

  • you have any burning queries for us;
  • you are a project looking for a professional validator;
  • you are looking into investment and partnership opportunities with StakeWith.Us.

Co-Founder @MWPartners and @StakeWithUs. Find me on Twitter @maigoh91. I try to learn new things every day.