Creating Highly Available Nodes on ICON — Stage 1: Active/Passive Failover with Pacemaker and Corosync

The result of our testnet phase 2 stress testing simulation was devastating.


100% of the nodes were down. This was somewhat expected, as all nodes ran on a single bare-minimum Docker container with testnet specs. The instances were nowhere near the recommended mainnet specs, and most P-Reps had not yet created an HA architecture with any failover or service recovery mechanism. With testnet phase 3 coming up, and full decentralization in less than a month, we should be ready to face extreme conditions and secure our network by keeping nodes alive.

Create P-Rep Node EC2 Instances

P-Rep nodes communicate through port 7100 for gRPC (peer-to-peer communication between nodes) and port 9000 for the JSON-RPC API server. Under Security Groups, create two custom rules for these ports and allow all IPv4 and IPv6 sources. Add port 22 as well so we can SSH into the servers and work directly.

We’ll be installing Corosync as our heartbeat and internal communication layer among cluster resources. Corosync uses UDP transport on ports 5404 to 5406. Let’s enable these ports as well.


We’re aiming to build an Active/Passive configuration, with the passive node acting as a redundant failover node. For this basic setup we’ll need to create two instances and an elastic IP address. Also, we’re going to purposely spin up weaker instances: our ideal test is for the primary peer to fail and the secondary peer to recover the service during our testnet stress testing. At the recommended specs, our primary node could possibly handle all requests without any downtime. We will scale up as soon as we’re done testing the setup.

Once the EC2 instances are created, go to EC2 dashboard -> Elastic IPs. An elastic IP is a static IP address that points to one of your EC2 instances. It allows you to redirect network traffic to any of your instances when needed. This is the address we use when we configure a domain or IP for our node peers to whitelist and make requests against. Now if one of our nodes goes down, we’ll dynamically point the IP to a different node, making the service available again before the faulty node is restored. Let’s assign an elastic IP to one of the P-Rep nodes.


You can also set up a DNS record for this IP; this can be our peer endpoint to whitelist, which will be used to exchange requests.


Install Corosync and Pacemaker

Next we’ll install Corosync (on both servers) as our messaging layer between the cluster servers, and Pacemaker as our cluster resource manager. Corosync is a dependency of Pacemaker, so installing Pacemaker pulls it in automatically.
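On Ubuntu, for example, the install might look like this (package names assumed from the standard Ubuntu repositories):

```shell
# Run on both servers; installing pacemaker pulls in corosync as a dependency
sudo apt-get update
sudo apt-get install -y pacemaker
```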

We’ll also need to install a management shell; some prefer crm and some prefer pcs. Either one will work.
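On Ubuntu, for instance, either shell can be installed from the repositories (package names assumed):

```shell
# crmsh provides the `crm` command used in the examples below
sudo apt-get install -y crmsh
# or, if you prefer pcs:
# sudo apt-get install -y pcs
```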

Verify we have everything installed.

$ corosync -v
Corosync Cluster Engine, version '2.4.3'
Copyright (c) 2006-2009 Red Hat, Inc.
$ crm --version
crm 3.0.1
$ pcs --version

Configure Corosync

Next we’ll need to create an auth key for the cluster, install haveged on either one of the servers, and generate a key.
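On one of the servers, something like the following should do it (haveged keeps the entropy pool filled so key generation doesn’t block):

```shell
# Install haveged to feed the kernel entropy pool
sudo apt-get install -y haveged
# Generates the cluster auth key at /etc/corosync/authkey
sudo corosync-keygen
```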

Copy the same key to PRep-02,

then on PRep-02 window, move the file to the corosync folder
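The copy-and-move might look like this (the ubuntu user and home path are placeholders for your own setup):

```shell
# On PRep-01: copy the key to PRep-02 (user/host are placeholders)
sudo scp /etc/corosync/authkey ubuntu@PRep-02_private_ip:/home/ubuntu/

# On PRep-02: move it into the corosync folder and lock down permissions
sudo mv /home/ubuntu/authkey /etc/corosync/
sudo chown root:root /etc/corosync/authkey
sudo chmod 400 /etc/corosync/authkey
```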

Next we’ll define the corosync.conf file. To make configuration a bit more convenient, let’s jot down the various instance IP addresses that we’ll need, namely the public, private, and elastic IPs.

Now edit /etc/corosync/corosync.conf on both servers. The files are identical except that the bindnetaddr parameter should be each server’s own private IP.

Your config should look something like this

totem {
    version: 2
    # unicast transport; AWS does not support multicast
    transport: udpu
    interface {
        ringnumber: 0
        # this server's private IP (differs on each server)
        bindnetaddr: server_private_ip
        broadcast: yes
        mcastport: 5405
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

nodelist {
    node {
        ring0_addr: PRep-01_private_ip
        name: PRep-01
        nodeid: 1
    }
    node {
        ring0_addr: PRep-02_private_ip
        name: PRep-02
        nodeid: 2
    }
}

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    timestamp: on
}

service {
    name: pacemaker
    ver: 1
}

then start corosync on both servers
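Assuming a systemd-based distribution, starting the service looks like:

```shell
# Run on both servers
sudo service corosync start
```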

Verify that our nodes have joined as a cluster,
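One way to check membership is to query corosync's runtime database; both nodes' private IPs should be listed:

```shell
# Lists the current cluster members and their addresses
sudo corosync-cmapctl | grep members
```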


then start pacemaker
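Again assuming a systemd-based distribution:

```shell
# Run on both servers
sudo service pacemaker start
# Check cluster status; both nodes should show as online
sudo crm status
```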


Our nodes should be online. Since we’re running a two-node setup, both STONITH (a fencing mechanism that shuts down faulty nodes) and the quorum policy should be disabled.
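With the crm shell, disabling both is done through cluster properties:

```shell
# STONITH requires a fencing device we don't have in this two-node setup
sudo crm configure property stonith-enabled=false
# A two-node cluster can never have quorum after one node fails
sudo crm configure property no-quorum-policy=ignore
```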

Verify the configuration.
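```shell
# Prints the full cluster configuration, including the properties set above
sudo crm configure show
```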


Configure AWS CLI

We will be using the AWS CLI for the elastic IP reallocation. For this we first need to install the CLI executables and configure a few settings; you will need an AWS Access Key ID and AWS Secret Access Key.

  1. Log in to your AWS Management Console.
  2. Click on your user name at the top right of the page.
  3. Click on the Security Credentials link from the drop-down menu.
  4. Find the Access Credentials section, and copy the latest Access Key ID.
  5. Click on the Show link in the same row, and copy the Secret Access Key.
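With the keys in hand, the install-and-configure step might look like this on Ubuntu (installing via pip here; AWS also ships a bundled installer):

```shell
# Install pip and the AWS CLI
sudo apt-get install -y python3-pip
pip3 install awscli

# Interactive prompt for Access Key ID, Secret Access Key,
# default region, and output format
aws configure
```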

Next we’ll need a resource agent to manage the elastic IP. We can use AWS’s EIP resource agent awseip, which is found in /usr/lib/ocf/resource.d/heartbeat/awseip.

Also add AWS_DEFAULT_REGION=<AWS-Default-Region> at the end of /etc/systemd/system/

Then we’ll create a primitive resource for the agent to manage. A primitive resource is a singular resource that can be managed by the cluster; that is, the resource can be started only once. An IP address, for example, can be a primitive, and that IP address should be running once and only once in the cluster.

your_elastic_ip is the elastic IP we allocated and associated to PRep-01 earlier, its allocation ID can be found under EC2 Dashboard -> Elastic IPs -> Allocation ID. Check the status again,
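A sketch of the primitive definition, using the awseip agent's elastic_ip and allocation_id parameters (the resource name elastic-ip matches the status output below; the timeout values are assumptions you may want to tune):

```shell
sudo crm configure primitive elastic-ip ocf:heartbeat:awseip \
    params elastic_ip="your_elastic_ip" allocation_id="your_allocation_id" \
    op start timeout=180s \
    op stop timeout=180s \
    op monitor interval=30s timeout=60s
```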


The elastic-ip resource should be started on our first peer node. At this moment, we have an active node (PRep-01), a passive node (PRep-02) and an elastic IP pointing to the active node. Whenever our node becomes inaccessible, the resource agent should automatically point the floating IP to the backup node. Let’s test this.

On a 3rd instance (I am using my local computer here), curl the elastic IP address repeatedly.
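A simple polling loop works for this (the IP is a placeholder for your elastic IP):

```shell
# Fetch the index page from the elastic IP once per second
while true; do curl -s http://your_elastic_ip; sleep 1; done
```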

This should continuously pull the content of the index page. For testing purposes, modify the index page (for nginx the default index page is /var/www/html/index.nginx-debian.html) to show the instance number and IP. Now let’s simulate active node downtime.
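One way to take the active node down, assuming the cluster stack is what we want to kill:

```shell
# On PRep-01: stop corosync so the cluster sees the node as failed
sudo service corosync stop
```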

You may or may not see a service interruption message like this on the 3rd terminal.

But either way, you should see the curl to the elastic IP now showing contents from the backup instance within a few seconds.


Also if you go back to the EC2 dashboard, you’ll notice the elastic IP has been automatically reassigned to the 2nd instance. You can bring the node back up via
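Restarting the stack we stopped earlier should do it:

```shell
# On PRep-01: bring the cluster stack back up
sudo service corosync start
sudo service pacemaker start
```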

At this point we have completed a basic failover setup in active/passive configuration, with a floating IP automatically reassigned when the active node goes down. Stay tuned for stage 2, where we’ll be configuring ICON’s citizen nodes, P-Rep nodes and nginx reverse proxy configurations!
