Creating a Cosmos Validator from 0 to Hero

with Cosmovisor, Prometheus, Grafana, YubiHSM 2, and TMKMS!

rMalakaib
Oregon Blockchain Group
29 min read · Feb 20, 2023

--

Single-handedly, this has been the largest project of my life, except for the Interchain Federalist Papers. I’ve put five months of consistent discipline into learning the command line, the internal operations of computers, how to validate, and writing this. Over 500 hours went into this, and I’m happy to present it to you.

https://www.reddit.com/r/cosmosnetwork/comments/lkl3yx/welcome_to_cosmos_network/

The continued subsistence of public blockchains depends on our ability to educate new and novice individuals. Anyone who wishes to contribute a verse to ledger tech should have the resources to do so. Currently, this is not the case.

Over the last five months, I’ve experienced what it is like to be a “novice individual” learning how to run node infrastructure for blockchain. While the documentation does its best, I noticed instantly that all of it was written by engineers who had an in-depth understanding of the technology. Rightfully so; who else would write it? As I’m sure you’ve surmised, it is challenging for newcomers to operate a node. Typically, the documentation writers assume the reader has prerequisite knowledge of node operation.

I make no assumptions in these Validator Docs, with the goal that anyone can understand and run a validator on-chain!

I owe a huge thank you to Simply Staking and Gianluca Cremona for answering all my questions, as well as Roea Mortaki from Quicksilver and Valeria Salazar from Archway for taking a chance on me, and Deborah Simpier from Althea Network for pushing me to run a node. Without them, these papers would not be possible.

As always, I am no expert and love to learn. If you have constructive feedback, I’m open to it as long as it’s done in a kind manner. I’ve done my best to be certain of the content in here, yet inevitably something will be incorrect.

I’m a part of Oregon Blockchain Group, a student-run organization at the University of Oregon. Our objective is to validate on mainnet to teach students how to work in real-world environments, fund student blockchain education programs, fund learning experiences like conferences, and generate new talent. We’d love to talk if you’re interested in Oregon Validation Group running on your chain.

Agenda:

  1. Install Ubuntu — link
  2. Build and Install Validator
  3. Set up Cosmovisor — to update the validator
  4. Install Grafana and Prometheus — to monitor the validator
  5. Install YubiHSM 2 and TMKMS hard sign — to secure the validator

(Disclaimer: OVG is separate from the University of Oregon but works closely with Oregon Blockchain Group, and teaches only OBG members)

By Robert Burkhart, licensed under Creative Commons Attribution 4.0 International.

Anyone can copy and paste into a terminal. It is my objective to have you understand what you’re doing.

This documentation is written for the Junox testnet but can be applied to most Cosmos chains, and it requires a Linux OS. Paste commands directly into the terminal, editing them where indicated. We made an effort to keep the information as concise as possible, and you’ll notice these papers are still quite long.

Preparing your machine:

Sudo translates to superuser do. Imagine it like royalty in a castle. Some people tend to the farms and only execute those tasks. Within our castle’s system, message carriers notify the masses when the royalty has come to a decision on what to do with their collective resources. The monarchs have supreme power over the land and dominion over the people. Likewise, on a computer, sudo allows us to make decisions as the monarchs. With immense power, we must be careful what we enter with sudo because we could screw up the works.

When sudo prompts for your password, what you type is invisible. All you need to do is type your password and hit enter.

Installing Dependencies

#Begin by opening a terminal with ctrl + alt + t

sudo apt-get update

sudo apt-get upgrade -y

sudo apt-get install build-essential -y

sudo apt install curl

sudo apt install jq -y

sudo apt install git -y

sudo apt install vim -y

sudo apt-get update

Above, we are updating package dependencies and installing new packages such as jq, which reads JSON, git, which pulls remote repos, and vim, which is an editor. We’ll need all of this for the validator. It was a design decision of these docs to keep the manual editing of configs minimal. You’ll only need to use vim twice.

Installing GOLANG

sudo wget https://golang.org/dl/go1.18.linux-amd64.tar.gz

sudo tar -xvf go1.18.linux-amd64.tar.gz

rm -rf go1.18.linux-amd64.tar.gz

Tar Breakdown: We’re taking a tar file and using the tar command with the -xvf flags to extract the archive. The archive is extracted from the tarball with x, the files are displayed as they’re processed with v (verbose), and the archive’s filename is specified with f. If you’ve ever zipped a directory (folder) or unzipped one, this is what we’re doing here.

sudo mv go /usr/local

mkdir -p ~/go/{bin,src}

echo "GOROOT=/usr/local/go" >> ~/.bashrc

echo "GOPATH=~/go" >> ~/.bashrc

echo "PATH=$PATH:$GOROOT/bin:$GOPATH/bin" >> ~/.bashrc

source ~/.bashrc

echo $GOROOT, $GOPATH, $PATH

Go Path Breakdown: Next, using mv we’ll move the go directory we just extracted to the correct path at /usr/local/.

We use mkdir to make a directory. The -p flag creates parent directories as needed. For example, mkdir -p test/sub_directory will create both test and sub_directory, while mkdir -p test/{sub_directory,sub_directory_2} will make the primary directory test with two parallel sub_directories.

Next, we set the path variables. The goal is to tell GOLANG where it can find the binary required to run. We export those variables to the bashrc file, a script executed whenever a new shell session starts. The function of the bashrc file is to set our system’s environment variables.

The binary is a chain’s software that validates transactions and blocks of the chain.
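As a quick sanity check that Go is wired up (assuming you sourced the paths above), ask for its version:

go version
#should return something like: go version go1.18 linux/amd64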

Installing Juno Testnet:

git clone https://github.com/CosmosContracts/juno && cd juno

git fetch

git checkout v13.0.0-beta.2

make install

make && cd ~

#Ignore errors

We’re using git clone to make a copy of the Juno chain’s repository. We need to do this because a Makefile in the repo allows us to compile the binary for whichever git tag we check out. Git checkout navigates between branches and tags. Make install builds the binary and places it in ~/go/bin.

There are a lot of learning curves with validation. Still, the one that caused the most annoyance is that every chain has a different method of installing its binary and repositories. Some chains make it easy, like Juno, while others make it more difficult. If you don’t find a make install target for the chain whose binary you’re downloading, you need to find the binary and install it directly into your go/bin folder. As explained later, if this is the case you’ll need to install the binary manually in your cosmovisor/genesis/bin directory.

But we can ignore this because Juno makes it easy with their Makefile.

Now that the binary is compiled on our system, we’ll create a name for our validator. Create something unique with no spaces at all, then replace <name of node> (the angle brackets too). The name you choose is public-facing.

echo "MONIKER_JUNOX=<name of node>"  >> ~/.profile

We have our name and binary. We’ll then save our host address to utilize later. The $USER variable returns the username of your machine.

echo "HOST=$(hostname -I | awk '{print $1}')" >> ~/.profile

source ~/.profile

echo $USER, $MONIKER_JUNOX, $HOST

Preparing your validator:

We’ll initiate our journey by creating a genesis file and replacing it with the correct version. First, what is a genesis file?

Genesis File Breakdown: A genesis file defines the initial state of the blockchain, like initial token allocation, genesis time, and default parameters. For more detailed information, head here.

The follow-up question is, why create a genesis file and then replace it? I needed clarification about this as well.

This seemingly redundant operation exists because when you create your own genesis file, it also initializes an app.toml, client.toml, and config.toml, which is incredibly important. With these toml files we are able to configure our validator.

The tree of our files looks as such:
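The original screenshot isn’t reproduced here, but a typical ~/.juno tree looks roughly like this (exact contents vary by chain version):

~/.juno
├── config
│   ├── app.toml
│   ├── client.toml
│   ├── config.toml
│   ├── genesis.json
│   ├── node_key.json
│   └── priv_validator_key.json
└── data
    └── priv_validator_state.json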

In our config the three main files we should understand are app.toml, config.toml, and client.toml.

  • Config.toml contains all the configurations for our node
  • App.toml contains all the specifications for how we handle the state of the chain we validate
  • Client.toml defines how the CLI client talks to our node: which keyring backend stores our keys, which chain-id to use, and how transactions are broadcast.

In our data directory, we store application data, priv_validator_state.json and much more.

  • The application and blockstore databases are self-explanatory
  • The priv_validator_state.json stores what round, and what step in that round, the last signing operation occurred.

We’ll then install the new genesis file and remove the old one.

junod version
# should return v13.0.0-beta.2

junod init $MONIKER_JUNOX --chain-id uni-6

wget -O genesis.json https://snapshots.polkachu.com/testnet-genesis/juno/genesis.json --inet4-only

rm -r ~/.juno/config/genesis.json && mv genesis.json ~/.juno/config

shasum -a 256 ~/.juno/config/genesis.json

Compare the hash you receive from the shasum command above to this: 4c90e8bfce9b7fab824b923cbb7bdddf276f1040f3651fdb5304a4289147ea90

They should be matching!
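If you’d rather have the shell do the comparison for you, shasum can check the hash directly (an optional check using the expected hash above; note the two spaces between hash and filename):

echo "4c90e8bfce9b7fab824b923cbb7bdddf276f1040f3651fdb5304a4289147ea90  $HOME/.juno/config/genesis.json" | shasum -a 256 -c
#should print: .../genesis.json: OK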

Now that we have installed the correct genesis file, we’ll need to sync our node to the chain’s current block height. There are three ways to do this, the first is tedious, and the second/third involve trust but are multiple times more efficient.

The First Option to Sync:

Syncing to the chain from 0. Depending on the height of the ledger, this may be a viable option, but it could take days or even weeks on a long chain like the Cosmos Hub. To do this, you need to run the correct binary for the proper range of blocks.

  • For example: Chain A’s height is 100,000 at block height 25,000, the chain upgraded to binary version 2.0.0 from 1.0.0.
  • Then at block height 75,000, the chain upgraded to binary version 3.0.0 from 2.0.0, and its current binary is 3.0.0.
  • To sync correctly, we must change our local binary at every height where a step shift in binary took place.

Ledgers function deterministically. If we sync with the wrong binary over the improper range, we will end up with an AppHash error. An AppHash error is unrecoverable and happens when you corrupt your local database at ~/.juno/data. To recover your node, you must do tendermint unsafe-reset-all (DON’T ENTER THIS COMMAND; WE’RE JUST EXPLAINING). Understandably, while this method requires minimal trust, performing these tasks correctly is challenging for new node operators.

For a minute, we digress. The primary difference between the other two solutions is syncing from 0 fetches and applies all historical blocks.

The ledger is broken into two pieces:

  • required data for a node to be a valid instance of the chain to make the next block
  • the state which is everything that must be permanent.

We don’t have to store all state of the chain, only the data required to be a valid instance of the chain. The difference here is called archival or pruned.

  • Archival Node: has all historical state, saves IBC transactions, and contains the information needed to be a valid node of the chain.
  • The Pruned Node: contains only the information needed for a node to be a valid instance of the chain.

In summation, syncing from 0 could take weeks but comes with the added benefit of archival data. In contrast, the second and third solutions exist to hasten the process with the disadvantage of being pruned.

The Second Option to Sync:

Syncing to the chain through a snapshot. A snapshot is the minimal application state needed to sync to the chain. The node will contain no historical state from previous heights, and the snapshot will be added to the database located at ~/.juno/data. To attain the snapshot, you have to download it from a third party.

The Third Option to Sync:

State-sync is very similar to a snapshot in that they fetch the same data; the two processes just go about it in different ways. State-sync fetches data through peers on the chain rather than through a third-party download.

When state-syncing, our node must verify the light client it is pulling from. It’ll need an RPC server, a trusted height, and a Block ID hash of the trusted height.

How Does State-Sync Apply:

To explain the script below.

  • We are utilizing Polkachu’s RPC called SNAP_RPC
  • At BLOCK_HEIGHT we’re querying the height data from the chain and subtracting 2000 most likely because that is the snapshot interval
  • After executing those calls, we query the RPC endpoint at the snapshot blockheight for the TRUST HASH.
  • Everything after the term sed is data being written to the config.toml file at target directory ~/.juno/config.

You’re going to type vim state_sync.sh, creating a blank text script file. Hit i, paste in all the information below, hit escape twice, type :wq, and press enter.

vim state_sync.sh
#!/bin/bash


SNAP_RPC="https://juno-testnet-rpc.polkachu.com:443"


LATEST_HEIGHT=$(curl -s $SNAP_RPC/block | jq -r .result.block.header.height); \
BLOCK_HEIGHT=$((LATEST_HEIGHT - 2000)); \
TRUST_HASH=$(curl -s "$SNAP_RPC/block?height=$BLOCK_HEIGHT" | jq -r .result.block_id.hash)


sed -i.bak -E "s|^(enable[[:space:]]+=[[:space:]]+).*$|\1true| ; \
s|^(rpc_servers[[:space:]]+=[[:space:]]+).*$|\1\"$SNAP_RPC,$SNAP_RPC\"| ; \
s|^(trust_height[[:space:]]+=[[:space:]]+).*$|\1$BLOCK_HEIGHT| ; \
s|^(trust_hash[[:space:]]+=[[:space:]]+).*$|\1\"$TRUST_HASH\"|" $HOME/.juno/config/config.toml

Now that we have written our script to marshal all the specifications we need into our config.toml, we must make the file executable.

chmod +x state_sync.sh

Concluding with the state sync setup, we will run the file.

sh -v state_sync.sh
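To double-check that the script wrote what we expect, you can print the [statesync] section of config.toml (an optional sanity check):

grep -A 12 "\[statesync\]" ~/.juno/config/config.toml
#enable should be true, and rpc_servers, trust_height, and trust_hash should be filled in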

Setting Up Cosmovisor and Finalizing Our Node:

We’re getting to the fun parts now!

Below you’ll find a command called sed, which stands for stream editor. We combine this with regular expressions to match many characters of different types. The -i flag allows in-place editing. In non-engineer speak, that translates to: if I have the word “DOG” in a file, then I can search for “DOG” and edit it in place to “CAT”. Be careful: the -i flag replaces the original file with the edited copy. With -i.bak, we keep a .bak file with the original file contents.

To continue, you’ll notice -e as another flag: it supplies a script using basic regular expressions, while -E enables extended regular expressions. We leave understanding the syntax of regular expressions to you and carry on with setting up our node.
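To make the DOG/CAT example above concrete, here is a throwaway illustration that is safe to run anywhere:

echo "DOG" > pets.txt

sed -i.bak -e "s/DOG/CAT/" pets.txt

cat pets.txt
#prints CAT

cat pets.txt.bak
#prints DOG, the original preserved by -i.bak

rm pets.txt pets.txt.bak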

sed -i.bak -e "s/^persistent_peers *=.*/persistent_peers = \"f27f3afa10b62cb7f241f4f958fc3fd2e46a9f27@65.109.23.114:12656,e0fbaf1ef89afad23444e67b334bdf78a4b598fd@65.108.71.92:52656,edc35b09613096598e20f8508c977806093d7eec@194.61.28.217:26656,ed90921d43ede634043d152d7a87e8881fb85e90@65.108.77.106:26709,ec41af656b3450050ae27559b66b877373c44861@65.21.122.47:26656,4a91597dfe3ec715bbf6def225066fbb6ad86cfe@207.180.204.112:36656,f79ce2fab55e56b408d76ddcbc1c82c1a90e315b@54.74.146.114:26656,51f9e32a76d738c51dfa353917cef10729b6a600@161.97.118.84:26656,d81758e6a9044c6247a3ff70e29d4a86ff1a46fc@65.109.90.33:12656,51f7c671de697f6cc7d12f0485592f288c27a408@65.108.138.80:12656\"/" ~/.juno/config/config.toml

Above, we are altering our config file to contain a list of peers. At the node’s initialization, it will dial (call) a set of seed or Persistent peers.

  • Seed: Connects and queries other peers’ address books to update the persistent peers in the node’s local address book and then disconnects.
  • Address Book: Contains a list of peers the node tries to connect with.
  • Persistent Peers: the peers that the node makes an effort to connect to permanently. If the connection fails, the node will redial. Peers are used to pass information to or receive information from.
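Later, once your node is running, you can confirm it actually found peers by querying the local RPC (an optional check; 26657 is the default RPC port):

curl -s localhost:26657/net_info | jq .result.n_peers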

Now we use stream editor and extended regular expressions to change our minimum gas price from 0 to a denomination of junox. As far as we know, this is standard practice for validators, because if all validators required 0 gas cost, the blockchain could be hit with a Distributed Denial of Service attack of free transactions.

sed -i.bak -E "s/^minimum-gas-prices *=.*/minimum-gas-prices = \"0.0025ujunox\"/" ~/.juno/config/app.toml

We’ll narrow our focus to setting up cosmovisor by starting with installing the cosmovisor repo.

You’ll notice that these exports do not include a write to our ~/.profile file, which means that if we were to close our terminal, we’d lose these environment variables. Keep your terminal open. To walk you through the variables we are implementing:

  • the DAEMON_NAME is the name of our binary
  • DAEMON_HOME is the location of the config and data directories
  • DAEMON_DATA_BACKUP_DIR backs up the data from cosmovisor.

go install github.com/cosmos/cosmos-sdk/cosmovisor/cmd/cosmovisor@v1.0.0
# This will take a while just be patient

export DAEMON_NAME=junod

export DAEMON_HOME=$HOME/.juno

export DAEMON_DATA_BACKUP_DIR=/root/backups

echo $DAEMON_NAME, $DAEMON_HOME, $DAEMON_DATA_BACKUP_DIR

mkdir -p $DAEMON_HOME/cosmovisor/{genesis/bin,upgrades}

cp $(which junod) ~/.juno/cosmovisor/genesis/bin/

cosmovisor version
#ignore errors

The following steps create a systemd service file. A service file is one kind of unit file. Unit files use unit configuration to encode the information needed to execute a process controlled by systemd. A unit is any entity managed by systemd, and systemd handles process, network, and login management.

We path it to /etc/systemd/system because this is where custom unit files live. /etc (pronounced “et-see”) has the highest priority and overrides unit files in other directories.

  • [Unit] contains a description and a network service dependency.
  • [Service] holds environment variables, the process you intend to start with ExecStart, the restart parameters, and LimitNOFILE, which caps the number of file descriptors the process may hold open.
  • [Install] states that once the system is up and multi-user login is available, we want the unit we created to be started; boot is considered finished once this unit/service file has started.

Our service file with Unit, Service, and Install is below. The tee command reads the input we’re passing and writes it both to the file we name (junoval.service) and to standard output; the > /dev/null redirect discards the standard-output copy, so the content lands only in the service file.

Enter the command below:

sudo tee /etc/systemd/system/junoval.service > /dev/null << EOF
[Unit]
Description=Junox Testnet Validator
After=network-online.target
[Service]
User=$USER
ExecStart=/home/$USER/go/bin/cosmovisor start
Restart=always
RestartSec=3
LimitNOFILE=4096
Environment="DAEMON_NAME=junod"
Environment="DAEMON_HOME=$HOME/.juno"
Environment="DAEMON_ALLOW_DOWNLOAD_BINARIES=false"
Environment="DAEMON_RESTART_AFTER_UPGRADE=true"
Environment="DAEMON_LOG_BUFFER_SIZE=512"
[Install]
WantedBy=multi-user.target
EOF

Daemon-reload forces systemd to reload all unit files. Systemctl enable ensures that our service file starts on boot.

systemctl daemon-reload

systemctl enable junoval.service

systemctl restart junoval.service

The keyring backend holds all the private information about the keys we create. There is an issue with nodes using the os backend on linux, so we’ll switch it to file. We’ll also set the chain-id in our client.toml.

sed -i.bak -E "s|^(keyring-backend\s*=\s).*$|\1\"file\"|" ~/.juno/config/client.toml

sed -i.bak -E "s|^(chain-id\s*=\s).*$|\1\"uni-6\"|" ~/.juno/config/client.toml

With the creation of our service file, and the setup of our node almost complete we can add a key.

The following --keyring-backend file flag will prompt you to enter your keyring-backend password. When entering the password for the keyring, it is invisible. Just hit enter after you’ve typed the password in.

The junod command will generate a key with a mnemonic. DO NOT LOSE THIS; write/save it in a secure location.

junod keys add junokey --keyring-backend file

If the following curl command returns "catching_up": true, enter the command again every minute until it returns "catching_up": false.

curl -s localhost:26657/status | jq .result | jq .sync_info

Once your node is caught up, we can begin Initializing your Validator after we explain cosmovisor.

How Does Cosmovisor Work?

Side note: these actions shouldn’t be performed now, but in case you need to update your node, you’ll want to know them.

There are two types of chain upgrades: binary upgrades, and chain version & binary upgrades. We’ll elaborate on only the former in these papers due to the latter’s greater complexity.

The Cosmos SDK has multiple modules. The one we desire to focus on is the Upgrade Module. An upgrade has an upgrade proposal which can be queried as such: junod q gov proposal <proposal_ID>. The proposal ID is the number assigned to the proposal.

We provide an example below:

You’ll notice that the proposal ID is 1, so our input would look like this: junod q gov proposal 1. Our operation will return:

Our question is, how do we upgrade the chain’s binary using cosmovisor?

Located at ~/.juno/cosmovisor/upgrades we need to make a folder with the EXACT name as the proposal plan above.

  • We’ll do: mkdir -p ~/.juno/cosmovisor/upgrades/v12/bin && cd ~/.juno/cosmovisor/upgrades/v12/bin

Now that your working directory (the directory you’re currently in) is bin, you’ll manually download the binary into this folder. To find the new binary that needs to be installed, enter this command into the terminal to view the upgrade plan: junod q upgrade plan.

  • For example, assuming we’re in ~/.juno/cosmovisor/upgrades/v12/bin and the binary were v12.tar.gz
  • We’d extract it with tar -xvf v12.tar.gz
  • Then rm -r v12.tar.gz
  • Next chmod +x v12
  • Now we’ve installed the new binary

When the chain hits the upgrade height at 23400 as specified above, cosmovisor will automatically switch binaries, and you’ll have to systemctl restart junoval.service.

The next step is preference-oriented. AFTER THE UPGRADE IS FINISHED, we remove the old binary located in ~/go/bin and copy the binary from ~/.juno/cosmovisor/upgrades/v12/bin to ~/go/bin.

That might be confusing, so let’s drill further down.

  • After the upgrade, we removed the old binary from ~/.juno/cosmovisor/genesis/bin.
  • We moved the new binary from ~/.juno/cosmovisor/upgrades/v12/bin to where our old binary used to be ~/.juno/cosmovisor/genesis/bin.

BUT recall that at ~/go/bin we still hold the old binary from before the v12 upgrade. If we ran junod q gov proposal 1 it would fail because the go binary is not updated to the chain’s v12 binary version. So we must change it to the new version as well.

As we stated, these actions are preference-oriented because you could similarly utilize the cosmovisor command to execute junod tasks. We do this because every doc on how to run a validator uses the chain’s binary name rather than cosmovisor. Hence we copy the new binary to ~/go/bin to keep our commands consistent with others.

  • To perform this task, the commands are: rm -r ~/go/bin/junod && cp ~/.juno/cosmovisor/genesis/bin/junod ~/go/bin

Onward to creating your validator.

Initializing Your Validator

A validator is a node that proposes a block to the network of validators that run the protocol, executes transactions in a block, and commits the block. A validator can validate the network in two ways:

OPTION ONE: It sends a proposal of a block to the network.

  • Then a process known as prevote occurs where ⅔ of the network vote to prepare the network to commit the block.
  • Following the prevote, a precommit procedure occurs where the validator understands, since it received a prevote, the network is ready to vote to commit the block
  • The validator signs/broadcasts a precommit for said block.
  • In the end, ⅔ of the network votes to precommit, and your validator computes the state and commits the block.

OPTION TWO: it receives a block proposal from a different validator.

  • Then prevote occurs where ⅔ of the network vote to prepare the network to commit the block.
  • Following the prevote, it receives a precommit that tells us if it was valid or not (in more technical terms nil or non-nil).
  • If it is valid, we commit the block and continue to the next block.
  • If not, we will forward straight to the next block.

We can visualize this process below.

We’ll notice below that our validator received a proposal block, then confirmed the block is committed, executed the transactions, and fully committed.

Imaged below, a granular sub-task is executed within the above more general process: the prevote and precommit signatures. For every block, there is one prevote and one precommit.

Below, three blocks are being finalized. The first two blocks are non-nil, so we can observe some numbers at the end of the prevote and commit. The last prevote and precommit is nil, meaning that we (the network) decided to move on to the next block for possibly many reasons.
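If you’d like to peek at this yourself once your node is running, Tendermint’s RPC exposes the live round state (an optional aside; the exact output shape varies by version):

curl -s localhost:26657/consensus_state | jq '.result.round_state."height/round/step"'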

With both operations, “Tendermint ensures that no two validators commit a different block at the same height” (Tendermint: Byzantine Fault Tolerance in the Age of Blockchains, Ethan Buchman, p. 30).

We understand how a validator functions, but how is it different from a regular network user?

A validator node is different from a user-side wallet. The rules of how a key is generated are the same; it requires a prefix, a bech32 data part, and a checksum.

A typical address could be juno1rps4vwvq74k73q4gewdd5n4eyvzpcsmhsdmslc.

  • juno as the prefix
  • 1 as the separator, followed by rps4vwvq74k73q4gewdd5n4eyvzpcsmhs as the data part
  • dmslc as the checksum.

A validator node has three keys:

  • One, the account key with the prefix juno, which holds the validator’s balance and the right to claim validator rewards
  • Two, the validator key is used to create txs or events that modify validator parameters, and it looks like junovaloper1rps4vwvq74k73q4gewdd5n4eyvzpcsmh0sdlyp
  • Three, the valcons key is utilized to participate in consensus. The valcons key looks like junovalcons1eg40uhtznkznvujvv8wht26wlhe6uu8x7eqazc.

junod keys show junokey --bech acc | grep address

#Head to the discord channel at https://discord.gg/juno

Head to the discord channel, find the faucet under the developer section and do $request <address> using the address that was returned by the aforementioned command.
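As an aside, the other two addresses from the list above can be displayed too (junod tendermint show-address prints the valcons address; the valoper form reappears when we delegate below):

junod keys show junokey --bech val | grep address

junod tendermint show-address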

Create a validator node with this transaction:

junod tx staking create-validator \
--from=junokey \
--amount=1000000ujunox \
--moniker=$MONIKER_JUNOX \
--chain-id=uni-6 \
--commission-rate=0.05 \
--commission-max-rate=0.5 \
--commission-max-change-rate=0.1 \
--min-self-delegation=1 \
--fees=500ujunox \
--pubkey=$(junod tendermint show-validator)

Display your validator address:

junod keys show junokey --bech val | grep address

Now that we have a validator node running, we will self-delegate 1 junox (1,000,000 ujunox) to ourselves. MAKE SURE to replace <junovaloper> with your validator address returned above.

junod tx staking delegate <junovaloper> 1000000ujunox --from junokey  --gas-prices 0.025ujunox --gas auto --gas-adjustment 1.3

#replace <junovaloper>

Congrats, you have just created a validator!!! This is a great point to pause and take a break. One thing we’ve learned is that entering commands into a terminal requires great focus and precision.
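Before you step away, you can optionally confirm the validator exists on chain (replace <junovaloper> with your validator address from above):

junod q staking validator <junovaloper>
#look for status: BOND_STATUS_BONDED once you're in the active set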

Setting Up Node Monitoring Services Prometheus and Grafana

Installing Prometheus and Node Exporter

Prometheus is a monitoring and alerting program. Prometheus is a server we run to collect data using scrape configurations, which is a fancy way to say we collect a subset of data from the total aggregate by specifying what we’d like. Prometheus stores time-stamped data on disk, which our grafana server will query. We’ll download the prometheus release and extract it.

wget https://github.com/prometheus/prometheus/releases/download/v2.42.0/prometheus-2.42.0.linux-386.tar.gz

tar -xvf prometheus-2.42.0.linux-386.tar.gz

rm -r prometheus-2.42.0.linux-386.tar.gz

mv prometheus-2.42.0.linux-386 prometheus

Next, we must set prometheus to true in our validator’s config.toml so we can export our validator’s node data.

sed -i.bak -E "s|^(prometheus\s*=\s).*$|\1\"true\"|" ~/.juno/config/config.toml

Since we are running prometheus on the same machine as our validator node, we have to change our validator’s gRPC address in app.toml. Prometheus runs on port 9090, which is the same port as the node’s gRPC server. We can’t run two services on one port, so we must split them by moving gRPC to 9092.

sed -i.bak -e "s|^address *= \"0\.0\.0\.0:9090\"$|address = \"0\.0\.0\.0:9092\"|" ~/.juno/config/app.toml

systemctl restart junoval.service

We’ll execute the same actions to unpack the node_exporter that will be used to monitor the computer’s internal performance.

wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz

tar -xvf node_exporter-1.5.0.linux-amd64.tar.gz

rm -r node_exporter-1.5.0.linux-amd64.tar.gz

mv node_exporter-1.5.0.linux-amd64 node_exporter

Since we have prometheus and the node_exporter unpacked, we need to create jobs for prometheus to monitor. Because we’ve changed our validator’s config files, we’ll need to adjust our prometheus server to aggregate data from our validator.

From top to bottom here is what the following prometheus config file is saying.

  • We determine an interval to scrape data
  • We determine the evaluation interval for prometheus to view the alerting rules.

Our main focus is under Scrape_configs. You’ll find the same schema (outline) for each scraper.

  • A job name that identifies the operation
  • Possibly a path to the metrics which we won’t use
  • The target we want to monitor.

The observant individual will notice that we are targeting ports 26660, 9100, and 9090. Our validator’s data is on 26660, our node exporter’s is on 9100, and finally our prometheus server’s is on 9090.
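Once the service files below are up and running, you can verify each target is actually serving metrics (an optional check; each should return a wall of metric lines):

curl -s localhost:26660/metrics | head

curl -s localhost:9100/metrics | head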

tee ~/prometheus/prometheus.yml > /dev/null << 'EOF'
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: "node"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # cosmos monitoring
  - job_name: "Validator"
    static_configs:
      - targets: ["localhost:26660"]
        labels:
          group: "Validator"
EOF

After we’ve configured our prometheus.yml we need to write a service file to run prometheus as a background process and start it on machine boot. We’ll set our network service target to initialize our prometheus.service file after the network is online. Our ExecStart will call the prometheus binary with flags pointing to the config we just altered, storage, and more.

sudo tee /etc/systemd/system/prometheus.service > /dev/null << EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=$USER
Type=simple
ExecStart=/home/$USER/prometheus/prometheus \
--config.file /home/$USER/prometheus/prometheus.yml \
--storage.tsdb.path /home/$USER/prometheus/ \
--web.console.templates=/home/$USER/prometheus/consoles \
--web.console.libraries=/home/$USER/prometheus/console_libraries
Restart=always
RestartSec=3
LimitNOFILE=4096
[Install]
WantedBy=multi-user.target
EOF

Then we’ll need to initiate our prometheus service file with:

systemctl daemon-reload

systemctl enable prometheus.service

systemctl restart prometheus.service
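To confirm prometheus came up cleanly (assuming the default port), it ships a simple health endpoint:

curl -s localhost:9090/-/healthy
#should print something like: Prometheus Server is Healthy.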

We’ve implemented prometheus and a service file to run it. Next on the docket, we have to write a service file for the node_exporter.

sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node_Exporter
After=network-online.target
[Service]
User=$USER
ExecStart=/home/$USER/node_exporter/node_exporter
Restart=always
RestartSec=3
LimitNOFILE=4096
[Install]
WantedBy=multi-user.target
EOF

Then we’ll need to initiate our node_exporter service file:

systemctl daemon-reload

systemctl enable node_exporter.service

systemctl restart node_exporter.service

Data is currently being scraped from ports 26660 and 9100, but it isn’t useful to us yet. To make it useful, we require a visualizer such as Grafana. Grafana is our one-stop shop for data analytics. We will install a local grafana server to run our local grafana browser.

Install Grafana

sudo apt-get install -y adduser libfontconfig1

wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.3.6_amd64.deb

sudo dpkg -i grafana-enterprise_9.3.6_amd64.deb

systemctl daemon-reload

sudo systemctl enable grafana-server.service

systemctl start grafana-server

Our Grafana local server is running now. Open a web browser, but keep your terminal open! In your terminal, paste:

echo $HOST

#Paste the return from our $HOST command in the browser and add :3000 at end
#It should look similar but not the same as 190.162.1.9:3000
  1. Once you see the grafana screen, type admin for the username and admin for the password
  2. Navigate to the settings and click data sources
  3. Click add data source
  4. Click prometheus as a data source
  5. Name your data source whatever you’d like
  6. Paste http://localhost:9090 into URL
  7. Change the http method under Alerting to GET
  8. Click save and test

Navigate your way to Dashboard then import.

  1. Copy and paste the Grafana Dashboard ID 11036, click prometheus for the data source, then load to complete importing
  2. Navigate your way to Dashboard then import again
  3. Copy and paste the Grafana Dashboard ID 1860, click prometheus for the data source, then load to complete importing
  4. Head to the dashboard panel and click on your new dashboards!!

Awesome, we have a validator running with a monitoring service and cosmovisor. You could implement tmkms soft sign, but that is out of the scope of these papers.

IF YOU DON’T HAVE A YUBIHSM2 THIS IS WHERE YOUR JOURNEY ENDS.

Setting up YUBIHSM 2 and TMKMS

Yubihsm 2

Yubihsm 2 is a Hardware Security Module (HSM) which is a fancy way to say it stores keys in a physical location separate from the computer. HSMs provide hashing and asymmetric/symmetric key cryptography to protect sensitive information.

sudo apt update

sudo apt install build-essential

sudo apt install libusb-1.0-0-dev -y

gcc --version
#should return something similar to gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

We’ll need to install rust and set its path. Proceed with the installation by typing 1 and pressing enter.

curl --proto '=https' --tlsv1.3 https://sh.rustup.rs -sSf | sh

echo "PATH=$PATH:/home/$USER/.cargo/bin" >> ~/.profile

source ~/.profile

To find our system’s Ubuntu version, we’ll enter:

lsb_release -a

Go to https://developers.yubico.com/YubiHSM2/Releases/ and copy the link that corresponds to your ubuntu version. Below is the version we’re running. If you run a different ubuntu version, you’ll need to replace the command with wget <your-link>.

wget https://developers.yubico.com/YubiHSM2/Releases/yubihsm2-sdk-2022-06-ubuntu2204-amd64.tar.gz

sudo tar -xvf yubihsm2-sdk-2022-06-ubuntu2204-amd64.tar.gz

rm -r yubihsm2-sdk-2022-06-ubuntu2204-amd64.tar.gz

sudo -s

You’ll notice above we’re using sudo, but it is slightly altered. The -s flag lets us run our terminal as root, meaning that we’re acting as the royalty in the analogy described at the beginning of these papers. We require the power involved with sudo because we’ll be making changes to the udev rules of our system. Udev manages device events for the system’s kernel. The linux kernel is the interface between the computer’s hardware and its processes.

cd yubihsm2-sdk

apt install ./libykhsmauth1_2.3.2_amd64.deb && apt install ./libyubihsm-http1_2.3.2_amd64.deb && apt install ./libyubihsm1_2.3.2_amd64.deb && apt install ./libyubihsm-usb1_2.3.2_amd64.deb && apt install ./yubihsm-auth_2.3.2_amd64.deb && apt install ./yubihsm-pkcs11_2.3.2_amd64.deb && apt install ./yubihsm-shell_2.3.2_amd64.deb && apt install ./yubihsm-wrap_2.3.2_amd64.deb && apt install ./yubihsm-setup_2.3.1-1_amd64.deb && apt install ./yubihsm-connector_3.0.3-1_amd64.deb

Ignore the error. The dependencies we need for the yubihsm 2 are installed. The order of the dependencies being installed matters. Do not change them!

Below we’ll make a copy into the /etc/udev directory, which will override the rule set in the /lib/udev directory.

ls /lib/udev/rules.d/

cp /lib/udev/rules.d/70-yubihsm-connector.rules /etc/udev/rules.d/

For the following command, change OWNER="yubihsm-connector" to OWNER="<your system’s username>" like this:

After performing the following command, hit i to edit, enter your username, next hit esc twice, and type :wq, and then enter.

vim  /etc/udev/rules.d/70-yubihsm-connector.rules

We’ll reset the udev rules, and we’re good. We need to change the udev rules because we’re using a service account to run our validator instead of a root user account. To access the yubihsm 2 with the current rules, we’d have to be in super user or sudo. After changing the access rules, we’ll be able to operate the yubihsm 2 from our service account, in other words, USER account.

udevadm control --reload-rules && udevadm trigger

exit

To initialize our connector to the yubihsm 2 we need the serial number, which we’ll find through the operation below.

lsusb

After finding the Bus and Device of your yubihsm 2 enter those in the below prompt. Ours would look like this: lsusb -s 001:008 -v | grep iSerial

lsusb -s bus:device -v | grep iSerial

#change bus and device!

Write down the yubihsm serial, which looks similar to but not the same as 0050160574 and is 10 digits long. Type that written-down serial number where <serial> is below.

tee ~/hsm_list.config > /dev/null << EOF
connector_path=/usr/bin/yubihsm-connector
hsm1_serial=<serial>
hsm1_listen=0.0.0.0:2222
EOF

#change <serial> to your serial number

We’ll use a startup script from here to initialize the connector.

vim ~/startup_hsm.sh 

After performing the above command, hit i to edit, paste in the info below, next hit esc twice, and type :wq, and then enter.

#!/bin/bash
configfile='./hsm_list.config'
. $configfile
hsmcount=$(grep -c 'serial' $configfile)
for (( c=1; c<=$hsmcount; c++ ))
do
  hsmserial="hsm${c}_serial"
  hsmlisten="hsm${c}_listen"
  $connector_path --listen ${!hsmlisten} --serial ${!hsmserial} &
done

We’ll make the script executable, then run it.

chmod +x startup_hsm.sh

./startup_hsm.sh

We’ll confirm the yubihsm is functioning:

sudo yubihsm-connector install #ignore the error for this, it's a validity check
ps -ef | grep yubi

The Yubihsm 2 is fully set up now, and we’ll continue to TMKMS.

Tendermint Key Management Service

Before explaining how to set up TMKMS, we should explore why we want to pair an HSM with TMKMS in the first place. For cosmos blockchains, all validators in the active set submit block proposals and vote on them, known as prevotes and precommits. We explained these more in-depth at the beginning of these papers.

From a broader perspective, delegated Byzantine Fault Tolerant networks require that a block receive ⅔ of votes to finalize. According to the CometBFT documentation and old Tendermint docs, the assumption is that these networks are safe from forks because two conflicting blocks cannot each gather ⅔ of the votes.

Yet double voting would allow byzantine proposers to create two conflicting blocks that fork the network. Network partitions are thus the enemy of dBFT and must be avoided and punished. How does this apply to the HSM and TMKMS?

In a scenario where your validator has been overtaken, someone other than you has access to your machine. If your priv_validator_key.json is stored in the default location at ~/.juno/config/priv_validator_key.json, your validator is compromised. The malicious actor could double-sign on your behalf, and you’d be screwed!!!

An HSM will not allow the private keys it stores to be extracted, but it can use those keys to sign. TMKMS asks the HSM to sign. TMKMS enacts double-signing protection and communicates with the HSM through an encrypted channel to prevent man-in-the-middle attacks. If your host becomes compromised, TMKMS will prevent double signing.

We’ll need to download and extract tmkms, initialize the service, direct tmkms to work with a yubihsm, and write to tmkms.toml.

git clone https://github.com/iqlusioninc/tmkms.git && cd tmkms

cargo build --release --features=yubihsm

mv ./target/release/tmkms ~/.cargo/bin

tmkms init ~/tmkms/

cargo install tmkms --features=yubihsm-server --force

rm -r tmkms.toml

We have completed every step needed to write the tmkms.toml. Below is a configuration for the file.

Beginning our explanation at [[chain]]:

  • We’ll find our chain-ID
  • The key format which is usually bech32

Recall when we spoke of the three addresses (account, valoper, and valcons) for our validator. You’ll notice the account_key_prefix is junopub versus the account address prefix juno. Similar logic applies to the junovalconspub consensus key prefix. The distinction here is easily overlooked but incredibly important. After the key_format comes the state_file.

  • The state_file saves what round, and what step in that round, the last signing operation occurred. The state_file is what manages state transitions.

Moving onward to [[providers.yubihsm]] here we are defining the requisite parameters to access our yubihsm 2.

  • Under #change keys here, we provide the Object ID and its password, which is defaulted to password, to use the yubihsm 2.
  • Continuing downward, you’ll find keys with the object ID, and chain id.
  • The connector_server will be at the same address and port for any implementation of TMKMS.

Under [[validator]] we’ll see the:

  • Chain id
  • The address for the validator

The address refers to the local port of our validator, named priv_validator_laddr, through which we send and receive tmkms information. Following the addr, we’ll notice a:

  • secret_key which is used to authenticate the yubihsm 2 to the validator
  • Succeeding secret_key is protocol_version.

protocol_version is the version of Tendermint (CometBFT) that the chain is running on.

tee ~/tmkms/tmkms.toml > /dev/null << EOF
# Tendermint KMS configuration file


## Chain Configuration


### Juno Testnet Network


[[chain]]
id = "uni-6"
key_format = { type = "bech32", account_key_prefix = "junopub", consensus_key_prefix = "junovalconspub" }
state_file = "/home/$USER/tmkms/state/priv_validator_state.json"
## Signing Provider Configuration


### YubiHSM2 Provider Configuration


[[providers.yubihsm]]
adapter = { type = "usb" }
#change keys here:
auth = { key = 1, password="password"}




#serial_number = "0123456789" # serial number of a specific YubiHSM to connect to (optional)
keys = [
{ key = 1, type = "consensus", chain_ids = ["uni-6"] },
]
connector_server = { laddr = "tcp://127.0.0.1:12345"}


## Validator Configuration


[[validator]]
chain_id = "uni-6"
addr = "tcp://127.0.0.1:26659"
secret_key = "/home/$USER/tmkms/secrets/kms-identity.key"
protocol_version = "v0.34"
reconnect = true
EOF

Now we need to generate our keys.

tmkms yubihsm setup

This command will return something that looks like this:

  • key 0x0001: admin: double section release consider diet pilot flip shell mother alone what fantasy much answer lottery crew nut reopen stereo square popular addict just animal
  • authkey 0x0002 [operator]: kms-operator-password-1k02vtxh4ggxct5tngncc33rk9yy5yjhk
  • authkey 0x0003 [auditor]: kms-auditor-password-1s0ynq69ezavnqgq84p0rkhxvkqm54ks9
  • authkey 0x0004 [validator]: kms-validator-password-1x4anf3n8vqkzm0klrwljhcx72sankcw0
  • wrapkey 0x0001 [primary]: 21a6ca8cfd5dbe9c26320b5c4935ff1e63b9ab54e2dfe24f66677aba8852be13

DO NOT SAVE THESE KEYS ON YOUR COMPUTER instead write them down and put them in a safe location.

We’ll access our tmkms.toml file and change the auth key password to our mnemonic, which should look something like this:

Make certain that it is entered correctly and is on one line! Perform the command and hit i to edit, then paste in your admin mnemonic, next hit esc twice, and type :wq, and then enter.

vim tmkms.toml

Now we’ll import the private validator json to the yubihsm.

tmkms yubihsm keys import -i 1 ~/.juno/config/priv_validator_key.json
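To confirm the import landed on the device, tmkms can list the keys it sees (an optional check; if tmkms doesn’t find your config automatically, point it at the file with -c):

tmkms yubihsm keys list -c ~/tmkms/tmkms.toml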

We’ll create the state file we mentioned when forming the tmkms.toml file by performing the action below. A frustrating realization is that this priv_validator_state.json is formatted differently than the one located at ~/.juno/data, but that is just a nitpick.

tee ~/tmkms/state/priv_validator_state.json > /dev/null << EOF
{
  "height": "0",
  "round": "0",
  "step": 0,
  "block_id": {
    "hash": "",
    "part_set_header": {
      "total": 1,
      "hash": ""
    }
  }
}
EOF

A service file for tmkms to run on boot is needed. Below, the only difference from other service files we’ve written is we start the binary tmkms with start and then a flag -c to tell tmkms where the toml config is.

sudo tee /etc/systemd/system/tmkms.service > /dev/null << EOF
[Unit]
Description=Tendermint Key Management System
After=network-online.target
[Service]
User=$USER
ExecStart=/home/$USER/.cargo/bin/tmkms start -c /home/$USER/tmkms/tmkms.toml
Restart=always
RestartSec=3
LimitNOFILE=4096
[Install]
WantedBy=multi-user.target
EOF

We’ll initiate our tmkms service file with the following:

systemctl daemon-reload

systemctl enable tmkms.service

systemctl stop junoval.service

We have been waiting to say this word: Finally!!

We’ll comment out the private validator key file since we’ll be using the private validator key we imported to the yubihsm. Lastly, we’ll set priv_validator_laddr to listen on port 26659, so tmkms can communicate.

sed -i.bak -e "s|^priv_validator_key_file *=.*|#priv_validator_key_file = \"config/priv_validator_key\.json\"|" ~/.juno/config/config.toml

sed -i.bak -e "s|^priv_validator_state_file *=.*|#priv_validator_state_file = \"data/priv_validator_state\.json\"|" ~/.juno/config/config.toml

sed -i.bak -E "s|^(priv_validator_laddr\s=\s).*$|\1\"0.0.0.0:26659\"|" ~/.juno/config/config.toml

systemctl restart tmkms.service

systemctl restart junoval.service

Voilà you have just created a validator with cosmovisor, grafana, prometheus, tmkms, and yubihsm 2.

WARNING: make sure you move the priv_validator_key.json from ~/.juno/config/ to a usb and save that usb. If the priv_validator_key.json is left on the device, it defeats the purpose of tmkms and yubihsm.

MISC: if you want to see what your validator is doing, open two terminals and type

journalctl -u junoval.service -f # in one

journalctl -u tmkms.service -f # in the other

As always I’m open to improving this documentation and suggestions are welcome. One thing I’m unsure of is whether it is proper security protocol to leave the yubihsm seed phrase as the password in tmkms.toml.

Thanks for reading fren :)

--

rMalakaib
Oregon Blockchain Group

twitter: @rmalakaib | “Interchain Federalist — Interop Optimist” | “I have perceiv’d that to be with those I like is enough…” — W.W.