Tame the Behemoth or how to run an archive blockchain node

Protofire Blog · Jun 16, 2023


Agenda:
1. Introduction
2. Node Types by Usage
3. First Challenge: Disk Performance
3.1. How many IOPS should there be?
3.2. Filecoin RPC Endpoint as a Node Setup Example
4. Second Challenge: Archive Node’s Disk Size
5. Node Types by Ledger-state
6. Modern Blockchain Archive Node Size
7. Lessons Learned: Filecoin Archive Node
8. Go Right Through: Use io2 Block Express
9. Choose the Hard Way: gp3 RAID0 Volume
10. Multi-EBS Snapshot Creation and Recovery
11. The Comparison of io2 vs. gp3 Pricing
12. Conclusion
13. Authors


1. Introduction

We at Protofire have been running nodes for various blockchains like Ethereum, Filecoin, and Avalanche since 2017. Through the years, we have gained tons of experience in handling technology and infrastructure-related issues. Today, we want to share our advice about the two most exhausting and painful challenges when running blockchain nodes, and how you can tame them to fit your project:

  • Disk performance
  • Disk size

Disk performance can be a bottleneck when processing a huge volume of requests; we will discuss it first. Disk size becomes a problem when your node needs to run at terabyte scale. This is true not only for blockchain ecosystems and Web3, but for many Web2 areas too. Both topics have many pitfalls. In this article, we present our most recent use cases and share best practices you can learn from.

Classic artworks like the one below by William Blake have inspired us, as software nodes that work with enormously large volumes of data can resemble beasts.

Behemoth and Leviathan by William Blake

Let’s tame the creatures!

First off, let’s brush up on the basics of node types, each of which comes with its own set of operational challenges.

2. Node Types by Usage

Typical blockchain node types can be divided into:

  • RPC endpoints where users read blockchain information and submit transactions, usually via POST HTTP requests.
  • Validators that support a blockchain network by constantly creating new blocks, processing transactions, and getting rewards, often involving staking.
  • Custom configurations that work with blockchain data, such as indexers, p2p providers, etc. Depending on the blockchain environment they target, these nodes can face a wide range of disk performance and disk size challenges.

Let’s go through the named challenges and address them in the context of archival nodes.

3. First Challenge: Disk Performance

Blockchain nodes usually rely on fast, parallel read-and-write operations. This means you cannot use cheap HDDs, because they are too slow at I/O. Solid-state drives (SSDs) are the weapon of choice.

Why an SSD? Imagine that you need to perform 100 read operations per second while also syncing blockchain state to disk. If your disk I/O is slow, the blockchain client may lag behind the current head of the chain and serve outdated data from earlier blocks. Depending on the node type, this can lead to the following common problems:

Scenario 1. A user has sent a transaction from their wallet to the blockchain through a lagging RPC endpoint. In this case, their wallet will update with a delay and show an older balance. The user may get confused and try to resend the transaction. This may lead to funds being spent twice, which is unacceptable UX. Eventually, this may drive an influx of angry users to your support channels. Serving a timely, consistent view of the chain is one of the most crucial requirements for an RPC node.

Scenario 2. You are a validator in a blockchain network. The lag leads to your node missing the latest blocks and messages, and the network slashes your validator by some percent of your stake. You will have to wait in jail until you are included back in the active validator set. Undoubtedly, nobody wants to have their initial pledge slashed.

Scenario 3. You run a bridge, an example of a custom configuration that uses several RPC endpoints across various blockchains. If one of them lags, the entire bundle of separate nodes can stop working correctly. Consequently, you’ll have to deal with downtime and angry users. Needless to say, UX will drop to the lowest grade.

3.1. How many IOPS should there be?

The volume of IOPS (input/output operations per second) depends on the chain, layer, and infrastructure combo used.

For example, the disk for a Geth client on Ethereum should be able to provide around 10K IOPS, according to Geth’s hardware requirements and a gist by Geth enthusiast yorickdowne. (In AWS terms, you should use EC2 instances that are EBS-optimized by default or support EBS optimization.)
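Before trusting a disk with a node, it is worth measuring what it can actually sustain. A minimal fio sketch for a 4 KiB random read/write test; the test file path and size are placeholders, and the numbers only approximate a blockchain client's access pattern:

```bash
# Rough IOPS sanity check with fio: 4 KiB random read/write at depth,
# similar in spirit to a blockchain client's access pattern.
# The test file path and size are placeholders; run it on the target volume.
sudo fio --name=chaindata-iops \
  --filename=/data/fio-testfile --size=10G \
  --rw=randrw --rwmixread=70 --bs=4k \
  --iodepth=64 --numjobs=4 --ioengine=libaio --direct=1 \
  --runtime=60 --time_based --group_reporting
# Check the reported read/write IOPS; for a Geth-like workload you want a
# sustained total comfortably above ~10k.
```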

However, sometimes even SSDs are not enough. In this case, you will have to use NVMe drives.

Learn how this works in a real-world use case below, where we describe how Protofire operates the most popular public RPC endpoints for Filecoin, Glif Nodes.

3.2. Filecoin RPC Endpoint as a Node Setup Example

Protofire operates an RPC endpoint for the Filecoin ecosystem at a scale of around 42 million requests per day or 472 requests per second (rps) with spikes of up to 1,500 rps. Our current setup includes the following types of AWS nodes (although we are also currently exploring other cloud providers):

  • r6gd.4xlarge — 20,000 IOPS
  • m5d.8xlarge — 30,000 IOPS
  • r5ad.8xlarge — 30,000 IOPS

These instances use Non-Volatile Memory Express (NVMe) disks, because this type of disk is physically attached to the instance machine and delivers the best IOPS performance. They can handle simultaneous writes of chain state and a huge number of reads from disk.
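For reference, preparing such an instance-store disk is straightforward. A minimal sketch, where the device name and mount point are assumptions and vary by instance type:

```bash
# Minimal sketch: format and mount an NVMe instance-store volume for chain
# data. The device name varies by instance type; check with `lsblk`.
DEVICE=/dev/nvme1n1          # assumption: first instance-store volume
MOUNTPOINT=/mnt/chaindata

sudo mkfs.ext4 -F "$DEVICE"
sudo mkdir -p "$MOUNTPOINT"
sudo mount -o noatime "$DEVICE" "$MOUNTPOINT"
# Instance-store data is ephemeral (lost on stop/terminate), so the node must
# be able to resync or restore from a snapshot kept elsewhere.
```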

Additionally, we horizontally scale this infrastructure because in the RPC case, many parallel operations are necessary, and traffic should be load-balanced. Disks, in this case, become a solid foundation for the entire high-load system.

Wondering how we manage the setup? See our open-source Terraform code.

Our goal is to get the highest rps without failures. To reach it, we always consider the input requirements of the node and choose the disk type accordingly.

Now let’s discuss disk size.

4. Second Challenge: Archive Node’s Disk Size

Here comes the second challenge: disk size and problems attached to it.

Distributed ledger blockchain nodes are usually moderate in terms of space usage, e.g. a full Bitcoin ledger is around 568 GB. However, things get trickier when the blockchain has a distributed state machine like Ethereum.
And here starts the rabbit hole…

5. Node Types by Ledger-state

What is state? State is a complex collection of data that includes information about all accounts and balances, as well as the current machine state, which changes with every block according to established rules.

It should be noted here that nodes differ by ledger-state usage:

  • Light nodes do not store the ledger or state and mainly keep block header data. They are easy to spin up; however, they must be connected to a full node.
  • Full nodes store only the snapshot of current state needed to validate new transactions. Blocks that are no longer needed can be constantly pruned:
    • In Ethereum, this means the state of roughly the last 128 blocks; a synced full node occupies around 650 GB, and constant pruning keeps the total storage close to that figure.
    • In Filecoin, this means the last 2,000 epochs, i.e. the last ~16.67 hours of chain data.
  • Archive nodes store all data starting from a given point and keep the history of all blockchain states:
    • A full archive node normally stores all data beginning with the chain's genesis, including full chain history and the entire block trace.
    • A partial archive node is also possible, starting from a particular date / block / epoch.

Let’s understand the difference with a simple example. Say you need to know a wallet’s balance at block N. Here is how the node types handle this in Ethereum (see the JSON-RPC sketch after the list):

  1. A light node will use the state of a full node it’s connected to as a source of truth.
  2. A full node will try to get this information from the state on disk, which may only include recent blocks.
  3. A full archive node will simply go to disk, get the information, and return it to the user. However, these nodes need to store the entire historical state on disk.
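To make the difference concrete, here is a hedged JSON-RPC sketch of that query in Ethereum; the endpoint URL and address are placeholders, and 0x1E8480 is block 2,000,000 in hex:

```bash
# Ask for an address balance at historical block 2,000,000 (0x1E8480).
# The RPC URL and address below are placeholders.
curl -s -X POST https://your-rpc-endpoint.example \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"eth_getBalance",
       "params":["0x742d35Cc6634C0532925a3b844Bc454e4438f44e","0x1E8480"]}'
# A pruned full node typically returns an error such as "missing trie node"
# for state this old; an archive node answers for any historical block.
```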

Storing state on disk is where the behemoth is hidden. This can cause a lot of challenges for node operators, because of the sheer size that archival nodes can grow to.

6. Modern Blockchain Archive Node Size

As explained above, archive nodes need to store historical state on disk. However, disk space is often a bottleneck for archive node operators. Why? Take a look at the figures below to understand the challenge.

These are the sizes of some full ledger-state chains, i.e. archive nodes, as of 2022 (source):

  • Solana: ~100TB
  • Filecoin: ~37 TB
  • Harmony: ~20 TB
  • Ethereum: ~13 TB
  • Polygon: ~16 TB
  • BNB Smart Chain: ~7 TB
  • Fantom: ~4 TB
  • Avalanche: ~3 TB

As you can see, these figures are at the terabyte scale, and this often causes problems, because cloud providers have the following single-disk size limits:

  • AWS — 16 TB for EBS GP3
  • AWS — 64 TB for EBS IO2 Block Express
  • Azure — 32 TB for Disk storage
  • GCP — 64 TB for Persistent Disk
  • OVH — 4 TB for Block Storage

One way around these limits is a logical volume that combines multiple disks, but you need to know how to handle it correctly on your platform. The Protofire team has hit many pitfalls along the way while handling historical blockchain state. Below, we share how we have done it for Filecoin as a case study.

7. Lessons Learned: Filecoin Archive Node

To set the stage, this was the context of our RPC operations for Filecoin:

  • AWS as a cloud provider
  • Archival chain state is 37 TB and growing

As with everything in the AWS toolset, the problem of huge AWS disks can be solved in two ways:

  • The easy way: use io2 Block Express EBS disks.
  • The hard way: experiment with Multi-EBS RAID0 and conclude whether it fits into the setup or not.

So let’s take a look at both solutions one by one.

8. Go Right Through: Use io2 Block Express

Using io2 Block Express was the easiest way to handle growing blockchain state, but it has several limits and disadvantages.

io2 Block Express was our first go-to idea when we had to handle more than 16 TB of data. This is because of its clear advantages:

  • An io2 Block Express disk has 64 TB as a limit, which is certainly better than 16 TB on gp3.
  • We can adjust the IOPS level according to our needs.
  • AWS snapshots are available as-is for this disk type.

However, we had a number of concerns due to io2 Block Express’ disadvantages:

  • The max size is limited to 64 TB, which is nice but still not enough for behemoths like an archival Solana node at 100+ TB.
  • The instance types were an issue because you must use the following instance families: C6a, C6in, C7g, Inf2, M6a, M6in, M6idn, M7g, R5b, R6in, R6idn, R7g, Trn1, Trn1n, X2idn, X2iedn.
  • Region availability was also a constraint at the time, since io2 Block Express was offered only in: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), and Europe (Stockholm).
  • Last but not least, the price for io2 Block Express disk usage is higher than that for gp3. For pricing, see the table below.

But earlier we mentioned the 16 TB limit for EBS gp3 disks. This is where RAID0 kicks in: combining several gp3 volumes lets us compare the two options on price and disaster-recovery strategy.

9. Choose the Hard Way: gp3 RAID0 Volume

Cost was the main motivator for us to try the RAID0 setup.

We also read the great AWS user guide and blog posts on RAID0 from top to bottom to learn how to use it.

We thoroughly examined the tool, assessed it for the risks involved, and outlined a precise to-do list with the action items for the team to handle RAID0.

Here are the steps that we took:

1. We used Terraform code to create the EBS volumes and to update the disk count and disk size. Here is an example of a RAID0 substrate built from 4 gp3 EBS volumes, where each EBS volume provides the following:

  • 10,240 GB space
  • 3,000 IOPS
  • 125 MB/s throughput
  • Two tags:
    • tenant tag — used as the connecting point between the RAID0 disks and the EC2 instance
    • PartNumber tag — the sequence number of the disk in the RAID0 array
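For illustration, roughly the same volume could be provisioned with the AWS CLI as sketched below; the availability zone and tenant value are placeholders, and our real setup does this via the Terraform code linked above.

```bash
# Hypothetical CLI equivalent of one RAID0 member volume; our production
# setup creates these via Terraform. AZ and tag values are placeholders.
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --volume-type gp3 \
  --size 10240 \
  --iops 3000 \
  --throughput 125 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=tenant,Value=filecoin-archive},{Key=PartNumber,Value=1}]'
# Repeat (or loop) with PartNumber=2..4 for the remaining RAID0 members.
```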

2. We applied a user-data bash script that handles the creation and/or resizing of the RAID0 array, taking the tenant tag as input. Here is an example of its usage in our current setup.
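We cannot reproduce the whole script here, but a minimal sketch of its core idea, assembling an LVM RAID0 (striped) logical volume from the attached EBS volumes, might look like the following. Device names, the volume group name, and the mount point are assumptions; the real script also discovers disks via the tenant tag and handles resizing.

```bash
# Minimal sketch of the core of such a user-data script (assumptions:
# /dev/nvme1n1..nvme4n1 are the data EBS volumes; the real script finds
# them by the tenant tag and also handles resizing an existing array).
DEVICES=$(lsblk -dpno NAME | grep -E '/dev/nvme[1-9]n1')   # skip the root disk
COUNT=$(echo "$DEVICES" | wc -l)

sudo pvcreate $DEVICES                       # mark each EBS volume as an LVM PV
sudo vgcreate chain-vg $DEVICES              # group them into one volume group
sudo lvcreate --type raid0 -i "$COUNT" \
  -l 100%FREE -n chain-lv chain-vg           # stripe (RAID0) across all members

sudo mkfs.ext4 /dev/chain-vg/chain-lv
sudo mkdir -p /data && sudo mount /dev/chain-vg/chain-lv /data
```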

3. The crucial part of the setup is the code for snapshot creation across multiple Amazon EBS volumes:

  • It creates simultaneous snapshots across all EBS volumes with the same tenant tag.
  • Takes a snapshot daily.
  • Stores snapshots for 7 days only.
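For illustration only, a single crash-consistent multi-volume snapshot set can be taken with the AWS CLI as sketched below; the instance ID is a placeholder, and the daily schedule with 7-day retention lives in our automation rather than in this one-off call.

```bash
# Hypothetical one-off equivalent: a crash-consistent snapshot set across all
# EBS volumes attached to the node (instance ID is a placeholder; scheduling
# and the 7-day retention are handled by our automation, not by this call).
aws ec2 create-snapshots \
  --instance-specification InstanceId=i-0123456789abcdef0,ExcludeBootVolume=true \
  --description "filecoin-archive RAID0 members" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=tenant,Value=filecoin-archive}]'
```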

Now, a few words must be said about the multi-EBS snapshot creation and recovery because they complicate the solution even more.

10. Multi-EBS Snapshot Creation and Recovery

If you have a multi-EBS volume in a production environment, you should initialize all disks in the logical volume.

In our case, we had to initialize 4 gp3 EBS disks in RAID0 instead of one io2 Block Express EBS volume.

AWS has tooling that helps here: Amazon Data Lifecycle Manager allows you to snapshot multiple drives at the very same time, so the RAID array always stays consistent. Additionally, if you combine it with Amazon EBS fast snapshot restore, you eliminate the latency of the first I/O operation on each block after a restore.

Volumes created using fast snapshot restore instantly deliver all of their provisioned performance. The trade-off is that enabling fast snapshot restore takes about 60 minutes per TiB to optimize a snapshot of a single disk.
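A hedged sketch of enabling fast snapshot restore on the snapshots of the RAID0 members; the snapshot IDs and availability zone are placeholders, and FSR is billed per DSU-hour while it stays enabled:

```bash
# Hedged sketch: enable fast snapshot restore for the RAID0 members'
# snapshots in the AZ where the node runs (IDs and AZ are placeholders;
# FSR is billed per DSU-hour while it stays enabled).
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0aaa1111bbb2222cc snap-0ddd3333eee4444ff
```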

Here, the more, the merrier: more disks mean smaller per-disk snapshots and therefore less time until all of them are optimized, since the per-disk optimizations run in parallel. Roughly:

Optimization time ≈ (total size in TiB × 60 minutes) / number of disks

For example, a 40 TiB state on a single volume takes about 40 hours to optimize, while the same state split across 4 volumes takes about 10 hours.

Now, let’s review the truly vital part of the comparison — the pricing difference between multi-EBS gp3 volume and io2 Block Express.

11. The Comparison of io2 vs gp3 Pricing

We use the following inputs for both types of disk (FYI: our reference is the AWS EBS pricing table):

  • 1 month as the pricing period
  • io2 Block Express — one disk of 40 TB (40,960 GB)
  • gp3 — 4 disks of 10 TB (10,240 GB) each
  • 3,000 IOPS per disk
  • Standard disk throughput:
    • gp3 — 150 MB/s
    • io2 Block Express — 4,000 MB/s
  • Snapshot pricing per GB at $0.05/GB-month
  • We are storing snapshots for 30 days for this example
  • Delta of new/changed data daily is 150GB — this is important to know because AWS charges you at full price for the 1st snapshot, and only for deltas in the subsequent ones
  • 1 Availability Zone
  • Fast snapshot recovery price is $0.75 per 1 DSU hour on each snapshot and in each AZ
  • Fast snapshot recovery takes 60 minutes per TiB to optimize a snapshot on one disk

Comparing the resulting costs, all categories come out pretty much the same except for disk pricing, where gp3 is roughly 40% cheaper than io2 Block Express.
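As a back-of-the-envelope check of the storage line item only, assuming the public us-east-1 list prices around the time of writing (roughly $0.08 per GB-month for gp3 and $0.125 per GB-month for io2 Block Express; provisioned IOPS on io2 are billed on top):

```bash
# Rough storage-only cost comparison; prices are assumptions based on
# public us-east-1 list prices and will drift over time.
SIZE_GB=40960                                            # ~40 TB of chain state
echo "gp3:  $(echo "$SIZE_GB * 0.08"  | bc) USD/month"   # ~3,277 USD
echo "io2:  $(echo "$SIZE_GB * 0.125" | bc) USD/month"   # ~5,120 USD
# 0.08 / 0.125 = 0.64, i.e. gp3 storage is roughly 35-40% cheaper.
```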

Fast snapshot restore also finishes its initialization sooner for the multi-EBS setup than for a single io2 Block Express volume, as described in the snapshotting part, following the same formula:

Optimization time ≈ (total size in TiB × 60 minutes) / number of disks

The cherry on top for gp3 in multi-EBS mode is aggregate IOPS: the RAID0 array delivers the sum of the IOPS of its member disks.

All in all, io2 Block Express is a great go-to solution if you want results quickly. However, if you have longer-term plans to grow, gp3 in RAID0 is more painful to implement but the better tool to go with. Don’t worry, the effort pays off.

12. Conclusion

Now that we have made our way through the basics of archive node types and the two crucial challenges of disk performance and disk size, we can draw some conclusions. For an RPC endpoint, use an NVMe disk attached to the instance to handle the high number of requests per second.

It is reasonable to conclude that gp3 in multi-EBS is the better option for us in the case of blockchains with archival history at high TB scale. Multi-EBS gp3 volume proved to be more cost-effective, much faster in recovery, and more ‘spacious’ than io2 Block Express.

In our case, the biggest challenge was to deepen our experience with RAID0 and learn all nuances of supporting it.

As a reminder, here’s what we used for our disk setup:

  • gp3 in multi-EBS mode which is 40% cheaper than io2 Block Express.
  • Snapshots have the same price.
  • Fast snapshot restore costs the same, yet recovery is faster for the gp3 multi-EBS setup.
  • For engineers, the operational experience with io2 Block Express is better: once attached, it is ready for creating, updating, and snapshotting volumes, e.g. using ebs-csi-driver.
  • gp3 in a multi-EBS mode requires a deeper understanding of multiple areas like RAID0 setup, snapshots, fast recovery, automation for routine operations, monitoring, and support.
  • In practice, the space limit for gp3 in multi-EBS is around 8*16=128TB according to the guide. In theory, it can reach 27*16=432TB.
  • Space limit for io2 Block Express is only 64TB. This space is enough for some cases and can be a good starting point for a project.
  • The gp3 multi-EBS RAID0 bandwidth is the sum of the member volumes’ IOPS.

So that’s it.

If you are taking your first steps in Web3 and are unsure whether you will ever need even 30 TB, think again; and if you still believe you won’t, go for io2 Block Express.

But if you are looking for a well-balanced way to run a behemoth archive blockchain node, with competitive cost, disk parameters, and overall performance, we recommend investing your time and effort into multi-EBS LVM RAID0.

Thanks for reading! Should you need help with your infrastructure architecture, CI/CD, or monitoring, or are interested in talking with us, ping us on socials or drop an email to devops@protofire.io. Don’t worry, we are ready for the challenges.

Cheers!

13. Authors

Uladzislau Muraveika — Lead DevOps Engineer @ Protofire
Ales Dumikau — DevOps Engineer @ Protofire
