How to (not) run an Ethereum Archive Node — A Journey

Markus Keil
Published in slock.it Blog
Jan 18, 2019 · 7 min read

Here at Slock.it we provide RPC services for our developers.

You can think of it as an internal Infura.

We decided to run at least one archive node per Ethereum blockchain that we work with regularly.
That approach worked fine for smaller chains like Kovan, Ropsten or Tobalaba, but the Mainnet gave us a real headache.

For those chains, running an archive node just meant starting Parity with the proper command-line switches and waiting a few days, and we were ready to let the developers use it.

What is an archive node, anyways?

Before we start our journey I want to specify what an “archive node” actually is for us.

tl;dr: it’s --pruning=archive --fat-db=on --tracing=on (using Parity)

But what does this all mean? Is it important to enable all this?

For normal blockchain operations where you don’t want to dig into history or have all the data available at your fingertips, you don’t need an archive node.

So most users will be satisfied with a warp-synced pruned node, but if you’re doing R&D, it really helps to have everything readily available.

So let’s analyze the command line from above (a complete example invocation is sketched right after the list):

  • --pruning=archive -> This directs Parity to maintain all the states in the state-trie. Normally it only keeps the states for the last few blocks.
    So every time the global state changes (a balance changes or a contract’s storage is updated), the resulting state is kept on disk, which makes the database grow quite large over time.
  • --fat-db=on -> This doesn’t mean we put Parity’s internal database on a fast-food-only diet, but it has the same effect: it will roughly double the amount of data stored in the state database. The reason is that it stores additional information to be able to enumerate all accounts and storage keys that are on the chain.
  • --tracing=on -> This enables Parity’s transaction tracing, so you can get the EVM trace of each transaction without having to replay it.
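Putting it all together, a launch command along these lines is what we mean by “archive node” in the rest of this post. The data directory and the RPC flag are illustrative, not our exact service configuration:

```
# Illustrative only: adjust the data directory to your own disks.
# --jsonrpc-apis all is fine on an internal box, but don't expose such a node publicly.
parity \
  --pruning=archive --fat-db=on --tracing=on \
  --db-path /data/parity \
  --jsonrpc-apis all
```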

As you see, much of this is just trading disk space for expensive computation.

To clarify for the “Ethereum full nodes need 2 TB” maximalists: you can think of an archive node as a full node with a super-heavy cache enabled. A normal full node, such as a pruned Parity node, has all the data needed to recompute everything the archive node stores; we just want it faster, so we store and precompute absolutely everything.
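Two quick examples of the kind of query this buys you. The endpoint is Parity’s default local RPC port, and the address and transaction hash are just placeholders to adapt:

```
# Balance of an address at an old block (here block 0x200000, mined in 2016).
# An archive node answers instantly; a pruned node will report the state as unavailable.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_getBalance","params":["0xde0b295669a9fd93d5f28d9ec85e40f4cb697bae","0x200000"],"id":1}' \
  http://localhost:8545

# Stored EVM trace of a transaction, courtesy of --tracing=on.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"trace_transaction","params":["0x<transaction hash>"],"id":1}' \
  http://localhost:8545
```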

Now that we know what an archive node is and what it’s good for, let’s see how we can run one.

Episode I to III: The (bad) prequels

From the word on the Ethereum streets, we knew that an archive node would probably require a couple hundred gigabytes of storage and should be on SSDs and not on spinning rust.

That was easy to arrange. We got a hold of the most bang-per-buck SSD VPS we could find.

The Specs looked beefy enough: 10 Cores, 50 GB RAM and a whopping 1.2 TB of SSD storage. Should be plenty, right?

After we had started Parity with the flags mentioned above, we sat down and the first couple of thousand blocks flew by. 1000 blocks per second. Nice!

After about 3–4 days (this number comes from my faded memory, so don’t nail me down on it) we hit the Shanghai DoS attack and dropped to 10 blocks per second.

Meh!

We let it run a few more days, and after two weeks (still syncing) we got a warning from our monitoring about free disk space.

If I remember correctly, we were at 4 to 4.5 million blocks (or was it only 3.something?), but that doesn’t really matter. What mattered was that with only slightly more than half the chain synced, we had nearly used up the 1.2 TB of capacity.

So we looked around to see whether any other provider would give us terabytes of fast SSD storage on the cheap.

Sure, we could just have gone to AWS or Azure and paid them tons of blockchain money, but this would have meant waiting longer for the Lambos. No way! Lambos are important!

So we came up with the interesting idea of using a jerry-rigged hybrid solution.

(Disclaimer: Yes, I know we could have dug through the code of Parity and RocksDB to find out that this was kind of a stupid idea, but where’s the fun in that?)

The basic assumption of this idea was: All this data is there for historical purposes. Parity shouldn’t touch “old” data and should only work with the latest states.

So the plan was to build a hybrid storage solution from cheap VPSs: an SSD-based VPS with a smaller SSD in front, and a large SAS RAID-based HDD system with enough storage for the (presumably) cold data behind it.

We created this using an overlay file system where we mounted the volume of the SAS HDDs over the network as read-only and then overlaid the local SSD storage as an R/W layer.

This means that every write goes to the SSD, and every read is tried on the SSD first; if the data isn’t there, it is read from the network storage.
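For the curious, here is a minimal sketch of such an overlay setup. The mount points and the NFS export are made up for illustration; our actual setup differed in the details:

```
# Cold layer: the SAS HDD box, mounted read-only over the network (address is made up)
mount -t nfs -o ro 10.0.0.2:/parity-cold /mnt/cold

# Hot layer: the local SSD provides the writable upper directory
mkdir -p /mnt/ssd/upper /mnt/ssd/work /mnt/chaindata
mount -t overlay overlay \
  -o lowerdir=/mnt/cold,upperdir=/mnt/ssd/upper,workdir=/mnt/ssd/work \
  /mnt/chaindata   # point Parity's database path at this merged view
```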

So far so good. After we managed to get the hybrid running (this topic could easily fill another blog post), Parity worked just fine and started to sync blocks.

After a while, it started reading an awful lot of data from the network disk. We let it continue. Over time it ramped up in blocks-per-second sync speed, but then the smaller SSD (which, for various reasons, was not on the aforementioned SSD VPS) filled up and I had to move the newer files from SSD to HDD to make room for new stuff. It became a nice daily routine: stop Parity, move files, start again, wait for monitoring to complain, repeat.

But as it chewed through the blocks, it got painfully slow, to the point where the sync speed dropped to only 1–3 blocks per MINUTE. There was no chance we could catch up with mainnet at that rate.

We said: OK, this is pointless. Running an archive node on an inexpensive VPS was not an option.

Episode IV: A new hope!

After these attempts, we decided to build our own machine that would run our mainnet archive node.

The Ingredients for our mainnet archive node

From our experience, the main bottlenecks seemed to be IOPS and disk space.

And these were the final specs:

  • 4x 1TB Samsung 960 EVO NVMe SSDs
  • 1x Highpoint NVMe Raid card
  • A Ryzen 5 processor
  • 16 gigs of RAM
  • and a small 240 gig boot SSD

[Image: The RAID card with all 4 SSDs installed, an I/O beast]

The Samsung SSDs were put into a ZFS raidz configuration, so even if one of them died we’d still have the chain, and it gave us almost 3 TB of NVMe SSD space for the chain data.
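For reference, a pool like that can be created with something along these lines. The pool name, device names and the extra tuning knobs are illustrative rather than our exact configuration:

```
# Four 1 TB NVMe drives in a single raidz1 vdev: one drive may die,
# and roughly 3 TB stay usable for the chain data.
zpool create -o ashift=12 chaindata raidz1 \
  /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# Commonly recommended settings for database-style workloads
zfs set compression=lz4 chaindata
zfs set atime=off chaindata
```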

So we started Parity (as the only process on this box) and let it rip through the chain. The sync speeds were quite good: a 10 to 20 times increase over the first VPS.

But even after 3 weeks it was still not finished…

Episode V: The Empire strikes back

At the moment we’re still ~900k blocks behind the main chain (which was at around block 6,600,000 when I was writing this post). Sync rates have dropped to a measly 1–1.5 blocks/second since reaching the high-volume blocks of early 2018.

It is quite a mystery why it is so slow. Disk I/O sits at a constant 160 MB/s of read throughput (the NVMe RAID can easily deliver 2 GB/s of random reads), and CPU usage is very low.
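If you want to watch the same numbers on your own box, a couple of standard tools are enough (assuming a ZFS pool called chaindata, as in the sketch above):

```
zpool iostat -v chaindata 5   # per-device read/write bandwidth, refreshed every 5 seconds
iostat -xm 5                  # per-disk utilisation and MB/s (from the sysstat package)
htop                          # confirms the CPU is mostly idle
```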

One explanation we came up with is that RocksDB now has 1.2 TB to juggle, and Parity may need to read from that dataset once every block. It seems that each query has to crunch through a huge index to find what it’s looking for.

We thought we were defeated! We thought a Parity node with all the state history would not happen.

Episode VI: Return of the Jedi

We abandoned the idea for a couple of weeks. But then Afri Schoedon started writing about the differences between full and archive nodes, and that it is still possible to run an archive node. One just has to be patient. Very patient.

So we took another shot at this and resumed our archive sync in early December 2018. And what do you know? A couple of days before Christmas we finally achieved sync! The gift of archive sync!

So it is possible to have an archive Ethereum node running, and I guess you can get away with lower specs as well.

We accumulated 1.94 TB on our ZFS pool and have ~550 GB to spare.
Also, we started the sync in October (but paused for some weeks in between).

So it’s not a matter of hours or days; it’s a matter of weeks. But as stated above: unless you really need all the states (or just want the “because we can”), you’ll be much faster with a full node using pruned states.
