A recurring Ethereum discussion topic is the storage requirements for running an Ethereum node. Some will say that an Ethereum node uses several terabytes of storage while others insist that it is much less.
The truth is that one specific type of Ethereum node, the archive node, takes more than 2.3 TB of space (you can track their size here), while another type, the full node, uses less than 1/10 of that, about 180 GB. But wait, how can you call it full when there are nodes that are fuller? This is the source of great confusion, and others have given detailed technical explanations that I won’t repeat here. In a nutshell, full nodes have the entire history, every block and every transaction, all fully validated, while archive nodes have all of this plus the intermediate state of every account and contract at every block since genesis.
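As a quick sanity check on the "less than 1/10" figure, here is a minimal Python sketch. It assumes decimal units (1 TB = 1000 GB) and uses the rounded sizes quoted above, so the result is only approximate.

```python
def size_fraction(full_gb, archive_gb):
    """Return the full node's size as a fraction of the archive node's size."""
    return full_gb / archive_gb

# Rounded figures from the text: about 180 GB (full) vs. more than 2.3 TB (archive).
# Assumption: decimal units, so 2.3 TB is treated as 2300 GB.
fraction = size_fraction(180, 2300)
print(f"A full node uses about {fraction:.1%} of the archive node's space")
```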
Why an archive node?
Most users don’t need an archive node. With a full node, you can check your balances, sign and send transactions and even look at current Dapp data. What is missing is historical state data: you cannot, for instance, check your Ether balance from last month, although you can list all the ERC20 transactions you have made. All the data needed to recompute that balance is there, but it would be too slow to extract. The archive node speeds up the process by storing intermediate states, working like a cache.
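To make "historical state" concrete, here is a minimal Python sketch of a balance query through the standard `eth_getBalance` JSON-RPC method, which takes an address and a block (a hex block number or a tag such as `latest`). The helper names and the RPC URL in the usage note are assumptions for illustration. I would expect a request for an old block to succeed on an archive node and to fail on a pruned full node, but the exact error depends on the client.

```python
import json
import urllib.request

def balance_request(address, block="latest", request_id=1):
    """Build a JSON-RPC eth_getBalance payload for the given block."""
    # Block numbers are sent as hex quantities; tags such as "latest" pass through.
    block_param = hex(block) if isinstance(block, int) else block
    return {
        "jsonrpc": "2.0",
        "method": "eth_getBalance",
        "params": [address, block_param],
        "id": request_id,
    }

def get_balance_wei(rpc_url, address, block="latest"):
    """Send the request to a node and return the balance in wei."""
    payload = json.dumps(balance_request(address, block)).encode()
    req = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    if "error" in reply:
        # A pruned node is expected to land here for old blocks.
        raise RuntimeError(reply["error"])
    return int(reply["result"], 16)
```

For example, `get_balance_wei("http://localhost:8545", my_address, 7_000_000)` would ask a local node (assuming it exposes JSON-RPC on port 8545) for the balance of `my_address` as of block 7,000,000.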
Block explorers, Dapp dashboards, some wallet vendors, and chain analytics firms operate archive nodes. I personally run my own node to be able to quickly extract blockchain data necessary for analytics reports. I could use infura.io, but I find it too slow for my needs.
Some less-informed people regularly propagate falsehoods, such as:
- Only archive nodes have all blockchain data.
- An archive node needs another archive node to be able to sync.
- There are only a handful of archive nodes on the Ethereum network (some are saying there is only one!).
- You need a very powerful server to run a full node.
- Nodes that are synced in warp or fast mode are not full nodes.
So, in an attempt to settle this once and for all, I devised a little experiment that should prove that all the above statements are false: I will sync an archive node using only a warp-synced node. In other words, I will reconstruct 2.36 TB of archive data using only 180 GB of data as an input.
And I will do all this using an old PC at home, in my basement.
- Dell Optiplex 7020 (released in 2014)
- 16 GB RAM
- i5-4590 CPU @ 3.30GHz
- (1) 2TB MX300 and (1) 2TB MX500 SATA SSD, configured in LVM striping mode
- Ubuntu 18.04.2 LTS
- Parity Ethereum Client 2.3.5
- Docker 18.09.02 to run both nodes on the same server and allow fine-grained network control.
At current market price, this PC can be built for about USD 850.
Phase 1: Syncing the full node
To begin, I launched the full node with default settings and let it connect to the internet. It quickly discovered peers and started a warp-sync.
Parity warp-sync is very performant, and, just 90 minutes later, the node was synced to block 7,357,881, the latest at that moment.
Phase 2: Starting the archive node
Then I created the archive node, making sure it would not be able to talk to any Ethereum node other than the full node. A private, internal network was created within Docker to allow communication between the local nodes while preventing all communication between the archive node and the internet.
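One way such an isolated setup could be built is sketched below in Python, as a list of Docker CLI commands. This is an illustration, not the exact configuration used in this experiment: the network name, container names and image name are hypothetical placeholders. It relies on `docker network create --internal`, which creates a network with no external access, and on starting the archive container attached only to that network.

```python
import subprocess

def isolation_commands(network, full_node, archive_node, image, archive_args):
    """Return Docker CLI commands that isolate the archive node on an internal network."""
    return [
        # An --internal network has no route to the outside world.
        ["docker", "network", "create", "--internal", network],
        # Attach the already-running full node container to the internal network.
        ["docker", "network", "connect", network, full_node],
        # Start the archive node attached only to the internal network.
        ["docker", "run", "-d", "--name", archive_node,
         "--network", network, image, *archive_args],
    ]

def run_all(commands):
    """Execute each command, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    cmds = isolation_commands(
        "eth-internal", "full-node", "archive-node",
        "my-parity-image", ["--pruning", "archive"],
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

The commands are only printed here; calling `run_all(cmds)` would execute them on a machine where Docker is installed and a container named `full-node` is already running.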
To make things a little bit more interesting, I also disconnected the full node from the internet. So at this point, the full and the archive nodes are completely isolated from the internet.
To prove it, let’s look at the full node logs:
We can see in the above logs that the full node was in sync until 11:00:20, at block 7,366,747, when I disconnected it from the internet (0 peers). Then I launched the archive node at 11:04:21, with the following parameters:
--bootnodes enode://firstname.lastname@example.org:30303 --pruning archive --no-periodic-snapshot --cache-size-db 6000 --cache-size-state 1000
- --bootnodes: tells the archive node how to connect to the full node.
- --pruning archive: runs the node in archive mode.
- --no-periodic-snapshot: prevents the node from creating snapshots for other nodes to warp-sync. Better leave this out if you want to help the Ethereum network, but I found this setting to consume a lot of resources for archive nodes. Runs fine on regular full nodes.
- --cache-size-db and --cache-size-state: indicate the amount of RAM to use for the rocksdb cache and the state cache. These settings ran fine for me, but your mileage may vary.
Phase 3: Syncing the archive node
The archive node is now syncing, at more than 1000 blocks per second no less!
This won’t last. The first blocks are almost empty. Very quickly the archive node hits larger blocks and more complex transactions, including the spam blocks between approximately blocks 2,239,000 and 2,730,000.
Nonetheless, after 25 days, I finally reached the sync stage!
The archive node reached the same block height as the full node (7,366,747). Remember that the full node had been disconnected from the internet and hadn’t processed any new blocks since March 14th. I then reconnected the full node to the network so it could resynchronize and, 3 days later, the archive node was finally in sync with the main network:
Things were progressing very quickly until about block 4,000,000 and then slowed down considerably. The rate of block processing became relatively constant starting from block 4,750,000.
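To put the slowdown in perspective, the average rate over the isolated phase can be worked out from the figures above: 7,366,747 blocks in about 25 days. A minimal Python sketch follows; since "25 days" is a rounded duration, the result is approximate.

```python
SECONDS_PER_DAY = 86_400

def average_blocks_per_second(blocks, days):
    """Average block processing rate over a sync that took `days` days."""
    return blocks / (days * SECONDS_PER_DAY)

# Figures from the text: block height 7,366,747 reached after about 25 days.
avg = average_blocks_per_second(7_366_747, 25)
print(f"Average rate: {avg:.2f} blocks per second")
```

That works out to roughly 3.4 blocks per second on average, compared with more than 1000 blocks per second at the very start of the sync.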
If we graph the gas processed per second over the 28 days, we can see a steady decrease in performance. This suggests that, unless the Parity Ethereum client is improved, syncing an archive node with Parity will become increasingly difficult over time.
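For readers who want to build a similar graph from their own node, here is a minimal Python sketch of the calculation. It assumes you have already extracted samples of (timestamp in seconds, cumulative gas processed) from your logs; that input format and the function name are assumptions for illustration, not the method used for the graph in this article.

```python
def gas_per_second(samples):
    """Compute gas processed per second between consecutive samples.

    samples: list of (unix_time, cumulative_gas) tuples, sorted by time.
    Returns a list of (unix_time, rate) tuples, one per interval.
    """
    rates = []
    for (t0, g0), (t1, g1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (g1 - g0) / elapsed))
    return rates
```

Plotting the resulting rates against time with any charting tool should show whether performance stays flat or degrades as the chain grows.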
I placed all logs and more configuration details in a GitHub repo in case someone wants to review the data, compare the performance of their nodes or study potential improvements to Ethereum clients.
I believe this demonstrates without a doubt that an archive node is an expanded version of the full node, and that the latter has all the necessary information to secure the network.
Syncing an archive Ethereum node is certainly not a pleasant experience: it is painfully slow and, because it slows down over time, you may wonder if it will ever complete. In fact, if you do not have the right hardware or settings you might not be able to complete it at all. Make sure your storage gives you plenty of IOPS.
For the purpose of this demonstration, I used a low-end PC and consumer-grade storage to show that even if the process is slow, it is within the reach of the individual Ethereum user. Serious users will want to build redundancy, set up automated database backups and consider using high-performance storage.