Notes on syncing Ethereum nodes

Jeff Wu
Published in Aleph Zero
Mar 24, 2018

Running an Ethereum node is not necessary for participating in the network; most people probably transact using wallets like MyEtherWallet or Coinbase and simply look at Etherscan to see when their transactions have been confirmed. This is convenient and probably secure (I’m not a particularly paranoid person), but it’s not in the spirit of cryptocurrency.

Miners propose new blocks; the primary function of every other node on the network is to propagate the blockchain. Nodes do not propose blocks; they pass along pending transactions and new blocks to their peers. The more nodes there are, the more aggregate network bandwidth is available and the more copies of the blockchain exist. Although network size is generally measured by hash rate, the number of nodes and how actively they participate in the network are also good measures of decentralization and network health.

Types of Sync Methods

Before getting into what happens during a sync, it’s important to note that there are three ways to get in sync on Ethereum.

  • Full Sync: replicates the blockchain state from the genesis block. The node downloads every transaction ever made on the network and computes the state itself.
  • Fast Sync: downloads block headers and a recent state trie from peers. Only validates state transitions after it’s been synced to a certain block height.
  • Warp Sync (Parity only): downloads a set of files from Parity peers that contain all block headers and the state trie at a certain block height. Significantly reduces the number of individual requests that need to be made over the network. Similar to Fast Sync, these nodes only start to validate state transitions after a certain block height.
  • Light Clients: not covered in this post. Light clients do not store state and only fetch data from peers as needed.

What happens when you sync?

The goal of nodes is to store and propagate the blockchain. Therefore getting a valid copy is the most critical part of running a node. To do this on Ethereum you need to validate the blockchain headers and download a copy of the state trie.

Validating Blockchain Headers

An Ethereum block header contains the roots of three Merkle trees: the transactions trie, the receipts trie and the state trie. It also contains the previous block’s hash. I’m going to use “tree” and “trie” interchangeably because that’s what’s done in the Ethereum documentation; just note that they mean basically the same thing here. There’s other data in the header as well, but we will ignore it for the purposes of this post.
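To make that concrete, here is a minimal sketch of a header holding only the fields mentioned above. The `Header` class and `is_linked` helper are my own illustrative names, and SHA-256 over a plain concatenation stands in for the Keccak-256 hash of the RLP-encoded header that real clients use.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    """Toy block header with only the fields discussed in this post."""
    number: int
    parent_hash: bytes
    transactions_root: bytes
    receipts_root: bytes
    state_root: bytes

    def hash(self) -> bytes:
        # Assumption: SHA-256 over a simple concatenation, as a stand-in
        # for the real header hash.
        payload = (
            self.number.to_bytes(8, "big")
            + self.parent_hash
            + self.transactions_root
            + self.receipts_root
            + self.state_root
        )
        return hashlib.sha256(payload).digest()


def is_linked(parent: Header, child: Header) -> bool:
    """True if `child` directly extends `parent`."""
    return child.number == parent.number + 1 and child.parent_hash == parent.hash()


EMPTY = bytes(32)
genesis = Header(0, EMPTY, EMPTY, EMPTY, EMPTY)
block1 = Header(1, genesis.hash(), EMPTY, EMPTY, EMPTY)
forged = Header(1, EMPTY, EMPTY, EMPTY, EMPTY)  # does not point at genesis
```

Because each header commits to its parent’s hash, changing anything in an old header changes the hash of every header after it.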

A fully validating node downloads all the transactions from peers and runs each transaction, in the order declared in the block, against the previous state trie. That generates a receipt for each transaction; before the Byzantium fork each receipt recorded the intermediate state root, and since then it records a status code instead. Recomputing the tries this way proves to the node that all the Merkle roots in the header are valid. Finally, the node validates the Proof of Work on each block, checking that the miner who proposed it actually did an amount of computation proportionate to the block’s difficulty.
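The replay step can be sketched roughly as follows. This is a toy model, not how any client is implemented: the state is a plain dictionary of balances, transactions are simple transfers, and `state_digest` (a SHA-256 over the sorted balances) is a made-up stand-in for the state root.

```python
import hashlib
import json


def state_digest(balances: dict) -> str:
    # Stand-in for the state root: a hash over the sorted account balances.
    encoded = json.dumps(balances, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def apply_transaction(balances: dict, tx: dict) -> dict:
    """Return a new state with `tx` applied, or raise if it is invalid."""
    sender, recipient, value = tx["from"], tx["to"], tx["value"]
    if balances.get(sender, 0) < value:
        raise ValueError("insufficient balance")
    new_state = dict(balances)
    new_state[sender] = new_state.get(sender, 0) - value
    new_state[recipient] = new_state.get(recipient, 0) + value
    return new_state


def validate_block(prev_state: dict, transactions: list, claimed_digest: str) -> dict:
    """Replay transactions in order and compare the result with the header."""
    state = prev_state
    for tx in transactions:
        state = apply_transaction(state, tx)
    if state_digest(state) != claimed_digest:
        raise ValueError("state digest does not match header")
    return state


prev_state = {"alice": 10, "bob": 0}
transactions = [
    {"from": "alice", "to": "bob", "value": 4},
    {"from": "bob", "to": "carol", "value": 1},
]
expected_state = {"alice": 6, "bob": 3, "carol": 1}
new_state = validate_block(prev_state, transactions, state_digest(expected_state))
```

Order matters here: the second transfer only succeeds because the first one has already funded bob.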

A fast or warp sync node shortcuts this process by validating only the Proof of Work on each block. This allows the node to skip the expensive task of downloading and executing all the transactions for every block. As noted in this Geth pull request, a Geth fast sync doesn’t even validate the Proof of Work on every block. It validates it only once every N blocks, which gives a probabilistic guarantee that the headers are valid.
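Here is a rough sketch of what sampled header validation could look like. It is a toy under stated assumptions, not Geth’s code: `header_hash` is a cheap SHA-256 link hash, `pow_is_valid` stands in for the expensive Ethash check, and `verify_sampled` checks the parent link on every header but the Proof of Work on only one randomly chosen header per window of `every_n`.

```python
import hashlib
import random

GENESIS_PARENT = bytes(32)


def header_hash(header: dict) -> bytes:
    # Cheap identity hash used to link headers together.
    data = header["parent_hash"] + header["nonce"].to_bytes(8, "big")
    return hashlib.sha256(data).digest()


def pow_is_valid(header: dict) -> bool:
    # Stand-in for the expensive check: the seal hash, read as an integer,
    # must fall below 2**256 / difficulty.
    seal = hashlib.sha256(b"pow" + header_hash(header)).digest()
    return int.from_bytes(seal, "big") < 2**256 // header["difficulty"]


def build_chain(length: int, difficulty: int, honest: bool = True) -> list:
    """Build linked headers; a dishonest chain uses nonces that fail the check."""
    chain, parent = [], GENESIS_PARENT
    for _ in range(length):
        header = {"parent_hash": parent, "nonce": 0, "difficulty": difficulty}
        while pow_is_valid(header) != honest:
            header["nonce"] += 1
        chain.append(header)
        parent = header_hash(header)
    return chain


def verify_sampled(chain: list, every_n: int, rng: random.Random):
    """Return (ok, number of PoW checks performed)."""
    parent, pow_checks = GENESIS_PARENT, 0
    for start in range(0, len(chain), every_n):
        window = chain[start:start + every_n]
        pick = rng.randrange(len(window))
        for i, header in enumerate(window):
            if header["parent_hash"] != parent:
                return False, pow_checks
            if i == pick:
                pow_checks += 1
                if not pow_is_valid(header):
                    return False, pow_checks
            parent = header_hash(header)
    return True, pow_checks


rng = random.Random(0)
honest_result = verify_sampled(build_chain(20, 16), every_n=5, rng=rng)
forged_result = verify_sampled(build_chain(20, 16, honest=False), every_n=5, rng=rng)
```

With 20 headers and a window of 5, the honest chain passes after only 4 Proof of Work checks. An attacker who skips the work on a few headers has some chance of slipping through a single window, which is why the guarantee is probabilistic rather than absolute.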

How does this method protect the node against malicious copies of the blockchain? Well, remember that any PoW blockchain is only as secure as the amount of hash power invested in it. If an attacker wanted to send a series of corrupted block headers to a node, they would need to invest a similar amount of hash power in creating those headers as the actual network did. Not only is that prohibitively expensive, it gets more expensive the further back in time the corrupted block sits. Without knowing anything about the actual transactions or state for a given block, a node can infer that it was a valid block simply by validating that the Proof of Work corresponds to the difficulty at the time. This only works if you believe that replicating the hash rate of the network is prohibitively expensive.

Understanding the State Trie

The state trie is one of the major technical differences between Ethereum and the Bitcoin blockchain. Understanding it is important to understanding the distinct scaling challenges that Ethereum presents.

Before getting into the state trie, it’s important to realize that each Ethereum node creates and maintains a database. Parity and Geth both use embedded key-value databases derived from LevelDB, but any type of database could be used. All the data the node requires, including block headers, transaction tries, receipt tries, state tries and storage tries, is in this database. One of the advantages of running a node is being able to inspect this database and do all sorts of nifty analytics on it.

In Ethereum, all nodes (at minimum) store the current state for all accounts that have ever transacted on the network in the state trie. Each account has a balance and a counter called a nonce that prevents double spending (i.e. transaction 1 must happen before transaction 2, etc.). If an account is a smart contract, it also has the keys to database locations where the code and the contract storage are located.
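As a small illustration of what an account entry holds and what the nonce buys you, here is a sketch. The four field names follow the account fields described above; `apply_transfer` is a hypothetical helper of mine, and the empty-bytes defaults are a simplification (real accounts without code or storage hold fixed hashes of empty data instead).

```python
from dataclasses import dataclass


@dataclass
class Account:
    nonce: int = 0             # number of transactions sent from this account
    balance: int = 0           # account balance
    storage_root: bytes = b""  # contracts: database key of the storage trie root
    code_hash: bytes = b""     # contracts: database key of the contract code


def apply_transfer(sender: Account, recipient: Account, tx_nonce: int, value: int) -> None:
    """Move `value` from sender to recipient if the nonce is the next expected one."""
    if tx_nonce != sender.nonce:
        raise ValueError(f"expected nonce {sender.nonce}, got {tx_nonce}")
    if value > sender.balance:
        raise ValueError("insufficient balance")
    sender.balance -= value
    recipient.balance += value
    sender.nonce += 1


alice = Account(balance=10)
bob = Account()
apply_transfer(alice, bob, tx_nonce=0, value=4)

# Broadcasting the same signed transaction again reuses nonce 0 and is rejected.
try:
    apply_transfer(alice, bob, tx_nonce=0, value=4)
    replay_rejected = False
except ValueError:
    replay_rejected = True
```

After the first transfer alice’s nonce is 1, so the replayed transaction no longer matches and her balance is untouched.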

Syncing the State Trie

As you can imagine, as the network grows the state trie is getting quite large. Syncing the state trie is one of the major obstacles for people who want to run nodes. When you start a fast sync in Geth, it picks a block height to sync your state trie to; I’ll get into the reasoning for this in the next section. With the block height chosen, Geth starts sending out requests to peers for nodes in the state trie. At the time of writing, there are about 5.3 million block headers to download to get up to date and over 100 million state trie entries to download.

In the Ethereum P2P protocol, nodes regularly tell each other what block height they are at; however, they don’t communicate the size of their state trie. When syncing block headers you have a precise sense of the progress; syncing the state trie is a Sisyphean task of watching the number creep higher and higher with no sense of where the end is.
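A toy version of the download loop shows why there is no progress bar. Everything here is an assumption for illustration: the peer’s database is a dictionary from hash to node, `fetch` stands in for a network request, and nodes are hashed with SHA-256. The node only learns about children when it downloads their parent, so the total is unknown until the queue drains.

```python
import hashlib
from collections import deque


def node_hash(node: dict) -> str:
    payload = node["value"].encode() + "".join(node["children"]).encode()
    return hashlib.sha256(payload).hexdigest()


def build_peer_store():
    """Hypothetical peer database: a small tree stored as hash -> node."""
    store = {}

    def put(value, children=()):
        node = {"value": value, "children": list(children)}
        key = node_hash(node)
        store[key] = node
        return key

    leaves = [put(f"account-{i}") for i in range(4)]
    left = put("branch-left", leaves[:2])
    right = put("branch-right", leaves[2:])
    root = put("root", [left, right])
    return store, root


def sync_state(root: str, fetch):
    """Download every node reachable from `root`; return (nodes, progress log)."""
    local, progress = {}, []
    pending = deque([root])
    while pending:
        key = pending.popleft()
        if key in local:
            continue
        node = fetch(key)
        if node_hash(node) != key:
            raise ValueError("peer returned a node that does not match its hash")
        local[key] = node
        pending.extend(node["children"])
        # Only the nodes discovered so far are known, never the final total.
        progress.append((len(local), len(pending)))
    return local, progress


store, root = build_peer_store()
local, progress = sync_state(root, store.__getitem__)
```

The progress log counts downloaded nodes and known pending ones, but at no point before the end can the loop say what fraction of the trie it has.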

The goal of this post is not to get into the state sync too much except to point out that it’s a major obstacle to getting more nodes on the network. My main takeaway from trying to run a Geth node is that the implementation of state sync may not be optimal. This is something I want to dig into more for a future post.

Validating the State Trie

After you’ve downloaded the state trie for a given block height, the question is how do you know if that state trie is valid? The state trie is the accumulated changes from all the transactions that have ever been performed on the network and a fast synced node has not processed any of those transactions.

If you trust that the block headers are valid and you believe in the mathmagic of Merkle trees, then you can trust that your state trie is valid. However, since the state trie you’ve downloaded is by definition close to the most recent block, it’s susceptible to a Sybil attack. In this attack, an attacker has sent you an “alternative facts” state trie and the block headers to go along with it. If the attacker can keep your node isolated and convince you to transact with them then they can succeed in some sort of double spend attack. You can think of this like the attacker is able to force a node into believing in a forked version of the network.
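The “mathmagic” part can be sketched with a plain binary Merkle tree. This is a simplification: Ethereum’s state trie is a Merkle Patricia trie hashed with Keccak-256, while the hypothetical `merkle_root` below uses SHA-256 over a flat list of made-up account entries. The property is the same, though: any change to the data changes the root.

```python
import hashlib


def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> bytes:
    """Root of a binary Merkle tree over `leaves` (a non-empty list of bytes)."""
    level = [sha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


accounts = [b"alice:6", b"bob:3", b"carol:1"]
# Stands in for the state root found in a header you have already validated.
trusted_state_root = merkle_root(accounts)

honest_download = [b"alice:6", b"bob:3", b"carol:1"]
tampered_download = [b"alice:6", b"bob:3000", b"carol:1"]

honest_ok = merkle_root(honest_download) == trusted_state_root
tampered_ok = merkle_root(tampered_download) == trusted_state_root
```

The honest download reproduces the trusted root and the tampered one does not, so the check is only as good as the header the root came from.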

The simple way to counteract this is just to wait a number of blocks, similar to waiting on confirmations for a transaction. An attacker would have to maintain a hash rate similar to the network’s in order to keep the node isolated. If the attacker can’t keep up that hash rate, eventually the node will start to see invalid blocks and realize something is wrong. Again, the hash rate of the network is what keeps it secure.

Thoughts on Running a Node

So this is the part of the post where it gets a little speculative. I’m sure there are smart people out there who have done a lot more research and thinking about this than me, but I want to try and outline some thoughts I have and hopefully present more detail in future posts.

Data Availability

As the Ethereum blockchain grows (both in users and simply over time), the number of nodes storing every transaction may become a smaller and smaller proportion of potential peers. In one sense, this doesn’t matter: if you trust the hash rate, then as long as the block headers are valid you can trust the chain. However, there is something unsettling about this. Full nodes have a monopoly on this transaction data and could decide to simply stop sharing it.

Data availability also poses logistical problems. If I want to sync the state trie at some random block height, how many peers will be hosting that particular version? If I want to validate all the transactions for a particular account, how many peers will be hosting those transactions as well as the related transactions on its path back to the Merkle root?

These use cases aren’t purely theoretical. Analytical use cases for having historical transactions exist for law enforcement, valuation of tokens, academic research, etc.

Hardware Requirements

Currently nodes aren’t compensated for the value that they provide to the network. People run nodes for the value that they provide to the user and for the general benefit of the network. With proof of stake, this may change; however the hardware requirements for running a node still seem to be increasing. I’m interested in doing more benchmarks on the hardware requirements for running a node.

Parity vs Geth

One strength of the Ethereum ecosystem is multiple, parallel implementations of the protocol. Parity’s Warp Sync is an innovation that gets nodes in sync much faster by creating checkpoints every 30,000 blocks, from which nodes download block headers and state tries with far fewer network requests. Since validating block headers is what’s required to get a node up and running, it shouldn’t matter whether you download all the blocks at once or individually from multiple peers.
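As a back-of-the-envelope sketch, this is how far a checkpoint gets you at the chain height quoted earlier in the post. The function names are mine, and I’m assuming checkpoints fall on exact multiples of the interval and that the node syncs the remaining blocks normally after restoring one.

```python
SNAPSHOT_INTERVAL = 30_000  # checkpoint spacing quoted in this post


def latest_snapshot(chain_height: int, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Highest checkpoint block at or below `chain_height`."""
    return (chain_height // interval) * interval


def blocks_after_snapshot(chain_height: int, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Blocks still to sync after restoring the latest checkpoint."""
    return chain_height - latest_snapshot(chain_height, interval)


height = 5_300_000  # approximate height at the time of writing
snapshot_block = latest_snapshot(height)        # 5_280_000
remaining = blocks_after_snapshot(height)       # 20_000
```

Under those assumptions a warp synced node is never more than one interval behind the head of the chain once the checkpoint is restored.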

I was able to get a node up and running relatively easily with Parity using warp sync, whereas with Geth I struggled with out-of-memory issues and database bloat. I’m going to dive into those issues in more depth in a future post.

Thanks for taking the time to read this post. If you enjoyed it, give me a clap! I’m writing about Ethereum because I believe cryptocurrency can make the world a better place. Please comment if you disagree with anything or if I made any mistakes; I’d love to hear from you. For full disclosure, I hold Ethereum.

Jeff Wu
Aleph Zero

Co-Founder of Notional Finance, crypto fanatic, recovering data scientist, and renewable energy enthusiast.